Template parameter statistics - Wikitech-l

30 Aug 2006


I don't know when I first heard of the plans for Wikidata, but one
year ago I proposed a more lightweight alternative approach on
the [[m:talk:Wikidata]] page. Then nothing happened, and today I
implemented it.  It's 98 lines of Perl that process an XML dump
and extract template call parameter values.  The source code is
available from http://meta.wikimedia.org/wiki/User:LA2/Extraktor
The SQL dump of the templatelinks table already tells us which 
pages call which templates.  This script goes beyond that to get 
information about each individual call parameter.
The output format is a very simple awk-friendly text file.  For 
example, the German Wikipedia page [[de:Anthony Hope]] contains 
the two template calls
{{PND|11901842X}}
{{Personendaten|
 NAME=Hope, Anthony
|ALTERNATIVNAMEN=Hawkins, Anthony Hope
|KURZBESCHREIBUNG=englischer [[Rechtsanwalt]] und [[Autor]]
|GEBURTSDATUM=[[9. Februar]] [[1863]]
|GEBURTSORT=[[London]]
|STERBEDATUM=[[8. Juli]] [[1933]]
|STERBEORT=
}}
For this page, the output contains:
PND|Anthony Hope|1|1|11901842X
Personendaten|Anthony Hope|2|NAME|Hope, Anthony
Personendaten|Anthony Hope|2|ALTERNATIVNAMEN|Hawkins, Anthony Hope
Personendaten|Anthony Hope|2|KURZBESCHREIBUNG|englischer [[Rechtsanwalt]] und [[Autor]]
Personendaten|Anthony Hope|2|GEBURTSDATUM|[[9. Februar]] [[1863]]
Personendaten|Anthony Hope|2|GEBURTSORT|[[London]]
Personendaten|Anthony Hope|2|STERBEDATUM|[[8. Juli]] [[1933]]
Personendaten|Anthony Hope|2|STERBEORT|
As you can see, the |-separated fields are:
1. Name of the template called
2. Name of the page that called the template
3. Sequence number of this call within the page
4. Name or position number of the parameter
5. Value of the parameter
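One caveat when reading this format with awk: a value such as [[Rechtsanwalt|Anwalt]] contains a | of its own, so $5 alone may hold only part of it. Whether extraktor.pl escapes such characters is not stated here; the sketch below assumes it does not and treats everything after the fourth separator as the value. The file name sample.params, the function name full_value, and the sample lines are made up for illustration.

```shell
# Hypothetical sample; the second value contains a literal "|".
cat > sample.params <<'EOF'
PND|Anthony Hope|1|1|11901842X
Personendaten|Example Page|2|KURZBESCHREIBUNG|[[Rechtsanwalt|Anwalt]] und [[Autor]]
EOF

# Print "parameter=value", rejoining fields 5..NF with "|" so that a
# value split by awk is put back together.
full_value() {
    awk -F'|' '{
        value = $5
        for (i = 6; i <= NF; i++) value = value "|" $i
        print $4 "=" value
    }' "$1"
}

full_value sample.params
```

This relies on the first four fields never containing a |, which holds for page titles (MediaWiki does not allow | in titles) but is only an assumption for parameter names.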
The output for the entire German Wikipedia dump is
bunzip2 <dewiki-20060803-pages-articles.xml.bz2 |
    perl extraktor.pl >de.params
du -sm de.params
123 megabytes.  With some simple awk, I get the following 
statistics:  There are
awk '-F|' '{print $2,$3}' de.params | sort -u | wc -l
790,985 template calls using a total of
wc -l de.params
2,076,178 parameters (on average 2.62 parameters per call) from
awk '-F|' '{print $2}' de.params | sort -u | wc -l
397,929 different pages to
awk '-F|' '{print $1}' de.params | sort -u | wc -l
13,295 different templates.  The most commonly supplied parameter 
names over all templates are
awk '-F|' '{print $4}' de.params | sort | uniq -c | sort -nr
NAME (113038 occurrences), ALTERNATIVNAMEN (101799), 
KURZBESCHREIBUNG (101723), GEBURTSORT (101706), GEBURTSDATUM 
(101704), STERBEDATUM (101663), STERBEORT (101649), ID (10061), 
ZEIT (6255), VORGÄNGER (6210), NACHFOLGER (6210), AMT (6199), 
EINWOHNER (5942), FLÄCHE (5868), WEBSITE (5680), STAND_EINWOHNER 
(5619), Name (5307), PJ (5242), PL (5240), LEN (5224), DS (5219), 
OS (5214), OT (5152), MUSIK (5137), DT (5117), TITEL (4750), 
INHALT (4739), PRO (4658), REG (4639), DRB (4627), AF (4568), 
KAMERA (4557), SCHNITT (4516), Bild (4038), BILD (3562), PLZ 
(3344), HÖHE (3265), GEMEINDEART (3180), BREITENGRAD (3074), 
LÄNGENGRAD (3061), KANTON (3013), and NAME_ORT (3005).
Yes, the bad taste of all-caps parameter names is a disease of the 
German Wikipedia since the early days of the Personendaten 
project.  Personendaten is also the one template that is called 
from about 100,000 different pages.  Let's see which templates use 
the parameter named GEMEINDEART (kind of municipality):
awk '-F|' '$4 == "GEMEINDEART" {print $1}' de.params |
    sort | uniq -c | sort -nr
Ort_Schweiz (2738 calls), Ortschaft_Schweiz (196), 
Infobox_Slowakische_Gemeinde-K (121), Infobox_Slowakische_Gemeinde 
(111), Ort_Liechtenstein (11), Infobox_Schweizer_Gemeinden (2), 
Infobox_Deutsche_Städte (1).
Let's see which kinds of municipalities there are in Slovakia:
awk '-F|' '$1 == "Infobox_Slowakische_Gemeinde" &&
           $4 == "GEMEINDEART" {print $5}' de.params |
    sort | uniq -c | sort -nr
Stadt (74), Stadtteil (21), Gemeinde (16).
And in Switzerland:
awk '-F|' '$1 == "Ort_Schweiz" &&
           $4 == "GEMEINDEART" {print $5}' de.params |
    sort | uniq -c | sort -nr
Gemeinde (2591), Stadt (126), Gemeinden (12).
Perhaps "Gemeinden" (a plural) is an error that should be fixed?  
Let's see which twelve pages use this value for this parameter to 
this template:
awk '-F|' '$1 == "Ort_Schweiz" &&
           $4 == "GEMEINDEART" &&
           $5 == "Gemeinden" {print $2}' de.params
Benken ZH, Flaach, Adlikon bei Andelfingen, Andelfingen ZH,
Berg am Irchel, Buch am Irchel, Dachsen, Dorf ZH, Feuerthalen,
Humlikon, Flurlingen, Henggart.
Hmm... It turns out that GEMEINDEART is not used in this infobox 
template.  That's odd.  I'll leave it there.
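The GEMEINDEART queries above all follow one pattern: filter on template and parameter, then either count the values or list the pages for one value. A minimal sketch of that pattern as two shell functions follows. The names value_counts and pages_with, the file sample.params, and the sample rows are hypothetical and not part of extraktor.pl.

```shell
# Made-up rows in the five-field format (not real dump data).
cat > sample.params <<'EOF'
Ort_Schweiz|Page A|1|GEMEINDEART|Gemeinde
Ort_Schweiz|Page B|1|GEMEINDEART|Gemeinde
Ort_Schweiz|Page C|1|GEMEINDEART|Gemeinden
Ort_Schweiz|Page C|1|KANTON|ZH
Ort_Liechtenstein|Page D|1|GEMEINDEART|Gemeinde
EOF

# value_counts FILE TEMPLATE PARAMETER
# Count each distinct value of one parameter, most frequent first.
value_counts() {
    awk -F'|' -v t="$2" -v p="$3" '$1 == t && $4 == p {print $5}' "$1" |
        sort | uniq -c | sort -nr
}

# pages_with FILE TEMPLATE PARAMETER VALUE
# List the pages that pass exactly this value.
pages_with() {
    awk -F'|' -v t="$2" -v p="$3" -v v="$4" \
        '$1 == t && $4 == p && $5 == v {print $2}' "$1"
}

value_counts sample.params Ort_Schweiz GEMEINDEART
pages_with sample.params Ort_Schweiz GEMEINDEART Gemeinden
```

Passing the names with -v keeps them out of the awk program text, so no quoting inside the program is needed. Note that awk interprets backslash escapes in -v values, which matters only for names containing a backslash.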
I hope you get the point.  Of course you can use your favorite 
SQL database instead of awk.  If you want speed, be sure to create 
indexes for every column.
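Staying with awk, the summary counts near the top of this message each take a separate pass over the file plus a sort. A single pass can produce them together, at the cost of holding every distinct page and call key in memory. The sketch below assumes the five-field format described above; the function name param_stats, the file sample.params, and the sample rows are made up for illustration.

```shell
# Made-up rows in the five-field format (not real dump data).
cat > sample.params <<'EOF'
PND|Anthony Hope|1|1|11901842X
Personendaten|Anthony Hope|2|NAME|Hope, Anthony
Personendaten|Anthony Hope|2|GEBURTSORT|[[London]]
PND|Example Page|1|1|000000000
EOF

# param_stats FILE
# One pass: distinct calls, parameter lines, distinct pages,
# distinct templates, and average parameters per call.
param_stats() {
    awk -F'|' '
        !call[$2 FS $3]++ { calls++ }      # page plus sequence number identifies a call
        !page[$2]++       { pages++ }      # first time this page is seen
        !tmpl[$1]++       { templates++ }  # first time this template is seen
        END {
            printf "calls=%d params=%d pages=%d templates=%d avg=%.2f\n",
                   calls, NR, pages, templates, (calls ? NR / calls : 0)
        }' "$1"
}

param_stats sample.params
```

For the sample this prints calls=3 params=4 pages=2 templates=2 avg=1.33. On a file the size of de.params the associative arrays grow large, so the sort -u pipelines remain the safer choice on a machine with little memory.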
Imagine if there were a templateparameter table supported by 
MediaWiki; then we could do this in real time.  Say I'm filling 
out an infobox template in a WYSIWYG editor.  Which parameter 
names should I supply?  Which values should I typically use?
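Until such a table exists, the first of those questions can be approximated offline from the extracted file. A rough sketch, assuming the five-field format described above: the function name suggest_params, the file sample.params, and the sample rows are hypothetical.

```shell
# Made-up rows in the five-field format (not real dump data).
cat > sample.params <<'EOF'
Ort_Schweiz|Page A|1|NAME_ORT|A
Ort_Schweiz|Page A|1|KANTON|ZH
Ort_Schweiz|Page B|1|NAME_ORT|B
Ort_Schweiz|Page B|1|KANTON|ZH
Ort_Schweiz|Page C|1|NAME_ORT|C
PND|Page C|2|1|000000000
EOF

# suggest_params FILE TEMPLATE
# Parameter names supplied to one template, most frequently used first.
suggest_params() {
    awk -F'|' -v t="$2" '$1 == t {print $4}' "$1" |
        sort | uniq -c | sort -nr
}

suggest_params sample.params Ort_Schweiz
```

The second question, which values are typical, is the same query with $5 printed for one chosen parameter, as in the GEMEINDEART examples earlier in this message.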
-- 
  Lars Aronsson (lars@aronsson.se)
  Aronsson Datateknik - http://aronsson.se