I don't know when I first heard the plans for Wikidata, but one year ago I proposed a more lightweight alternative approach on the [[m:talk:Wikidata]] page. Then nothing happened, and today I implemented it. It's 98 lines of Perl that process an XML dump and extract template call parameter values. The source code is available at http://meta.wikimedia.org/wiki/User:LA2/Extraktor
The SQL dump of the templatelinks table already tells us which pages call which templates. This script goes beyond that to get information about each individual call parameter.
The output format is a very simple awk-friendly text file. For example, the German Wikipedia page [[de:Anthony Hope]] contains the two template calls
{{PND|11901842X}}
{{Personendaten|
NAME=Hope, Anthony
|ALTERNATIVNAMEN=Hawkins, Anthony Hope
|KURZBESCHREIBUNG=englischer [[Rechtsanwalt]] und [[Autor]]
|GEBURTSDATUM=[[9. Februar]] [[1863]]
|GEBURTSORT=[[London]]
|STERBEDATUM=[[8. Juli]] [[1933]]
|STERBEORT=
}}
For this page, the output contains:
PND|Anthony Hope|1|1|11901842X
Personendaten|Anthony Hope|2|NAME|Hope, Anthony
Personendaten|Anthony Hope|2|ALTERNATIVNAMEN|Hawkins, Anthony Hope
Personendaten|Anthony Hope|2|KURZBESCHREIBUNG|englischer [[Rechtsanwalt]] und [[Autor]]
Personendaten|Anthony Hope|2|GEBURTSDATUM|[[9. Februar]] [[1863]]
Personendaten|Anthony Hope|2|GEBURTSORT|[[London]]
Personendaten|Anthony Hope|2|STERBEDATUM|[[8. Juli]] [[1933]]
Personendaten|Anthony Hope|2|STERBEORT|
As you can see, the |-separated fields are:
1. Name of the template called
2. Name of the page that called the template
3. Sequence number of this call within the page
4. Name or position number of the parameter
5. Value of the parameter
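To make the record format concrete, here is a minimal Python sketch of how such records could be produced from a page's wikitext. It is not the Perl script itself: the function names are made up for illustration, only outermost {{...}} calls are reported, and the underscore normalization of template names is inferred from output such as Ort_Schweiz. How the real script treats nested calls or {{{...}}} placeholders is not stated here, so that part is an assumption.

```python
def split_top_level(text, sep="|"):
    """Split text on sep, ignoring separators nested inside [[...]] or {{...}}."""
    parts, depth, start, i = [], 0, 0, 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in ("[[", "{{"):
            depth += 1
            i += 2
        elif pair in ("]]", "}}"):
            depth = max(depth - 1, 0)
            i += 2
        elif text[i] == sep and depth == 0:
            parts.append(text[start:i])
            start = i + 1
            i += 1
        else:
            i += 1
    parts.append(text[start:])
    return parts


def find_calls(wikitext):
    """Yield the inner text of each outermost {{...}} call, in page order."""
    depth, start, i = 0, 0, 0
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == "{{":
            if depth == 0:
                start = i + 2
            depth += 1
            i += 2
        elif pair == "}}" and depth > 0:
            depth -= 1
            if depth == 0:
                yield wikitext[start:i]
            i += 2
        else:
            i += 1


def extract_records(page, wikitext):
    """Return 'template|page|seq|param|value' lines for one page."""
    records = []
    for seq, call in enumerate(find_calls(wikitext), start=1):
        name, *params = split_top_level(call)
        # Assumption: template names are written with underscores, as in the output shown.
        template = " ".join(name.split()).replace(" ", "_")
        position = 0
        for param in params:
            key, eq, value = param.partition("=")
            if not eq or "[[" in key or "{{" in key:
                # No usable "name=" prefix: treat as a positional parameter.
                position += 1
                key, value = str(position), param
            # Collapse whitespace so each record stays on a single line.
            value = " ".join(value.split())
            records.append("|".join([template, page, str(seq), key.strip(), value]))
    return records


example = "{{PND|11901842X}}\n{{Personendaten|\nNAME=Hope, Anthony\n|STERBEORT=\n}}"
for record in extract_records("Anthony Hope", example):
    print(record)
```

Splitting only at the top level matters because a value such as [[London|the city]] contains a | of its own.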
The output for the entire German Wikipedia dump is
bunzip2 <dewiki-20060803-pages-articles.xml.bz2 | perl extraktor.pl >de.params
du -sm de.params
123 megabytes. With some simple awk, I get the following statistics: There are
awk '-F|' '{print $2,$3}' de.params | sort -u | wc -l
790,985 template calls using a total of
wc -l de.params
2,076,178 parameters (on average 2.62 parameters per call) from
awk '-F|' '{print $2}' de.params | sort -u | wc -l
397,929 different pages to
awk '-F|' '{print $1}' de.params | sort -u | wc -l
13,295 different templates. The most commonly supplied parameter names over all templates are
awk '-F|' '{print $4}' de.params | sort | uniq -c | sort -nr
NAME (113038 occurrences), ALTERNATIVNAMEN (101799), KURZBESCHREIBUNG (101723), GEBURTSORT (101706), GEBURTSDATUM (101704), STERBEDATUM (101663), STERBEORT (101649), ID (10061), ZEIT (6255), VORGÄNGER (6210), NACHFOLGER (6210), AMT (6199), EINWOHNER (5942), FLÄCHE (5868), WEBSITE (5680), STAND_EINWOHNER (5619), Name (5307), PJ (5242), PL (5240), LEN (5224), DS (5219), OS (5214), OT (5152), MUSIK (5137), DT (5117), TITEL (4750), INHALT (4739), PRO (4658), REG (4639), DRB (4627), AF (4568), KAMERA (4557), SCHNITT (4516), Bild (4038), BILD (3562), PLZ (3344), HÖHE (3265), GEMEINDEART (3180), BREITENGRAD (3074), LÄNGENGRAD (3061), KANTON (3013), and NAME_ORT (3005).
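For anyone who prefers a single pass over the file to several awk pipelines, the same counts could be gathered roughly as follows. This is only a sketch: the function names and the dictionary keys are my own, and it assumes one record per line in the five-field format described above. It splits on at most four separators, so a value that itself contains | (a piped link, say) stays in one piece.

```python
from collections import Counter


def parse_record(line):
    """Split one output line into its five fields; the value may contain '|'."""
    template, page, seq, param, value = line.rstrip("\n").split("|", 4)
    return template, page, int(seq), param, value


def summarize(lines, top=10):
    """Count calls, parameters, pages and templates in one pass."""
    calls, pages, templates = set(), set(), set()
    param_names = Counter()
    total = 0
    for line in lines:
        template, page, seq, param, _value = parse_record(line)
        calls.add((page, seq))      # one call = one (page, sequence number) pair
        pages.add(page)
        templates.add(template)
        param_names[param] += 1
        total += 1
    return {
        "calls": len(calls),
        "parameters": total,
        "average": total / len(calls) if calls else 0.0,
        "pages": len(pages),
        "templates": len(templates),
        "top_params": param_names.most_common(top),
    }


# The records for [[de:Anthony Hope]] shown earlier.
sample = [
    "PND|Anthony Hope|1|1|11901842X",
    "Personendaten|Anthony Hope|2|NAME|Hope, Anthony",
    "Personendaten|Anthony Hope|2|ALTERNATIVNAMEN|Hawkins, Anthony Hope",
    "Personendaten|Anthony Hope|2|KURZBESCHREIBUNG|englischer [[Rechtsanwalt]] und [[Autor]]",
    "Personendaten|Anthony Hope|2|GEBURTSDATUM|[[9. Februar]] [[1863]]",
    "Personendaten|Anthony Hope|2|GEBURTSORT|[[London]]",
    "Personendaten|Anthony Hope|2|STERBEDATUM|[[8. Juli]] [[1933]]",
    "Personendaten|Anthony Hope|2|STERBEORT|",
]
print(summarize(sample))
```

For the whole dump one would pass an open file handle for de.params instead of the sample list.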
Yes, the bad taste of all-caps parameter names is a disease of the German Wikipedia since the early days of the Personendaten project. Personendaten is also the template that is called from 100,000 different pages. Let's see which templates use the parameter named GEMEINDEART (kind of municipality):
awk '-F|' '$4 == "GEMEINDEART" {print $1}' de.params | sort | uniq -c | sort -nr
Ort_Schweiz (2738 calls), Ortschaft_Schweiz (196), Infobox_Slowakische_Gemeinde-K (121), Infobox_Slowakische_Gemeinde (111), Ort_Liechtenstein (11), Infobox_Schweizer_Gemeinden (2), Infobox_Deutsche_Städte (1).
Let's see which kinds of municipalities there are in Slovakia:
awk '-F|' '$1 == "Infobox_Slowakische_Gemeinde" && $4 == "GEMEINDEART" {print $5}' de.params | sort | uniq -c | sort -nr
Stadt (74), Stadtteil (21), Gemeinde (16).
And in Switzerland:
awk '-F|' '$1 == "Ort_Schweiz" && $4 == "GEMEINDEART" {print $5}' de.params | sort | uniq -c | sort -nr
Gemeinde (2591), Stadt (126), Gemeinden (12).
Perhaps "Gemeinden" (a plural) is an error that should be fixed? Let's see which twelve pages use this value for this parameter to this template:
awk '-F|' '$1 == "Ort_Schweiz" && $4 == "GEMEINDEART" && $5 == "Gemeinden" {print $2}' de.params
Benken ZH, Flaach, Adlikon bei Andelfingen, Andelfingen ZH, Berg am Irchel, Buch am Irchel, Dachsen, Dorf ZH, Feuerthalen, Humlikon, Flurlingen, Henggart.
Hmm... It turns out that GEMEINDEART is not used in this infobox template. That's odd. I'll leave it there.
I hope you get the point. Of course you can use your favorite SQL database instead of awk. If you want speed, be sure to create indexes for every column.
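As one possible way to do that, here is a sketch using SQLite through Python's standard sqlite3 module. The table name params, the column names and the helper functions are assumptions chosen for this example, not part of the script. Indexes are created after the bulk insert, one per column as suggested above.

```python
import sqlite3


def load(lines, db_path=":memory:"):
    """Load five-field records into a table and index every column."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE params "
        "(template TEXT, page TEXT, seq INTEGER, param TEXT, value TEXT)"
    )
    # Split on at most four separators so '|' inside a value is preserved.
    rows = (line.rstrip("\n").split("|", 4) for line in lines)
    conn.executemany("INSERT INTO params VALUES (?, ?, ?, ?, ?)", rows)
    for column in ("template", "page", "seq", "param", "value"):
        conn.execute(f"CREATE INDEX idx_params_{column} ON params ({column})")
    conn.commit()
    return conn


def templates_using(conn, param):
    """Which templates are called with this parameter, most frequent first."""
    return conn.execute(
        "SELECT template, COUNT(*) AS n FROM params "
        "WHERE param = ? GROUP BY template ORDER BY n DESC, template",
        (param,),
    ).fetchall()


conn = load([
    "PND|Anthony Hope|1|1|11901842X",
    "Personendaten|Anthony Hope|2|NAME|Hope, Anthony",
    "Personendaten|Anthony Hope|2|GEBURTSORT|[[London]]",
])
print(templates_using(conn, "NAME"))
```

Calling templates_using(conn, "GEMEINDEART") on the full data would correspond to the first awk query on that parameter above.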
Imagine if there was a templateparameter table supported by Mediawiki, then we could do this in real time. I'm wysiwyg filling out an infobox template here. Which parameter names should I supply? Which values should I typically use?
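Until such a table exists, the two questions could at least be answered offline from the extracted file. The sketch below is a rough illustration only: build_hints and suggest are hypothetical names, and the sample records are made up to show the shape of the result, not taken from the dump.

```python
from collections import Counter, defaultdict


def build_hints(lines):
    """Map template -> parameter name -> Counter of the values seen."""
    hints = defaultdict(lambda: defaultdict(Counter))
    for line in lines:
        template, _page, _seq, param, value = line.rstrip("\n").split("|", 4)
        hints[template][param][value] += 1
    return hints


def suggest(hints, template, top_values=3):
    """List (parameter, times supplied, most common values) for a template."""
    rows = []
    for param, values in hints.get(template, {}).items():
        common = [value for value, _count in values.most_common(top_values)]
        rows.append((param, sum(values.values()), common))
    # Most frequently supplied parameters first, ties broken by name.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


# Made-up sample records, for illustration only.
sample_records = [
    "Ort_Schweiz|Beispiel A|1|GEMEINDEART|Gemeinde",
    "Ort_Schweiz|Beispiel A|1|KANTON|ZH",
    "Ort_Schweiz|Beispiel B|1|GEMEINDEART|Gemeinde",
    "Ort_Schweiz|Beispiel C|1|GEMEINDEART|Stadt",
]
for row in suggest(build_hints(sample_records), "Ort_Schweiz"):
    print(row)
```

An editing tool could show the parameter names as form fields and the common values as a drop-down or autocomplete list.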
Lars Aronsson wrote:
<snip lots of interesting stuff>
> Imagine if there was a templateparameter table supported by Mediawiki, then we could do this in real time.
I remember suggesting that a while ago on this list, so de.wikipedia could have automatic lists of people sorted by (last) name, though I'm sure that would only be one application.
> I'm wysiwyg filling out an infobox template here. Which parameter names should I supply? Which values should I typically use?
Actually, we could use a variant of my reference generator [1] to insert new templates, with required and optional parameters and per-parameter hints. I for one always feel creeped out by the species templates on de ;-) and would use such a tool. We'd only have to glue it to the edit page via JavaScript or something.
Magnus
wikitech-l@lists.wikimedia.org