I don't know when I first heard of the plans for Wikidata, but one
year ago I proposed a more lightweight alternative approach on
the [[m:talk:Wikidata]] page. Then nothing happened, and today I
implemented it. It's 98 lines of Perl that process an XML dump
and extract template call parameter values. The source code is
available from
http://meta.wikimedia.org/wiki/User:LA2/Extraktor
The SQL dump of the templatelinks table already tells us which
pages call which templates. This script goes beyond that to get
information about each individual call parameter.
The output format is a very simple awk-friendly text file. For
example, the German Wikipedia page [[de:Anthony Hope]] contains
the two template calls
{{PND|11901842X}}
{{Personendaten|
NAME=Hope, Anthony
|ALTERNATIVNAMEN=Hawkins, Anthony Hope
|KURZBESCHREIBUNG=englischer [[Rechtsanwalt]] und [[Autor]]
|GEBURTSDATUM=[[9. Februar]] [[1863]]
|GEBURTSORT=[[London]]
|STERBEDATUM=[[8. Juli]] [[1933]]
|STERBEORT=
}}
For this page, the output contains:
PND|Anthony Hope|1|1|11901842X
Personendaten|Anthony Hope|2|NAME|Hope, Anthony
Personendaten|Anthony Hope|2|ALTERNATIVNAMEN|Hawkins, Anthony Hope
Personendaten|Anthony Hope|2|KURZBESCHREIBUNG|englischer [[Rechtsanwalt]] und [[Autor]]
Personendaten|Anthony Hope|2|GEBURTSDATUM|[[9. Februar]] [[1863]]
Personendaten|Anthony Hope|2|GEBURTSORT|[[London]]
Personendaten|Anthony Hope|2|STERBEDATUM|[[8. Juli]] [[1933]]
Personendaten|Anthony Hope|2|STERBEORT|
As you can see, the |-separated fields are:
1. Name of the template called
2. Name of the page that called the template
3. Sequence number of this call within the page
4. Name or position number of the parameter
5. Value of the parameter
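To make the mapping from template calls to these five fields concrete, here is a minimal Python sketch. It is not the Perl script linked above, and it makes simplifying assumptions of its own: it only handles flat calls (no templates nested inside templates), it treats pipes inside [[...]] links as part of the value, and it replaces spaces in template names with underscores to match the output shown. The names `split_top_level` and `extract` are made up for this illustration.

```python
import re

# Assumption: only flat calls are matched, i.e. no {{...}} nested in the body.
FLAT_CALL = re.compile(r"\{\{([^{}]*)\}\}")


def split_top_level(body):
    """Split a template body on '|' characters that are not inside [[...]]."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        two = body[i:i + 2]
        if two == "[[":
            depth += 1
            current.append(two)
            i += 2
        elif two == "]]" and depth > 0:
            depth -= 1
            current.append(two)
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))
    return parts


def extract(page_title, wikitext):
    """Return one 'template|page|seq|param|value' line per call parameter."""
    rows = []
    for seq, match in enumerate(FLAT_CALL.finditer(wikitext), start=1):
        name, *params = split_top_level(match.group(1))
        template = name.strip().replace(" ", "_")
        position = 0
        for param in params:
            key, sep, value = param.partition("=")
            if sep:
                # Named parameter: NAME=value
                key, value = key.strip(), value.strip()
            else:
                # Positional parameter: numbered 1, 2, 3, ...
                position += 1
                key, value = str(position), param.strip()
            rows.append("|".join([template, page_title, str(seq), key, value]))
    return rows


for row in extract("Anthony Hope", "{{PND|11901842X}}"):
    print(row)  # PND|Anthony Hope|1|1|11901842X
```

Note that a value such as [[a|b]] keeps its pipe, so anything reading the output should split each line into at most five fields.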
The output for the entire German Wikipedia dump is
bunzip2 <dewiki-20060803-pages-articles.xml.bz2 |
perl extraktor.pl >de.params
du -sm de.params
123 megabytes. With some simple awk, I get the following
statistics: There are
awk '-F|' '{print $2,$3}' de.params | sort -u | wc -l
790,985 template calls using a total of
wc -l de.params
2,076,178 parameters (on average 2.62 parameters per call) from
awk '-F|' '{print $2}' de.params | sort -u | wc -l
397,929 different pages to
awk '-F|' '{print $1}' de.params | sort -u | wc -l
13,295 different templates. The most commonly supplied parameter
names over all templates are
awk '-F|' '{print $4}' de.params | sort | uniq -c | sort -nr
NAME (113038 occurrences), ALTERNATIVNAMEN (101799),
KURZBESCHREIBUNG (101723), GEBURTSORT (101706), GEBURTSDATUM
(101704), STERBEDATUM (101663), STERBEORT (101649), ID (10061),
ZEIT (6255), VORGÄNGER (6210), NACHFOLGER (6210), AMT (6199),
EINWOHNER (5942), FLÄCHE (5868), WEBSITE (5680), STAND_EINWOHNER
(5619), Name (5307), PJ (5242), PL (5240), LEN (5224), DS (5219),
OS (5214), OT (5152), MUSIK (5137), DT (5117), TITEL (4750),
INHALT (4739), PRO (4658), REG (4639), DRB (4627), AF (4568),
KAMERA (4557), SCHNITT (4516), Bild (4038), BILD (3562), PLZ
(3344), HÖHE (3265), GEMEINDEART (3180), BREITENGRAD (3074),
LÄNGENGRAD (3061), KANTON (3013), and NAME_ORT (3005).
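The same counts can be gathered in a single pass instead of one sort pipeline per number. The following Python sketch assumes the five-field format described above; the function name `summarize` and the keys of the returned dictionary are made up for this illustration. Each line is split at most four times, so a value that itself contains a | stays in one piece, and lines that do not have five fields are skipped.

```python
from collections import Counter


def summarize(lines, top=5):
    """Count calls, parameters, pages and templates in one pass."""
    calls, pages, templates = set(), set(), set()
    param_names = Counter()
    total = 0
    for line in lines:
        fields = line.rstrip("\n").split("|", 4)
        if len(fields) != 5:
            continue  # not a complete record
        template, page, seq, param, _value = fields
        total += 1
        calls.add((page, seq))  # a call is identified by page + sequence number
        pages.add(page)
        templates.add(template)
        param_names[param] += 1
    return {
        "calls": len(calls),
        "parameters": total,
        "pages": len(pages),
        "templates": len(templates),
        "params_per_call": total / len(calls) if calls else 0.0,
        "top_names": param_names.most_common(top),
    }


# Usage (assumes de.params exists in the current directory):
# with open("de.params", encoding="utf-8") as f:
#     print(summarize(f))
```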
Yes, the bad taste of all-caps parameter names is a disease of the
German Wikipedia since the early days of the Personendaten
project. Personendaten is also the template that is called from
100,000 different pages. Let's see which templates use the
parameter named GEMEINDEART (kind of municipality):
awk '-F|' '$4 == "GEMEINDEART" {print $1}' de.params |
sort | uniq -c | sort -nr
Ort_Schweiz (2738 calls), Ortschaft_Schweiz (196),
Infobox_Slowakische_Gemeinde-K (121), Infobox_Slowakische_Gemeinde
(111), Ort_Liechtenstein (11), Infobox_Schweizer_Gemeinden (2),
Infobox_Deutsche_Städte (1).
Let's see which kinds of municipalities there are in Slovakia:
awk '-F|' '$1 == "Infobox_Slowakische_Gemeinde" &&
$4 == "GEMEINDEART" {print $5}' de.params |
sort | uniq -c | sort -nr
Stadt (74), Stadtteil (21), Gemeinde (16).
And in Switzerland:
awk '-F|' '$1 == "Ort_Schweiz" &&
$4 == "GEMEINDEART" {print $5}' de.params |
sort | uniq -c | sort -nr
Gemeinde (2591), Stadt (126), Gemeinden (12).
Perhaps "Gemeinden" (a plural) is an error that should be fixed?
Let's see which twelve pages use this value for this parameter to
this template:
awk '-F|' '$1 == "Ort_Schweiz" &&
$4 == "GEMEINDEART" &&
$5 == "Gemeinden" {print $2}' de.params
Benken ZH, Flaach, Adlikon bei Andelfingen, Andelfingen ZH,
Berg am Irchel, Buch am Irchel, Dachsen, Dorf ZH, Feuerthalen,
Humlikon, Flurlingen, Henggart.
Hmm... It turns out that the Ort_Schweiz infobox template does
not use the GEMEINDEART parameter at all. That's odd. I'll leave
it there.
I hope you get the point. Of course you can use your favorite
SQL database instead of awk. If you want speed, be sure to create
indexes for every column.
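As one way of doing that, here is a minimal sketch using SQLite through Python's standard sqlite3 module. The table name `params`, the column names, the index names, and the helper functions are all assumptions made for this illustration, not part of the Perl script. It loads the five-field lines, creates one index per column as suggested above, and repeats the GEMEINDEART query.

```python
import sqlite3

COLUMNS = ("template", "page", "seq", "param", "value")


def load_params(lines, db_path=":memory:"):
    """Load 'template|page|seq|param|value' lines into an indexed table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE params "
        "(template TEXT, page TEXT, seq INTEGER, param TEXT, value TEXT)"
    )
    # Split at most four times so a '|' inside the value is preserved,
    # and skip lines that are not complete five-field records.
    rows = (
        fields
        for fields in (line.rstrip("\n").split("|", 4) for line in lines)
        if len(fields) == 5
    )
    conn.executemany("INSERT INTO params VALUES (?, ?, ?, ?, ?)", rows)
    # One index per column, created after the bulk insert.
    for column in COLUMNS:
        conn.execute(f"CREATE INDEX idx_params_{column} ON params ({column})")
    conn.commit()
    return conn


def templates_using(conn, param):
    """Which templates are called with this parameter name, and how often?"""
    return conn.execute(
        "SELECT template, COUNT(*) AS calls FROM params WHERE param = ? "
        "GROUP BY template ORDER BY calls DESC, template",
        (param,),
    ).fetchall()


# Usage (assumes de.params exists in the current directory):
# with open("de.params", encoding="utf-8") as f:
#     conn = load_params(f, "de.sqlite")
# print(templates_using(conn, "GEMEINDEART"))
```

Creating the indexes after the insert rather than before is a deliberate choice here: it usually makes the bulk load faster.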
Imagine if MediaWiki supported a templateparameter table: then we
could do this in real time. Say I'm filling out an infobox
template in a WYSIWYG editor. Which parameter names should I
supply? Which values should I typically use?
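Until such a table exists, the two questions can at least be answered offline from the extracted file. Here is a minimal Python sketch, assuming the five-field format above; the function name `suggest` and its `top` argument are made up for this illustration. For a given template it lists the parameter names by how often they are supplied, each with its most common non-empty values.

```python
from collections import Counter, defaultdict


def suggest(lines, template, top=3):
    """Map each parameter name of `template` to its most common values.

    Parameter names are ordered from most to least frequently supplied.
    """
    names = Counter()
    values = defaultdict(Counter)
    for line in lines:
        fields = line.rstrip("\n").split("|", 4)
        if len(fields) != 5 or fields[0] != template:
            continue
        param, value = fields[3], fields[4]
        names[param] += 1
        if value:  # ignore empty values when ranking typical ones
            values[param][value] += 1
    return {
        param: [value for value, _count in values[param].most_common(top)]
        for param, _count in names.most_common()
    }


# Usage (assumes de.params exists in the current directory):
# with open("de.params", encoding="utf-8") as f:
#     for param, typical in suggest(f, "Ort_Schweiz").items():
#         print(param, typical)
```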
--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik -
http://aronsson.se