---------- Forwarded message ----------
From: Lars Aronsson <lars(a)aronsson.se>
Date: 06-Sep-2007 21:32
Subject: [Wikitech-l] Statistics on templates and references
To: wikitech-l(a)lists.wikimedia.org
A year ago, I wrote a little script for extracting template calls
from the XML database dump. The idea is that many templates are
infoboxes that provide structured information, such as the
population density of a country or bibliographic information in
book citations. The script is now updated to also extract ISBNs
and <ref> tags, as if these had been templates.
http://meta.wikimedia.org/wiki/User:LA2/Extraktor
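
For readers who want to experiment before looking at the script itself, the kind of extraction described above might be sketched roughly as follows. This is not the Extraktor code: the regular expressions and the names iter_page_texts and extract are assumptions made for illustration, and real wikitext (nested templates, unusual ISBN formatting) will need more care.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical patterns; the real Extraktor script may match differently.
TEMPLATE_RE = re.compile(r"\{\{\s*([^{}|]+?)\s*(?:\||\}\})")
REF_RE = re.compile(r"<ref\b[^>/]*>(.*?)</ref>", re.DOTALL | re.IGNORECASE)
ISBN_RE = re.compile(r"\bISBN[ :]*([0-9][0-9Xx -]{8,16}[0-9Xx])")


def iter_page_texts(xml_source):
    """Yield the wikitext of every <text> element in an XML dump.

    xml_source is a file name or a binary file object. The tag is compared
    without its XML namespace, since the namespace varies by dump version.
    """
    for _, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag.rsplit("}", 1)[-1] == "text":
            yield elem.text or ""
            elem.clear()  # free memory; dumps are large


def extract(wikitext):
    """Return (template names, <ref> bodies, ISBNs) found in one page."""
    templates = [m.group(1) for m in TEMPLATE_RE.finditer(wikitext)]
    refs = [m.group(1).strip() for m in REF_RE.finditer(wikitext)]
    isbns = [re.sub(r"[ -]", "", m.group(1)).upper()
             for m in ISBN_RE.finditer(wikitext)]
    return templates, refs, isbns
```

Templates that appear inside a <ref> body are reported both as part of the reference and as template calls, which seems to be in the spirit of treating references "as if these had been templates".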
I downloaded the reasonably small Wikipedia dumps for the
Scandinavian and Baltic languages and compiled some statistics,
such as the 50 most used templates, the 20 most cited ISBNs and
the 15 most common things to find inside <ref> tags.
http://meta.wikimedia.org/wiki/User:LA2/Extraktor_stats_200709
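
A "most used" list like the ones on that page can be produced from any stream of extracted names with a simple frequency count. The helper below is only a sketch; top_items is a made-up name, and the whitespace normalization is an assumption about how template names should be compared.

```python
from collections import Counter


def top_items(items, n):
    """Return the n most frequent items as (item, count), most frequent first."""
    # Assumption: names differing only in surrounding whitespace are the same.
    normalized = (item.strip() for item in items if item.strip())
    return Counter(normalized).most_common(n)
```

Calling top_items(names, 50) on all template names from one dump would give a list comparable to the "50 most used templates".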
Of these languages, Swedish is the biggest (the uncompressed
database dump is 600 MB) followed by Finnish (481 MB) and
Norwegian (415 MB). But Finnish is far ahead in the use of
references and templates. One way to describe this degree of
structure is the size of my script's output compared to its input:
Language            Dump size   Extraktor output
-----------------   ---------   ----------------
lt = Lithuanian     152 MB      18.4 % or 28 MB
no = Norwegian      415 MB      16.9 %
nn = Nynorsk         85 MB      15.3 %
fi = Finnish        481 MB      14.1 %
is = Icelandic       66 MB      12.7 %
se = Sami           5.1 MB      10.8 %
da = Danish         209 MB      10.5 %
sv = Swedish        600 MB      10.2 %
fo = Faroese        7.8 MB       8.9 %
et = Estonian       116 MB       8.3 %
lv = Latvian         45 MB       8.2 %
fiu-vro = Võro      3.5 MB       6.4 %
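
The percentage column is simply output size divided by dump size, so the two are interchangeable. As a small worked example (the function names are invented here, and the figures are taken from the table above):

```python
def output_share_percent(dump_mb, output_mb):
    """Extraktor output as a percentage of the uncompressed dump size."""
    return 100.0 * output_mb / dump_mb


def output_size_mb(dump_mb, share_percent):
    """Inverse of the above: output size in MB from dump size and share."""
    return dump_mb * share_percent / 100.0


# Lithuanian row: 152 MB dump, 18.4 % share.
lt_output = output_size_mb(152, 18.4)  # about 28 MB, as stated in the table
```

The same calculation for the other rows gives absolute output sizes, which is a different ranking from the relative one shown in the table.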
I can't fully explain why the Lithuanian WP ranks so high.
Perhaps there is an opening <ref> that doesn't close, causing many
bytes to be included? If so, my script could help to find and
hunt down such errors. (I also tried the Yiddish Wikipedia and
got an even higher ranking, but I can't understand anything of
that language, so I'm totally clueless.)
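
One way to test the unclosed-<ref> guess would be to scan each page for an opening <ref> that is followed by another opening <ref> (or by the end of the page) before any </ref>. The sketch below does that; it is not part of the Extraktor script, the name unclosed_ref_offsets is hypothetical, and it assumes that references are never legitimately nested.

```python
import re

# Matches <ref ...>, self-closing <ref ... />, and </ref>, but not <references/>.
REF_TAG_RE = re.compile(r"<(/?)ref\b([^>]*)>", re.IGNORECASE)


def unclosed_ref_offsets(wikitext):
    """Return character offsets of opening <ref> tags that are never closed."""
    unclosed = []
    open_at = None  # offset of the currently open <ref>, if any
    for m in REF_TAG_RE.finditer(wikitext):
        is_close = m.group(1) == "/"
        is_self_closing = m.group(2).rstrip().endswith("/")
        if is_close:
            open_at = None
        elif is_self_closing:
            continue  # <ref name="x"/> neither opens nor closes anything
        else:
            if open_at is not None:
                # A new <ref> started before the previous one was closed.
                unclosed.append(open_at)
            open_at = m.start()
    if open_at is not None:
        unclosed.append(open_at)  # still open at end of page
    return unclosed
```

Pages for which this returns a non-empty list would be the first candidates to inspect by hand.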
And the ranking doesn't quite capture the fact that the Finnish
Wikipedia contains 59365 <ref> tags and 15108 ISBNs, while Swedish
has 28956 and 10742, respectively, and the Norwegian 19078 and
9060. The main divide seems to be between the "good" examples
above 12% and the laggards below 12%. Swedish and Danish should
learn from Norwegian and Finnish.
My conclusions are not final. The message is that the script
exists, and you are all free to help in digging out interesting
information.
--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik - http://aronsson.se