---------- Forwarded message ----------
From: Lars Aronsson <lars@aronsson.se>
Date: 06-Sep-2007 21:32
Subject: [Wikitech-l] Statistics on templates and references
To: wikitech-l@lists.wikimedia.org

A year ago, I wrote a little script for extracting template calls from the XML database dump. The idea is that many templates are infoboxes that provide structured information, such as the population density of a country or bibliographic information in book citations. The script is now updated to also extract ISBNs and <ref> tags, as if these had been templates.
http://meta.wikimedia.org/wiki/User:LA2/Extraktor
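A minimal sketch of this kind of extraction, in Python, might look like the following. It is not the Extraktor script itself: the regular expressions and the function names (`iter_page_texts`, `extract`) are assumptions made for illustration, and real wikitext has many edge cases (nested templates, oddly formatted ISBNs, named references) that this does not handle.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical patterns, chosen for illustration; Extraktor's own rules may differ.
TEMPLATE_RE = re.compile(r"\{\{\s*([^|{}]+?)\s*(?:\||\}\})")
ISBN_RE = re.compile(r"ISBN[\s:]*([0-9][0-9Xx\- ]{8,16}[0-9Xx])")
REF_RE = re.compile(r"<ref\b[^>/]*>(.*?)</ref>", re.DOTALL | re.IGNORECASE)


def iter_page_texts(source):
    """Yield the wikitext of each <text> element in an XML dump.

    `source` is a file name or a binary file object.
    """
    for _event, elem in ET.iterparse(source):
        # Dump elements carry an XML namespace, so compare only the local name.
        if elem.tag.rsplit("}", 1)[-1] == "text":
            yield elem.text or ""
            elem.clear()  # release the text that has already been processed


def extract(wikitext):
    """Return (template names, ISBNs, <ref> contents) found in one page."""
    templates = [m.group(1) for m in TEMPLATE_RE.finditer(wikitext)]
    # Normalize ISBNs by dropping spaces and hyphens.
    isbns = [re.sub(r"[\s-]", "", m.group(1)).upper()
             for m in ISBN_RE.finditer(wikitext)]
    refs = [m.group(1).strip() for m in REF_RE.finditer(wikitext)]
    return templates, isbns, refs
```

Feeding each string from `iter_page_texts` into `extract` and tallying the results with `collections.Counter` would give "most used" lists of the kind described below.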
I downloaded the reasonably small Wikipedia dumps for the Scandinavian and Baltic languages and compiled some statistics, such as the 50 most used templates, the 20 most cited ISBNs and the 15 most common things to find inside <ref> tags.
http://meta.wikimedia.org/wiki/User:LA2/Extraktor_stats_200709
Of these languages, Swedish is the biggest (the uncompressed database dump is 600 MB) followed by Finnish (481 MB) and Norwegian (415 MB). But Finnish is far ahead in the use of references and templates. One way to describe this degree of structure is the size of my script's output compared to its input:
Language            Dump size   Extraktor output
-----------------   ---------   ----------------
lt = Lithuanian     152 MB      18.4 % or 28 MB
no = Norwegian      415 MB      16.9 %
nn = Nynorsk        85 MB       15.3 %
fi = Finnish        481 MB      14.1 %
is = Icelandic      66 MB       12.7 %
se = Sami           5.1 MB      10.8 %
da = Danish         209 MB      10.5 %
sv = Swedish        600 MB      10.2 %
fo = Faroese        7.8 MB      8.9 %
et = Estonian       116 MB      8.3 %
lv = Latvian        45 MB       8.2 %
fiu-vro = Võro      3.5 MB      6.4 %
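To make the arithmetic behind the "Extraktor output" column concrete, here is a small Python sketch. The figures are copied from the table; the names `ROWS`, `output_mb` and `output_share` are made up for this example.

```python
# (language code, uncompressed dump size in MB, Extraktor output as % of dump)
ROWS = [
    ("lt", 152, 18.4), ("no", 415, 16.9), ("nn", 85, 15.3),
    ("fi", 481, 14.1), ("is", 66, 12.7), ("se", 5.1, 10.8),
    ("da", 209, 10.5), ("sv", 600, 10.2), ("fo", 7.8, 8.9),
    ("et", 116, 8.3), ("lv", 45, 8.2), ("fiu-vro", 3.5, 6.4),
]


def output_mb(dump_mb, share_percent):
    """Size of the extracted output in MB, given the dump size and share."""
    return dump_mb * share_percent / 100.0


def output_share(dump_mb, out_mb):
    """Extracted output as a percentage of the uncompressed dump size."""
    return 100.0 * out_mb / dump_mb


# Lithuanian: 18.4 % of 152 MB is roughly 28 MB, as stated in the table.
print(round(output_mb(152, 18.4)))
```

Note that a high share only says that a large fraction of the dump's bytes sits inside templates, ISBNs and <ref> tags; it says nothing about how many of them there are.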
I can't fully explain why the Lithuanian WP ranks so high. Perhaps there is an opening <ref> that doesn't close, causing many bytes to be included? If so, my script could help to find and hunt down such errors. (I also tried the Yiddish Wikipedia and got an even higher ranking, but I can't understand anything of that language, so I'm totally clueless.)
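One way to test the unclosed-<ref> hypothesis would be to compare the number of opening and closing tags on each page. The Python sketch below is only an illustration under that assumption: the function names are hypothetical, and a plain count will miss cases where tags are balanced in number but misplaced.

```python
import re

# Any opening <ref ...> tag, including self-closing ones such as <ref name="x"/>.
OPEN_RE = re.compile(r"<ref\b[^>]*>", re.IGNORECASE)
CLOSE_RE = re.compile(r"</ref\s*>", re.IGNORECASE)


def ref_imbalance(wikitext):
    """Return (opening <ref> tags) minus (closing </ref> tags) for one page.

    Self-closing tags are ignored because they need no closing tag.
    A positive result suggests an opening <ref> that is never closed.
    """
    opens = [m.group(0) for m in OPEN_RE.finditer(wikitext)]
    opens = [tag for tag in opens
             if not tag.rstrip(">").rstrip().endswith("/")]
    closes = CLOSE_RE.findall(wikitext)
    return len(opens) - len(closes)


def pages_with_unbalanced_refs(pages):
    """Yield titles from an iterable of (title, wikitext) pairs."""
    for title, text in pages:
        if ref_imbalance(text) != 0:
            yield title
```

Running something like `pages_with_unbalanced_refs` over the Lithuanian dump would give a short list of pages to inspect by hand.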
And the ranking doesn't quite capture the fact that the Finnish Wikipedia contains 59365 <ref> tags and 15108 ISBNs, while Swedish has 28956 and 10742, respectively, and Norwegian 19078 and 9060. The main divide seems to be between the "good" examples above 12% and the laggards below 12%. Swedish and Danish should learn from Norwegian and Finnish.
My conclusions are not final. The message is that the script exists, and you are all free to help in digging out interesting information.
--
  Lars Aronsson (lars@aronsson.se)
  Aronsson Datateknik - http://aronsson.se