In Wiktionary, every language edition documents words from every
language, as I am sure you know. A typical wiki page, e.g. "war",
contains information about the English noun as well as the German verb.
Through categories, we also know how many entries there are: how many
English lemmas, how many English nouns, how many German verbs.
But if I want to plot a graph of the growth over time of English nouns
and German verbs, that data is unfortunately not available anywhere.
But it would be possible to generate such data from the history
dump, by finding out when the page "war" was created and when its
English and German sections were created. In SQL terms: for each
combination of page and section (heading), find the earliest date
when that section was present in that page. A practical
implementation would of course solve that as a single-pass filter,
reading the output of bunzip.
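
I have not written this myself, but a minimal sketch in Python of
that single-pass filter could look like the following. It is
untested; it assumes the pages-meta-history dump (element names as in
the MediaWiki export format, with each revision's <timestamp> coming
before its <text>) and that language sections are level-2 headings
like ==English==:

#!/usr/bin/env python
# Sketch: earliest appearance of each language section heading per
# page, read from a pages-meta-history XML dump on standard input.
# Usage: bzcat pages-meta-history.xml.bz2 | python first_sections.py

import re
import sys
import xml.etree.ElementTree as ET

# A level-2 heading such as ==English== (but not ===Noun===)
HEADING = re.compile(r'^==([^=].*?)==\s*$', re.MULTILINE)

first_seen = {}   # (page title, heading) -> earliest timestamp
title = None
timestamp = None

for event, elem in ET.iterparse(sys.stdin.buffer, events=('end',)):
    tag = elem.tag.rsplit('}', 1)[-1]   # strip the XML namespace
    if tag == 'title':
        title = elem.text
    elif tag == 'timestamp':
        timestamp = elem.text
    elif tag == 'text':
        for heading in HEADING.findall(elem.text or ''):
            key = (title, heading.strip())
            # ISO 8601 timestamps compare correctly as strings
            if key not in first_seen or timestamp < first_seen[key]:
                first_seen[key] = timestamp
    elif tag == 'revision':
        elem.clear()    # discard revision text to keep memory bounded
    elif tag == 'page':
        elem.clear()

for (page, section), ts in sorted(first_seen.items()):
    print('%s\t%s\t%s' % (ts, page, section))

Sorting that output by date and counting cumulatively would give the
growth curve per language; the same trick applied to level-3 headings
(===Noun===, ===Verb===) would separate the nouns from the verbs.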
So has anybody already written a program that reads through the
XML dump of articles and their history, and generates statistics
of this kind?
--
Lars Aronsson (lars(a)aronsson.se)
Linköping, Sweden