Hi Lars,
https://dkpro.github.io/dkpro-jwktl/ might be a good starting point for you. It does not solve all steps of your use case right away, but you could save a lot of implementation time compared to starting from scratch. The software is written in Java.
Best, Christian
-----Original Message-----
From: Wiktionary-l [mailto:wiktionary-l-bounces@lists.wikimedia.org] On Behalf Of Lars Aronsson
Sent: Friday, September 08, 2017 8:56 PM
To: Wikimedia developers
Cc: Wiktionary
Subject: [Wiktionary-l] Historic stats
In Wiktionary, every site/language documents words from every language, as I am sure you know. A typical wiki page, e.g. "war", contains information about the English noun as well as the German verb.
Through categories, we also know how many entries there are: how many English lemmas, how many English nouns, how many German verbs.
I would like to plot the growth over time of English nouns and German verbs, but unfortunately that data is not available anywhere. It could, however, be generated from the history dump by finding out when the page "war" was created and when its English and German sections were added. In SQL terms: for each combination of page and section (heading), find the earliest date at which that section was present in that page. A practical implementation would of course do this as a single-pass filter, reading the stdout of bunzip2.
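To make the single-pass idea concrete, here is a minimal sketch in Python. It is not an existing tool; the names first_seen and main are made up for illustration. It assumes the usual MediaWiki export layout (page, title, revision, timestamp, text elements) and that language sections are plain level-2 headings such as "==English==", which does not hold for every Wiktionary edition.

```python
import re
import sys
import xml.etree.ElementTree as ET

# Assumption: language sections are level-2 wikitext headings ("==English==").
# Deeper headings such as "===Noun===" are deliberately not matched.
HEADING = re.compile(r"^==[ \t]*([^=\n][^\n]*?)[ \t]*==[ \t]*$", re.MULTILINE)


def local(tag):
    """Drop the XML namespace prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def first_seen(stream):
    """Map (page title, section heading) to the earliest revision timestamp."""
    earliest = {}
    title = None
    for _, elem in ET.iterparse(stream, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "revision":
            timestamp, text = None, ""
            for child in elem:
                if local(child.tag) == "timestamp":
                    timestamp = child.text
                elif local(child.tag) == "text":
                    text = child.text or ""
            if timestamp is not None:
                for heading in HEADING.findall(text):
                    key = (title, heading)
                    # ISO 8601 UTC timestamps compare correctly as strings.
                    if key not in earliest or timestamp < earliest[key]:
                        earliest[key] = timestamp
            elem.clear()  # release the revision text once it has been scanned
        elif name == "page":
            # Empty page elements still pile up under the root; a real tool
            # would drop them as well.
            elem.clear()
    return earliest


def main():
    """Read an uncompressed dump from stdin, print tab-separated results."""
    for (title, heading), ts in sorted(first_seen(sys.stdin.buffer).items()):
        print(f"{ts}\t{title}\t{heading}")
```

Saved as a script (say first_seen.py, a hypothetical name) with a final call to main(), this could sit at the end of a pipe that starts with bunzip2 -c on the history dump. It takes the minimum timestamp rather than the first one encountered, so it does not depend on the order of revisions in the dump. Counting nouns and verbs would additionally require looking at the level-3 headings inside each language section, which this sketch does not do.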
So has anybody already written a program that reads through the XML dump of articles and their history, and generates statistics of this kind?