My Wiktionary parser is now available via SVN on the toolserver: http://fisheye.ts.wikimedia.org/browse/hippietrail/wiktparser
It's not a full parser yet. I'm developing several reusable libraries and a couple of small apps which use them.
Libraries:
* DumpParser.pm knows about the XML dump file format, including namespaces.
* WiktParser.pm knows about parts of how the English Wiktionary articles are formatted.
* WiktLang.pm relates language names and synonyms and alternative spellings to language codes.
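To give a feel for the dump-reading step, here is a minimal sketch in Python. It is not the Perl code in DumpParser.pm and makes no claim about how that module works. The function names `local` and `iter_pages` are hypothetical, and the set of namespace prefixes is passed in by hand as an assumption; a real reader would take it from the dump itself.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace such as '{http://...}page' down to 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream, namespaces):
    """Yield (namespace, title, text) for every <page> in a dump stream.

    `namespaces` is a set of known prefixes, e.g. {"Template"} (assumed input).
    """
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title, text = "", ""
        for child in elem.iter():
            name = local(child.tag)
            if name == "title":
                title = child.text or ""
            elif name == "text":
                text = child.text or ""
        # A title belongs to a namespace only if its prefix is a known one;
        # otherwise the colon is just part of an ordinary entry title.
        prefix, sep, _rest = title.partition(":")
        ns = prefix if sep and prefix in namespaces else ""
        yield ns, title, text
        elem.clear()  # free memory; full dumps are far too big to keep in RAM


# Tiny hand-written sample, not real dump data.
SAMPLE = """<mediawiki>
  <page><title>Haus</title><revision><text>==German==</text></revision></page>
  <page><title>Template:de</title><revision><text>German</text></revision></page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE.encode("utf-8")), {"Template"}))
```

Streaming with `iterparse` and clearing each page after use keeps memory flat, which matters because the dump is a single very large XML file.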
Apps:
* wiktparser.pl extracts the nouns of a given language along with their gender, homonym numbers and sense numbers. It also produces a log file of the entries it could not parse.
* extractlangcodes.pl looks for all templates and articles which contain information relating language codes to language names or vice versa, and outputs a table showing which sets of language names relate to which sets of language codes.
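As a rough illustration of the kind of extraction wiktparser.pl does, here is a small Python sketch. It is not a port of the Perl script. It assumes a deliberately simplified entry layout: a level-2 language heading, a "Noun" heading below it, and a bold headword line carrying a gender template written as `{{m}}`, `{{f}}` or `{{n}}`. The name `extract_nouns` is hypothetical, and homonym and sense numbers are left out entirely.

```python
import re

LANG_RE = re.compile(r"^==\s*([^=]+?)\s*==\s*$")          # ==German==
POS_RE = re.compile(r"^={3,4}\s*([^=]+?)\s*={3,4}\s*$")   # ===Noun===
GENDER_RE = re.compile(r"\{\{([mfn])\}\}")                # {{m}} {{f}} {{n}}


def extract_nouns(title, wikitext, language):
    """Return (found, unparsed) for one article.

    found    -- list of (title, gender) pairs for `language`
    unparsed -- list of (title, line) headword lines with no gender found,
                i.e. what would go into a log file
    """
    found, unparsed = [], []
    lang = None
    in_noun = False
    for line in wikitext.splitlines():
        m = LANG_RE.match(line)
        if m:
            lang, in_noun = m.group(1), False
            continue
        m = POS_RE.match(line)
        if m:
            in_noun = lang == language and m.group(1) == "Noun"
            continue
        # Assumption: the headword line is the first bold line in the section.
        if in_noun and line.startswith("'''"):
            g = GENDER_RE.search(line)
            if g:
                found.append((title, g.group(1)))
            else:
                unparsed.append((title, line))
            in_noun = False
    return found, unparsed


# Hand-written sample text, not copied from a real entry.
TEXT = (
    "==German==\n===Noun===\n'''Haus''' {{n}}\n# house\n"
    "==Dutch==\n===Noun===\n'''huis'''\n"
)
german = extract_nouns("Haus", TEXT, "German")
dutch = extract_nouns("huis", TEXT, "Dutch")
```

Returning the unparsed lines separately, rather than silently dropping them, mirrors the idea of keeping a log of entries that could not be parsed so the patterns can be widened later.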
Please try out these tools and comment here. I'm actively refactoring and generalizing the code now rather than trying to extract other parts of speech or parse more variants of headword/inflection lines or definition lines.
Andrew Dunbar (hippietrail)