Re: [Wiktionary-l] Wiktionary parsers

20 Nov 2007

My Wiktionary parser is now available via svn on the toolserver:
http://fisheye.ts.wikimedia.org/browse/hippietrail/wiktparser

It's not a full parser yet. I'm developing several reusable libraries
and a couple of small apps which use them.

Libraries:

* DumpParser.pm knows about the XML dump file format, including namespaces.
* WiktParser.pm knows about parts of how the English Wiktionary
articles are formatted.
* WiktLang.pm relates language names and synonyms and alternative
spellings to language codes.
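
For readers who have not worked with the dump files, here is a rough
sketch of the kind of job DumpParser.pm is described as doing. It is
not the code from the repository (that is Perl); it is a small Python
illustration, and the function names iter_pages and local are made up
for it. It assumes the usual MediaWiki export layout (a siteinfo block
listing namespaces, followed by page elements with a title and
revision text) and assumes the namespace has to be worked out from the
title prefix.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the '{xmlns}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (namespace_key, title, text) for each page in an XML dump.

    Hypothetical helper: namespaces are read from <siteinfo>, and each
    page's namespace is guessed from the prefix before the first colon
    in its title (0, the main namespace, when no known prefix matches).
    """
    namespaces = {}
    title = text = None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local(elem.tag)
        if name == "namespace":
            namespaces[elem.text or ""] = int(elem.get("key"))
        elif name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            prefix, sep, _rest = title.partition(":")
            ns = namespaces.get(prefix, 0) if sep else 0
            yield ns, title, text
            title = text = None
            elem.clear()  # free memory; real dumps are very large


SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <siteinfo>
    <namespaces>
      <namespace key="0" />
      <namespace key="10">Template</namespace>
    </namespaces>
  </siteinfo>
  <page><title>chien</title><revision><text>==French==</text></revision></page>
  <page><title>Template:fr</title><revision><text>French</text></revision></page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE.encode("utf-8"))))
for ns, page_title, _ in pages:
    print(ns, page_title)
```

Streaming with iterparse rather than loading the whole tree is the
point of the sketch, since a full dump will not fit comfortably in
memory.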

Apps:

* wiktparser.pl extracts nouns of a given language along with their
gender and homonym and sense numbers. It also produces a log file of
entries which it could not parse.
* extractlangcodes.pl looks for all templates and articles which
contain information relating language codes to language names or vice
versa and outputs a table of which sets of language names relate to
which set of language codes.
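
To make the wiktparser.pl description more concrete, here is a minimal
sketch of that sort of extraction. Again this is not the actual Perl
code, just a Python illustration under stated assumptions: that a
language is a level-2 heading, that numbered "Etymology N" headings
give the homonym number, that gender appears as a template such as
{{m}} or {{f}} on the bold headword line, and that each top-level "#"
line is one sense. The name extract_nouns is hypothetical, and real
entries vary far more than this handles.

```python
import re

HEADING = re.compile(r"^(=+)\s*(.*?)\s*=+\s*$")
GENDER = re.compile(r"\{\{([mfnc])\}\}")   # assumed gender template form
ETYMOLOGY = re.compile(r"Etymology (\d+)$")


def extract_nouns(title, wikitext, language):
    """Return (title, gender, homonym, sense) tuples for one language.

    homonym is 0 when the entry has no numbered Etymology sections.
    """
    results = []
    in_language = in_noun = False
    homonym = sense = 0
    gender = None
    for line in wikitext.splitlines():
        heading = HEADING.match(line)
        if heading:
            level, name = len(heading.group(1)), heading.group(2)
            if level == 2:                      # new language section
                in_language = (name == language)
                homonym = 0
            ety = ETYMOLOGY.match(name)
            if in_language and ety:
                homonym = int(ety.group(1))
            in_noun = in_language and name == "Noun"
            if in_noun:
                sense, gender = 0, None
            continue
        if not in_noun:
            continue
        if line.startswith("'''"):              # headword line
            found = GENDER.search(line)
            gender = found.group(1) if found else None
        elif line.startswith("#") and not line.startswith(("#:", "#*")):
            sense += 1                          # definition line
            results.append((title, gender, homonym, sense))
    return results


SAMPLE = """==French==

===Etymology 1===

====Noun====
'''livre''' {{m}}

# [[book]]

===Etymology 2===

====Noun====
'''livre''' {{f}}

# [[pound]] (unit of weight)
# [[pound]] (currency)

==Portuguese==

===Adjective===
'''livre'''

# [[free]]
"""

nouns = extract_nouns("livre", SAMPLE, "French")
for row in nouns:
    print(row)
```

Anything the sketch cannot classify is silently skipped; the real tool
instead writes such entries to a log file, which is the more useful
behaviour when the goal is to find formatting variants.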

Please try out these tools and comment here. I'm actively refactoring
and generalizing the code now rather than trying to extract other
parts of speech or parse more variants of headword/inflection lines or
definition lines.

Andrew Dunbar (hippietrail)
