Re: [Wiktionary-l] integrating wiktionary - Wiktionary-l

12 Dec 2011


      Hi Cedric,
we are currently working on an extractor for Wiktionary that reuses and
extends the DBpedia framework [1]. In my opinion there is no best
practice yet; it is a work in progress, and it is not trivial. If you
want an extractor for many languages, a straightforward regex approach
will fail at the second or third language you include, because of the
heterogeneous syntax and modeling. So we are trying to build a
declarative parser that interprets a rather complex config file
containing little "templates", which define which element of a
Wiktionary page should be interpreted and processed in which way. So far
we have made configs for German and English. The data we extract is the
"entry layout" (the language, etymology and part of speech; everything
that's in the "outline" boxes), and for each of these "contexts" we
extract the definition sentence. Of course we aim to extract many more
properties later.
You can check out our current state from our Mercurial repo:
        hg clone http://dbpedia.hg.sourceforge.net/hgroot/dbpedia/extraction_framework dbpedia
        cd dbpedia
        hg update wiktionary
        cd core
        mvn install
        cd ../dump
        mvn install
        cd ../wiktionary
        mkdir wiktionaryDump
Copy the enwiktionary-???-pages-articles.xml file from [2] into that new
folder. The language should be the one set in config.xml, and a
config-[language].xml needs to exist. Then run
        mvn scala:run
The extraction outputs RDF data (N-Triples format, which can easily be
transformed to XML). Unfortunately there is no comprehensive
documentation yet and we are pre-beta, but we would like to get feedback
and/or requirements. In a few days I will send an official announcement
to our mailing list [3], containing more details and dump files for en
and de.
But depending on your use case, this could be over-complicated for your
needs, and you would depend on us... If you only need one language,
another idea we had (but have not implemented) could be practicable:
Use regexes to replace everything that carries special meaning with XML
nodes. Then apply a set of rules that order this flat sequence of nodes
hierarchically. Finally you can iterate over the XML tree and extract
what you want using XPath or an XML API.
An example (in pseudo-Scala, not compiling):
        val page = """==English==
          |===Noun===
          |* Something that is...
          |===Verb===
          |* to be...""".stripMargin
now we apply some regexes like
        // level = number of '=' signs; group(0) would be the whole match
        var pageXMLFlat = new Regex("(?m)^(=+)([^=]+)\\1$").replaceAllIn(page, m =>
        "<section level='"+m.group(1).length+"' title='"+m.group(2)+"' />")
        ...
and we get 
        <section level="2" title="English" />
        <section level="3" title="Noun" />
        <indent/><text content="Something that is..."/><linebreak/>
        <section level="3" title="Verb" />
        <indent/><text content="to be..."/><linebreak/>
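For reference, here is a slightly more complete sketch of this flattening
step that should actually compile. It is only an illustration: the object
name FlattenSketch, the two patterns and the esc helper are made up for
this mail, and they cover just headings and bullet lines, nothing like
the full Wiktionary syntax.

```scala
object FlattenSketch {
  // Hypothetical patterns for illustration; real pages need many more.
  private val Heading = """^(=+)\s*(.+?)\s*=+$""".r
  private val Bullet  = """^\*\s*(.*)$""".r

  // Escape characters that must not appear verbatim in an attribute value.
  private def esc(s: String): String =
    s.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;")

  // Translate one line of wikitext into a flat XML fragment.
  def flattenLine(line: String): String = line.trim match {
    case Heading(marks, title) =>
      s"""<section level="${marks.length}" title="${esc(title)}" />"""
    case Bullet(text) =>
      s"""<indent/><text content="${esc(text)}"/><linebreak/>"""
    case other =>
      s"""<text content="${esc(other)}"/><linebreak/>"""
  }

  // Flatten a whole page, skipping empty lines.
  def flatten(page: String): String =
    page.split("\n").map(_.trim).filter(_.nonEmpty).map(flattenLine).mkString("\n")
}
```

Matching line by line avoids the problem that a single "==...==" regex
also fires inside "===Noun===", and the heading level simply falls out of
the number of "=" signs.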
then we try to bring in some hierarchy heuristic
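        // wrap in a root element, otherwise the flat sequence is not well-formed XML
        val nodes = XML.loadString("<page>" + pageXMLFlat + "</page>").child
        val stack = new Stack(nodes)
        while(stack.size > 0){
          val n = stack.pop
          // level and addChildren are pseudo-code, well, you get the idea
          val sub = stack.takeWhile(o => n.level != o.level)
          n addChildren sub
        }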
and we get something like
<section level="2" title="English">
  <section level="3" title="Noun">
    <line><indent/><text content="Something that is..."/></line>
  </section>
  <section level="3" title="Verb">
    <line><indent/><text content="to be..."/></line>
  </section>
</section>
As long as the structure of the page is stable (within one language it
mostly is), you can work with this XML. Depending on how deep you go
with the replacements (even replacing commas, e.g. in the list of
synonyms), you could get a pretty detailed representation of the page.
We would also be interested in that.
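To make the hierarchy step more concrete, here is a minimal sketch that
should compile. It is an assumption-laden illustration, not what our
extractor does: it skips the intermediate XML and builds a small tree of
case classes directly, and the names NestSketch, Section, close and parse
are invented for this mail. Only headings and plain lines are handled.

```scala
object NestSketch {
  // A minimal tree: a section holds its own text lines and its subsections.
  final case class Section(level: Int, title: String,
                           lines: Vector[String] = Vector.empty,
                           children: Vector[Section] = Vector.empty)

  private val Heading = """^(=+)\s*(.+?)\s*=+$""".r

  // Attach open sections to their parents until the top of the stack
  // has a lower level than `level` (or only the root is left).
  private def close(stack: List[Section], level: Int): List[Section] = stack match {
    case top :: parent :: rest if top.level >= level =>
      close(parent.copy(children = parent.children :+ top) :: rest, level)
    case _ => stack
  }

  def parse(page: String): Section = {
    val root = Section(0, "root")
    val open = page.split("\n").map(_.trim).filter(_.nonEmpty)
      .foldLeft(List(root)) { (stack, line) =>
        line match {
          // A heading closes every open section of the same or deeper level.
          case Heading(marks, title) =>
            Section(marks.length, title) :: close(stack, marks.length)
          // Any other line belongs to the innermost open section.
          case text =>
            stack.head.copy(lines = stack.head.lines :+ text) :: stack.tail
        }
      }
    close(open, 0).head
  }
}
```

With the example page from above, NestSketch.parse(page) gives a root
with one child "English", which in turn has the children "Noun" and
"Verb", each holding its bullet line. From there you can walk the tree
and pick out whatever you need.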
Regards,
Jonas
[1] http://dbpedia.org/About
[2] http://dumps.wikimedia.org/backup-index.html
[3] open-linguistics@lists.okfn.org
...
Cedric De Vroey cedric.devroey@gmail.com wrote:
        Hi Guys,
    I'd like to integrate and cache wiktionary in an application I'm developing
    but I'm having this problem: How can I retrieve content from wiktionary in
    a structured format or translate it to a structured format (with structured
    format like XML, CSV, JSON,...)? I have already looked into the
    Special:Export page but that didn't really help me because the actual
    content is still all in one field. Are there any known best practices to do
    this?

    Thanks!
    Cedric

    ______________________________________________________________

    Wiktionary-l mailing list
    Wiktionary-l@lists.wikimedia.org
    https://lists.wikimedia.org/mailman/listinfo/wiktionary-l