Hi Cedric,
we are currently working on an extractor for Wiktionary that reuses and extends the DBpedia framework [1]. In my opinion there is no best practice yet; it is a work in progress, and it is not trivial. If you want an extractor for many languages, a straightforward regex approach will fail at the second or third language you include, because syntax and modeling differ between the language editions.

So we are trying to build a declarative parser that interprets a rather complex config file. The config contains small "templates" that define which element of a Wiktionary page is interpreted and how it is processed. So far we have written configs for German and English. The data we extract is the "entry layout" (the language, etymology and part of speech; everything that is in the "outline" boxes), and for each of these "contexts" we extract the definition sentence. Of course we aim to extract many more properties later.

You can check out our current state from our Mercurial repository:

    hg clone http://dbpedia.hg.sourceforge.net/hgroot/dbpedia/extraction_framework dbpedia
    cd dbpedia
    hg update wiktionary
    cd core
    mvn install
    cd ../dump
    mvn install
    cd ../wiktionary
    mkdir wiktionaryDump

Copy the enwiktionary-???-pages-articles.xml file from [2] into that new folder. The language of the dump should be the one set in config.xml, and a matching config-[language].xml needs to exist. Then run:

    mvn scala:run
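To illustrate the declarative idea described above (one generic engine, one config per language), here is a minimal sketch. It is not our actual config format or parser: the `Rule` type, the `configs` map and the property names are made up for this mail, and the German heading patterns are my assumption of what typical de.wiktionary lines look like.

```scala
import scala.util.matching.Regex

object DeclarativeSketch {
  // Hypothetical rule: a pattern that recognises one page element and the
  // name of the property its first capture group is stored under.
  final case class Rule(property: String, pattern: Regex)

  // Hypothetical per-language configs (assumed heading formats).
  val configs: Map[String, List[Rule]] = Map(
    "en" -> List(
      Rule("language", """^==([^=]+)==$""".r),
      Rule("partOfSpeech", """^===([^=]+)===$""".r)
    ),
    "de" -> List(
      Rule("language", """^== .+ \(\{\{Sprache\|([^}]+)\}\}\) ==$""".r),
      Rule("partOfSpeech", """^=== \{\{Wortart\|([^|}]+)\|.*$""".r)
    )
  )

  // The engine is language-agnostic: it only applies whatever rules the
  // config for the requested language provides, line by line.
  def extract(lang: String, page: String): List[(String, String)] =
    for {
      line <- page.split("\n").toList.map(_.trim)
      rule <- configs.getOrElse(lang, Nil)
      m    <- rule.pattern.findFirstMatchIn(line)
    } yield rule.property -> m.group(1)
}
```

Adding a language then means adding a config entry, not touching `extract`. The real config is considerably richer than a list of regexes, so take this only as a picture of the separation between engine and config.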
The extraction outputs RDF data (N-Triples format, which can easily be transformed to XML). Unfortunately there is no comprehensive documentation yet, and we are pre-beta. But we would like to get feedback and/or requirements. In a few days I will send an official announcement to our mailing list [3], containing more details and dump files for en and de.
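In case N-Triples is unfamiliar: it is a line-based format with one "<subject> <predicate> object ." statement per line. The following is only a sketch of how such a line can be produced. The `NTriplesSketch` object and the example.org URIs are placeholders invented for this mail, not the vocabulary our extractor uses.

```scala
object NTriplesSketch {
  // Escape the characters that must not appear raw inside an N-Triples
  // string literal: backslash, double quote, and line-breaking characters.
  def escape(s: String): String = {
    val sb = new StringBuilder
    s.foreach {
      case '\\' => sb.append("\\\\")
      case '"'  => sb.append("\\\"")
      case '\n' => sb.append("\\n")
      case '\r' => sb.append("\\r")
      case '\t' => sb.append("\\t")
      case c    => sb.append(c)
    }
    sb.toString
  }

  // Build one triple whose object is a plain string literal.
  // subject and predicate are expected to be absolute URIs.
  def literalTriple(subject: String, predicate: String, value: String): String =
    s"<$subject> <$predicate> " + "\"" + escape(value) + "\" ."
}
```

For example, `NTriplesSketch.literalTriple("http://example.org/word/house", "http://example.org/definition", "Something that is...")` yields a single line that any RDF toolkit should be able to read and re-serialise as XML.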
But depending on your use case, this could be over-complicated for your needs, and you would depend on us... If you only need one language, another idea we had (but have not implemented) could be practicable: regex-replace everything that has a special meaning with XML nodes. Then apply a set of rules that order this flat sequence of nodes hierarchically. Finally, you can iterate over the XML tree and extract what you want using XPath or an XML API.

An example (in pseudo-Scala, not compiling):

    val page = """==English==
    ===Noun===
    * Something that is...
    ===Verb===
    * to be..."""

Now we apply some regexes like

    // level-2 headings only; group(1) is the title between the '==' markers
    var pageXMLFlat = new Regex("(?m)^==([^=]+)==$").replaceAllIn(page, m => "<section level='2' title='" + m.group(1) + "' />")
    ...

and we get

    <section level="2" title="English" />
    <section level="3" title="Noun" />
    <indent/><text content="Something that is..."/><linebreak/>
    <section level="3" title="Verb" />
    <indent/><text content="to be..."/><linebreak/>

Then we try to bring in some hierarchy with a heuristic:

    // wrap the flat sequence in a root element so that it parses as one document
    val nodes = XML.loadString("<page>" + pageXMLFlat + "</page>").child
    val stack = new Stack(nodes)
    while (stack.size > 0) {
      val n = stack.pop
      // everything deeper than n, up to the next node at n's level or above,
      // becomes a child of n (well, you get the idea)
      val sub = stack.takeWhile(o => o.level > n.level)
      n addChildren sub
    }

and we get something like

    <section level="2" title="English">
      <section level="3" title="Noun">
        <line><indent/><text content="Something that is..."/></line>
      </section>
      <section level="3" title="Verb">
        <line><indent/><text content="to be..."/></line>
      </section>
    </section>

As long as the structure of the page is stable (within one language it mostly is), you can work with this XML. Depending on how deep you go with the replacements (even replacing commas etc., e.g. for the list of synonyms), you could get a pretty detailed representation of the page. We would also be interested in that.
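For what it is worth, the single-language idea above can be sketched in Scala that actually compiles. This is a simplification under my own assumptions, not something we have built: it skips the intermediate flat XML and nests sections directly by heading level, uses plain case classes instead of an XML library, and the names `WikiOutlineSketch`, `Section`, `parse` and `definitions` are invented for this mail.

```scala
import scala.collection.mutable.ListBuffer

object WikiOutlineSketch {
  // One section of a page: heading level, title, its text lines, subsections.
  final case class Section(
      level: Int,
      title: String,
      lines: ListBuffer[String] = ListBuffer.empty,
      children: ListBuffer[Section] = ListBuffer.empty)

  // A heading such as "===Noun===": the same run of '=' on both sides.
  private val Heading = """^(={2,6})\s*([^=]+?)\s*\1$""".r

  def parse(page: String): Section = {
    val root = Section(1, "page")
    var open = List(root) // currently open sections, innermost first
    for (raw <- page.split("\n").map(_.trim) if raw.nonEmpty) {
      raw match {
        case Heading(marks, title) =>
          val level = marks.length
          // Close every open section at the same or a deeper level.
          // The root has level 1, so it always stays open.
          open = open.dropWhile(_.level >= level)
          val section = Section(level, title)
          open.head.children += section
          open = section :: open
        case text =>
          // Plain line: attach it to the innermost open section.
          open.head.lines += text.stripPrefix("*").trim
      }
    }
    root
  }

  // Rough stand-in for an XPath query like //section[@title='Noun']/line.
  def definitions(root: Section, title: String): List[String] = {
    val here = if (root.title == title) root.lines.toList else Nil
    here ++ root.children.toList.flatMap(definitions(_, title))
  }
}
```

Calling `WikiOutlineSketch.parse` on the example page and then `definitions(root, "Noun")` gives back the noun definition line. A stack of open sections is the whole "hierarchy heuristic" here; real pages would need more rules (templates, nested lists, translations tables), which is where the effort goes.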
Regards, Jonas
[1] http://dbpedia.org/About
[2] http://dumps.wikimedia.org/backup-index.html
[3] open-linguistics@lists.okfn.org
Cedric De Vroey cedric.devroey@gmail.com wrote:

Hi Guys,
I'd like to integrate and cache Wiktionary in an application I'm developing, but I'm having this problem: how can I retrieve content from Wiktionary in a structured format, or translate it to a structured format (such as XML, CSV, JSON, ...)? I have already looked into the Special:Export page, but that didn't really help me because the actual content is still all in one field. Are there any known best practices for this?

Thanks!
Cedric

______________________________________________________________
Wiktionary-l mailing list
Wiktionary-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiktionary-l