Dear all,
Over the past months, Sebastian and I have worked on a DBpedia-based
extractor for Wiktionary. The main goal was to make it so configurable
that applying it to the different language versions of Wiktionary is
just a matter of configuration, not programming. The configuration
should be doable by someone who has a good understanding of the wiki
syntax (and currently XML, but we plan to hide that too, via a
web-based frontend similar to the mappings wiki) but not Scala or RDF.
We now have configs and dumps for the English and German Wiktionary, to
show the state of our development and to initiate a discussion about
design and implementation. If you are not interested in the technical
details, you may skip the description below and just evaluate the dumps.
The English dump contains 16M triples and took 9 days on my dual-core
2 GHz laptop; that's about 4 articles per second. The German dump
contains 1.3M triples and took 3 hours (German has only 300k articles,
whereas English has 3M, and the German config is smaller). We know that
there is some "noise" in the data (incorrectly parsed data); we will
fix that in the next weeks.
The source code is in the "wiktionary" branch of the DBpedia SVN repo;
the dump files can be downloaded here:
http://downloads.dbpedia.org/wiktionary/wiktionary.dbpedia.org_en.nt.bz2
http://downloads.dbpedia.org/wiktionary/wiktionary.dbpedia.org_de.nt.bz2
The idea is somewhat different from DBpedia (although we use the
framework): instead of infoboxes and very specific extractors, we tried
to build a meta-extractor that is declarative rather than imperative.
The rationale is that although some scrapers exist, none of them can
parse more than 3-5 languages. So we encode the language-specific
characteristics of each Wiktionary in a machine-readable format (e.g.
the "config-de.xml").
Top-down, these properties are:
* the Entry Layout (EL)
e.g. in the German Wiktionary, a given page has the structure: lexical
entity -> languages it occurs in -> part of speech it is used as ->
different senses/meanings -> properties like synonyms, example
sentences, pronunciation, translations, etc.
In the English Wiktionary there is an etymology section after the
language. This structure implies the schema for our extracted ontology
(an ER diagram).
More about the EL: http://en.wiktionary.org/wiki/Wiktionary:Entry_layout_explained
We configure the EL with nested XML nodes and define which URIs are to
be used. The EL differs greatly between Wiktionaries. The question is:
how do we ensure a common schema? Currently we leave that open - the
schema of the resulting RDF is implicitly inferred from the EL. Either
we come up with a good idea on how to transform it automatically (how
to configure it to auto-transform easily), or we leave that merging
step to specialized tools. Which schema should be the global one? Lemon?
* Context-defining markers (CDM)
A big word, but it just means a label occurring within the EL. A
made-up example:
== green (Englisch) ==
=== Adjektiv ===
[1] eine Farbe
[2] umweltfreundlich
In this wiki snippet there are two CDMs: "Englisch" and "Adjektiv". They
indicate that the following section is about an English word whose part
of speech is adjective. Obviously we need a mapping from them to a
shared vocabulary. Such a mapping is easy and part of the configuration.
But a nice thing to have would be an ontology backing this vocabulary:
some ontology of PoS (GOLD?) and of languages (ISO 639-3?) - we should
discuss what to use there.
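To make the idea concrete, such a CDM mapping could look like the
following Python sketch. The labels come from the example above; the
target URIs (Lexvo for ISO 639-3 languages, GOLD for parts of speech)
are illustrative assumptions, not a settled choice:

```python
# Hypothetical mapping from language-specific CDM labels (here: German
# Wiktionary section headings) to URIs of a shared vocabulary.
# The Lexvo and GOLD URIs are illustrative, not a final decision.
CDM_MAPPING = {
    # language markers -> ISO 639-3 based language URIs (via Lexvo)
    "Englisch": "http://lexvo.org/id/iso639-3/eng",
    "Deutsch": "http://lexvo.org/id/iso639-3/deu",
    # part-of-speech markers -> GOLD ontology classes
    "Adjektiv": "http://purl.org/linguistics/gold/Adjective",
    "Substantiv": "http://purl.org/linguistics/gold/Noun",
}

def map_cdm(label):
    """Resolve a context-defining marker to its shared-vocabulary URI."""
    return CDM_MAPPING[label]
```

In the real extractor this mapping would of course live in the
per-language XML configuration, not in code.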
* wiki templates
Now we come to the core of the extraction. We built an engine that can
match a given Wiktionary page against several "extraction templates"
(ETs). An ET is wiki syntax, but it can contain placeholders/variables
and control symbols that indicate the possible repetition of parts
(like the regex "(ab)*" matches "ababab"). The engine then fills the
placeholders with information scraped from the page (in other words, it
binds the variables). The configuration declares what to do with the
bound variables; often that is "use it as the literal object of
predicate x", but we envision more complex transformations there, like
"format to a URI by ..." or "call a static method y".
An example: we take the wiki snippet from above (green) as the page and
define an ET like this:
<template name="definitions">
  <vars>
    <var name="definition" property="rdfs:comment" />
  </vars>
  <wikiSyntax>([$id] $definition
)*
  </wikiSyntax>
</template>
The syntax looks like regular expressions, but we only allow ()*, ()+
and ()?. Then you will notice the variables/placeholders: the extractor
determines what's on the actual page in their place.
The engine finds a set of bindings:
definition -> "eine Farbe"
definition -> "umweltfreundlich"
and then generates triples according to the config:
wiktionary:green-english-adjective rdfs:comment "eine Farbe" .
wiktionary:green-english-adjective rdfs:comment "umweltfreundlich" .
The properties used (rdfs:comment is just a made-up example) and the
namespaces are open to discussion.
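As a toy illustration (a Python stand-in, not our actual Scala engine),
the matching-and-binding step for this ET can be sketched by translating
the repetition ()* into a repeated regex match with named groups for
the placeholders:

```python
import re

# The page body from the example above (the two numbered senses).
page = "[1] eine Farbe\n[2] umweltfreundlich\n"

# The ET "([$id] $definition\n)*" translated into a regex; the named
# groups stand in for the $id and $definition placeholders.
et = re.compile(r"\[(?P<id>\d+)\] (?P<definition>.+)\n")

def extract(page, subject, prop):
    """Bind $definition for every repetition and emit one triple each."""
    return [(subject, prop, m.group("definition"))
            for m in et.finditer(page)]

triples = extract(page, "wiktionary:green-english-adjective", "rdfs:comment")
# one triple per binding of $definition
```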
Our prototype recognizes the EL and thus gives information about the
languages and PoS usages of all words in the Wiktionary, and it has ETs
for definitions, hyphenation and example sentences. The next steps will
be either expanding to more languages or first going deeper within the
German and English Wiktionary: extracting synonyms (to get a
community-built WordNet) and translations.
So what do you think? What are important things to keep in mind, wishes,
comments etc?
Regards,
Jonas
Hi Cedric,
I forgot to mention two existing tools (duh):
Russian and English:
http://code.google.com/p/wikokit/
German and English, well researched, available for research purposes:
http://www.ukp.tu-darmstadt.de/software/jwktl/
Regards,
Jonas
> Cedric De Vroey <cedric.devroey(a)gmail.com> wrote:
> Hi Guys,
>
> I'd like to integrate and cache wiktionary in an application I'm developing
> but I'm having this problem: How can I retrieve content from wiktionary in
> a structured format or translate it to a structured format (with structured
> format like XML, CSV, Json,...)? I have already looked into the
> Special:Export page but that didn't really helped me cause the actual
> content is still all in one field. Are there any known best-practices to do
> this?
>
> Thanks!
> Cedric
>
> ______________________________________________________________
>
> Wiktionary-l mailing list
> Wiktionary-l(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wiktionary-l
Hi Cedric,
we are currently working on an extractor for Wiktionary,
reusing/extending the DBpedia framework [1]. In my opinion there is no
best practice yet; it's a work in progress, and it's not trivial: if
you want an extractor for many languages, a straightforward regex
approach will fail at the second or third language you want to include,
because of heterogeneous syntax and modeling. So we are building a
declarative parser that interprets a rather complex config file,
containing little "templates" that define which element of a Wiktionary
page should be interpreted and processed in which way. So far we have
made configs for German and English, and the data we extract is the
"entry layout" (the language, etymology and part of speech - everything
that's in the "outline" boxes); for each of these found "contexts", we
extract the definition sentence. Of course we aim to extract many more
properties later.
You can check out our current state from our Mercurial repo:
hg clone http://dbpedia.hg.sourceforge.net/hgroot/dbpedia/extraction_framework dbpedia
cd dbpedia
hg update wiktionary
cd core
mvn install
cd ../dump
mvn install
cd ../wiktionary
mkdir wiktionaryDump
Copy the enwiktionary-???-pages-articles.xml file from [2] into that
new folder; the language should be the one set in config.xml, and a
config-[language].xml needs to exist. Then run:
mvn scala:run
The extraction outputs RDF data (N-Triples format, which can be
transformed to XML easily). Unfortunately there is no comprehensive
documentation yet, and we are pre-beta. But we would like to get
feedback and/or requirements. In a few days I will send an official
announcement to our mailing list [3], containing more details and dump
files for en and de.
But depending on your use case, this could be over-complicated for your
needs, and you would depend on us... If you only need one language,
another idea we had (but have not implemented) could be practicable:
regex-replace everything that has a special semantics with XML nodes.
Then apply a set of rules that hierarchically order this flat sequence
of nodes. Finally, you can iterate over the XML tree and extract what
you want using XPath or an XML API.
An example (in pseudo-Scala, not compiling):
val page = """==English==
===Noun===
* Something that is...
===Verb===
* to be..."""
Now we apply some regexes like
var pageXMLFlat = new Regex("==(.*?)==").replaceAllIn(page, m =>
  "<section level='2' title='" + m.group(1) + "' />")
...
and we get
<section level="2" title="English" />
<section level="3" title="Noun" />
<indent/><text content="Something that is..."/><linebreak/>
<section level="3" title="Verb" />
<indent/><text content="to be..."/><linebreak/>
Then we try to bring in some hierarchy heuristic:
val nodes = XML.fromString(pageXMLFlat)
val stack = new Stack(nodes)
while (stack.size > 0) {
  val n = stack.pop
  val sub = stack.takeWhile(o => n.level != o.level) // well, you get the idea
  n addChildren sub
}
and we get something like
<section level="2" title="English">
<section level="3" title="Noun">
<line><indent/><text content="Something that is..."/></line>
</section>
<section level="3" title="Verb">
<line><indent/><text content="to be..."/></line>
</section>
</section>
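The stack heuristic can be made concrete; here is a runnable Python
sketch (illustrative only, not our actual code) that nests a flat list
of (level, title) sections by heading level:

```python
# Nest a flat sequence of (level, title) sections by heading level,
# mirroring the stack-based heuristic sketched above: each section
# becomes a child of the nearest preceding shallower section.
def nest(sections):
    root = {"level": 1, "title": None, "children": []}
    stack = [root]
    for level, title in sections:
        node = {"level": level, "title": title, "children": []}
        # pop back to the nearest section shallower than this one
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root["children"]

flat = [(2, "English"), (3, "Noun"), (3, "Verb")]
tree = nest(flat)
# "Noun" and "Verb" end up as children of "English"
```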
As long as the structure of the page is stable (and within one language
it mostly is), you can work with this XML... Depending on how deep you
go with the replacements (replacing even commas etc., e.g. for the list
of synonyms), you could get a pretty detailed representation of the
page. We would also be interested in that.
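For the last step, the standard library is enough; a small Python
sketch querying the nested XML from the example above with ElementTree:

```python
import xml.etree.ElementTree as ET

# The nested XML result from the example above.
doc = ET.fromstring("""
<section level="2" title="English">
  <section level="3" title="Noun">
    <line><indent/><text content="Something that is..."/></line>
  </section>
  <section level="3" title="Verb">
    <line><indent/><text content="to be..."/></line>
  </section>
</section>
""")

# Pull all definition texts under the Noun section.
noun = doc.find(".//section[@title='Noun']")
noun_defs = [t.get("content") for t in noun.findall(".//text")]
# -> ['Something that is...']
```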
Regards,
Jonas
[1] http://dbpedia.org/About
[2] http://dumps.wikimedia.org/backup-index.html
[3] open-linguistics(a)lists.okfn.org
Hi Guys,
I'd like to integrate and cache Wiktionary in an application I'm
developing, but I'm having this problem: how can I retrieve content
from Wiktionary in a structured format, or translate it to a structured
format (structured meaning XML, CSV, JSON, ...)? I have already looked
into the Special:Export page, but that didn't really help me, because
the actual content is still all in one field. Are there any known best
practices for this?
Thanks!
Cedric