Dear all,
over the past months Sebastian and I have been working on a DBpedia-based extractor for Wiktionary. The main goal was to make it so configurable that applying it to the different language versions of Wiktionary is a matter of configuration, not programming. Writing a configuration should be possible for someone who has a good understanding of wiki syntax (and currently XML, though we plan to hide that too, via a web-based frontend similar to the mappings wiki) but who does not know Scala or RDF.
We now have configs and dumps for the English and German Wiktionary as examples, to show the state of our development and to start a discussion about design and implementation. If you are not interested in the technical details, you may skip the description below and just evaluate the dumps.

The English dump contains 16M triples and took 9 days on my dual-core 2 GHz laptop; that's about 4 articles per second. The German dump contains 1.3M triples and took 3 hours (German has only 300k articles, whereas English has 3M, and the German config is smaller). We know that there is some "noise" in the data (incorrectly parsed data); we will fix that in the coming weeks.

The source code is in the "wiktionary" branch of the DBpedia SVN repository. The dump files can be downloaded here:
http://downloads.dbpedia.org/wiktionary/wiktionary.dbpedia.org_en.nt.bz2
http://downloads.dbpedia.org/wiktionary/wiktionary.dbpedia.org_de.nt.bz2
The idea is somewhat different from DBpedia (although we use the framework): instead of infoboxes and very specific extractors, we tried to build a meta-extractor that is declarative rather than imperative. The rationale is that, although some scrapers exist, none of them can parse more than 3-5 languages. So we encode the language-specific characteristics of each Wiktionary in a machine-readable format (e.g. the "config-de.xml"). Top-down, these properties are:
* The Entry Layout (EL). In the German Wiktionary, for example, a given page has the structure:
  Lexical Entity -> languages it occurs in -> part of speech it is used as -> different senses/meanings -> properties like synonyms, example sentences, pronunciation, translations, etc.
  In the English Wiktionary there is an etymology section after the language. This structure implies the schema for our extracted ontology (an ER diagram). More about the EL:
  http://en.wiktionary.org/wiki/Wiktionary:Entry_layout_explained
  We configure the EL with nested XML nodes and define which URIs are to be used. The EL differs greatly between Wiktionaries. The question is: how do we ensure a common schema? Currently we leave that open - the schema of the resulting RDF is implicitly inferred from the PS. Either we come up with a good idea for transforming it automatically (and for configuring that transformation easily), or we leave the merging step to specialized tools. Which schema should be the global one? Lemon?
* Context-defining markers (CDM). A big word, but it just means the labels that occur within the EL. A made-up example:

    == green (Englisch) ==
    === Adjektiv ===
    [1] eine Farbe
    [2] umweltfreundlich

  In this wiki snippet there are two CDMs: "Englisch" and "Adjektiv". They indicate that the following section is about an English word and that its part of speech is adjective. Obviously we need a mapping from these labels to a shared vocabulary. Such a mapping is easy and is part of the configuration. It would be nice, though, to back this vocabulary with an ontology: some ontology for PoS (GOLD?) and for language families (ISO 639-3?) - we should discuss what to use there.
* Wiki templates. Now we come to the core of the extraction. We built an engine that can match a given Wiktionary page against several "extraction templates" (ET). An ET is wiki syntax, but it can contain placeholders/variables and control symbols that indicate that parts may repeat (like the regex "(ab)*" matches "ababab"). The engine then fills the placeholders with information scraped from the page (in other words, it binds the variables). The configuration declares what to do with the bound variables. Often that is "use it as the literal object of predicate x", but we imagine more complex transformations there, like "format to URI by ..." or "call a static method y".

  An example: we take the wiki snippet from above (green) as the page and define an ET like this:

    <template name="definitons">
      <vars>
        <var name="definition" property="rdfs:comment" />
      </vars>
      <wikiSyntax>([$id] $definition )*</wikiSyntax>
    </template>

  The syntax looks like regular expressions, but we only allow ()*, ()+ and ()?. You will also notice the variables/placeholders: the extractor determines what stands in their place on the actual page. The engine finds a set of bindings:

    definition -> "eine Farbe"
    definition -> "umweltfreundlich"

  and then generates triples according to the config:

    wiktionary:green-english-adjective rdfs:comment "eine Farbe" .
    wiktionary:green-english-adjective rdfs:comment "umweltfreundlich" .
The properties used (rdfs:comment is just a made-up example) and the namespaces are open to discussion.
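To make the template matching more concrete, here is a rough Python sketch of the idea. It is not our Scala implementation, only an illustration under simplifying assumptions: it supports a single (...)*, (...)+ or (...)? group, treats every repetition as one line of the page, and does not require the repetitions to be adjacent. The function names template_to_regex and extract_triples are made up for this mail.

```python
import re


def template_to_regex(wiki_syntax):
    """Translate one '(...)*' style extraction template into a line regex.

    Assumption of this sketch: each repetition of the group is one page line.
    Returns the compiled regex and the quantifier character.
    """
    group = re.fullmatch(r"\((.*)\)([*+?])", wiki_syntax.strip(), re.S)
    if group is None:
        raise ValueError("sketch supports only a single (...)*, (...)+ or (...)? group")
    inner, quantifier = group.group(1).strip(), group.group(2)

    pieces = []
    # Splitting on "$name" alternates literal text (even index) and variables (odd index).
    for index, part in enumerate(re.split(r"\$(\w+)", inner)):
        if index % 2 == 1:
            pieces.append(f"(?P<{part}>.+?)")        # a variable binds lazily
        else:
            for token in re.split(r"(\s+)", part):
                if token.isspace():
                    pieces.append(r"[ \t]*")         # template blanks match flexibly
                elif token:
                    pieces.append(re.escape(token))  # wiki syntax is matched literally
    pattern = "^" + "".join(pieces) + r"[ \t]*$"
    return re.compile(pattern, re.M), quantifier


def extract_triples(page, subject, wiki_syntax, var_properties):
    """Bind the template variables on the page and emit (s, p, o) triples."""
    regex, quantifier = template_to_regex(wiki_syntax)
    bindings = [match.groupdict() for match in regex.finditer(page)]
    if quantifier == "?":
        bindings = bindings[:1]                      # at most one repetition
    if quantifier == "+" and not bindings:
        return []                                    # template does not apply
    return [
        (subject, prop, binding[var])
        for binding in bindings
        for var, prop in var_properties.items()
    ]


PAGE = """== green (Englisch) ==
=== Adjektiv ===
[1] eine Farbe
[2] umweltfreundlich"""

triples = extract_triples(
    PAGE,
    "wiktionary:green-english-adjective",
    "([$id] $definition )*",
    {"definition": "rdfs:comment"},  # only variables with a property yield triples
)
for s, p, o in triples:
    print(f'{s} {p} "{o}" .')
```

Run on the "green" snippet, this yields the two rdfs:comment triples shown above. The $id variable is bound as well but dropped, because the config assigns it no property.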
Our prototype recognizes the EL and thus provides information about the languages and PoS usages of all words in the Wiktionary. It also has ETs for definitions, hyphenation and example sentences.
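As an illustration of how the language and PoS labels (the CDMs) can be resolved against a shared vocabulary, here is a small Python sketch. Again this is not the actual extractor: the heading patterns only cover the made-up "green" example, and the names CDM_MAPPING, resolve_context and subject_uri as well as the target terms "english" and "adjective" are assumptions chosen to match the example URI used earlier.

```python
import re

# Hypothetical mapping from labels of the German Wiktionary to shared terms.
# In the real setup this would live in the per-language configuration.
CDM_MAPPING = {
    "Englisch": "english",
    "Adjektiv": "adjective",
}

# Heading shapes of the made-up example page, not the full entry layout.
LANGUAGE_HEADING = re.compile(r"^==\s*(?P<word>.+?)\s*\((?P<language>[^)]+)\)\s*==\s*$")
POS_HEADING = re.compile(r"^===\s*(?P<pos>.+?)\s*===\s*$")


def resolve_context(page, mapping=CDM_MAPPING):
    """Walk the page top-down and collect the context given by its markers.

    Unmapped labels raise KeyError, so gaps in the mapping become visible.
    """
    context = {}
    for line in page.splitlines():
        line = line.strip()
        pos = POS_HEADING.match(line)
        language = LANGUAGE_HEADING.match(line)
        if pos:
            context["pos"] = mapping[pos.group("pos")]
        elif language:
            context["word"] = language.group("word")
            context["language"] = mapping[language.group("language")]
    return context


def subject_uri(context):
    """Build the subject in the style of the example URI from this mail."""
    return f"wiktionary:{context['word']}-{context['language']}-{context['pos']}"


PAGE = """== green (Englisch) ==
=== Adjektiv ===
[1] eine Farbe
[2] umweltfreundlich"""

context = resolve_context(PAGE)
print(subject_uri(context))
```

Keeping the mapping as plain data is the point: supporting another Wiktionary should then only require a new table of labels, not new code.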
The next step will be either to expand to more languages or to first go deeper within the German and English Wiktionary: finding synonyms (to get a community-based WordNet) and translations.
So what do you think? What are important things to keep in mind? Any wishes, comments, etc.?
Regards, Jonas