Dear all,
Over the past months, Sebastian and I have worked on a DBpedia-based
extractor for Wiktionary. The main goal was to make it so configurable
that applying it to the different language versions of Wiktionary is
just a matter of configuration, not programming. The configuration
should be doable by someone who has a good understanding of the wiki
syntax (and currently XML, but we plan to hide that too, via a
web-based frontend similar to the mappings wiki) but not Scala or RDF.
We now have configs and dumps for the English and German Wiktionary, to
show the state of our development and to initiate a discussion about
design and implementation. If you are not interested in the technical
details, you may skip the description below and just evaluate the dumps.
The English dump contains 16M triples and took 9 days on my dual-core
2 GHz laptop; that's about 4 articles per second. The German dump
contains 1.3M triples and took 3 hours (German has only 300k articles,
whereas English has 3M, and the German config is smaller). We know that
there is some "noise" in the data (incorrectly parsed data); we will
fix that in the next weeks.
The source code is in the "wiktionary" branch of the DBpedia SVN repo;
the dump files can be downloaded here:
http://downloads.dbpedia.org/wiktionary/wiktionary.dbpedia.org_en.nt.bz2
http://downloads.dbpedia.org/wiktionary/wiktionary.dbpedia.org_de.nt.bz2
The idea is somewhat different from DBpedia (although we use the
framework): instead of infoboxes and very specific extractors, we tried
to build a meta-extractor that is declarative rather than imperative.
The rationale is that although some scrapers exist, none of them can
parse more than 3-5 languages. So we encode the language-specific
characteristics of each Wiktionary in a machine-readable format (e.g.
the "config-de.xml").
Top-down, these properties are:
* the Entry Layout (EL)
e.g. in the German Wiktionary, a given page has the structure: lexical
entity -> languages it occurs in -> part of speech it is used as ->
different senses/meanings -> properties like synonyms, example
sentences, pronunciation, translations, etc.
In the English Wiktionary there is an etymology section after the
language. This structure implies the schema for our extracted ontology
(an ER diagram).
More about the EL: http://en.wiktionary.org/wiki/Wiktionary:Entry_layout_explained
We configure the EL with nested XML nodes and define which URIs are to
be used. The EL differs greatly between Wiktionaries. The question is:
how do we ensure a common schema? Currently we leave that open - the
schema of the resulting RDF is implicitly inferred from the EL. Either
we come up with a good idea on how to transform it automatically (how
to configure it to auto-transform easily), or we leave that merging
step to specialized tools. Which schema should be the global one? Lemon?
* Context-defining markers (CDM)
A big word, but it just means a label occurring within the EL. A
made-up example:
== green (Englisch) ==
=== Adjektiv ===
[1] eine Farbe
[2] umweltfreundlich
In this wiki snippet there are two CDMs: "Englisch" and "Adjektiv". They
indicate that the following section is about an English word whose part
of speech is adjective. Obviously we need a mapping from them to a
shared vocabulary. Such a mapping is easy and part of the configuration.
But a nice thing to have would be an ontology backing this vocabulary:
some ontology of PoS (GOLD?) and of languages (ISO 639-3?) - we should
discuss what to use there.
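To make the idea concrete, such a CDM mapping could look like the
following Python sketch. The labels come from the example above; the
target URIs (Lexvo for ISO 639-3 languages, GOLD for parts of speech)
are illustrative assumptions, not a settled choice:

```python
# Hypothetical mapping from language-specific CDM labels (here: German
# Wiktionary section headings) to URIs of a shared vocabulary.
# The Lexvo and GOLD URIs are illustrative, not a final decision.
CDM_MAPPING = {
    # language markers -> ISO 639-3 based language URIs (via Lexvo)
    "Englisch": "http://lexvo.org/id/iso639-3/eng",
    "Deutsch": "http://lexvo.org/id/iso639-3/deu",
    # part-of-speech markers -> GOLD ontology classes
    "Adjektiv": "http://purl.org/linguistics/gold/Adjective",
    "Substantiv": "http://purl.org/linguistics/gold/Noun",
}

def map_cdm(label):
    """Resolve a context-defining marker to its shared-vocabulary URI."""
    return CDM_MAPPING[label]
```

In the real extractor this mapping would of course live in the
per-language XML configuration, not in code.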
* wiki templates
Now we come to the core of the extraction. We built an engine that can
match a given Wiktionary page against several "extraction templates"
(ETs). An ET is wiki syntax, but it can contain placeholders/variables
and control symbols that indicate the possible repetition of parts
(like the regex "(ab)*" matches "ababab"). The engine then fills the
placeholders with information scraped from the page (in other words, it
binds the variables). The configuration declares what to do with the
bound variables; often that is "use it as the literal object of
predicate x", but we envision more complex transformations there, like
"format to a URI by ..." or "call a static method y".
An example: we take the wiki snippet from above (green) as the page and
define an ET like this:
<template name="definitions">
  <vars>
    <var name="definition" property="rdfs:comment" />
  </vars>
  <wikiSyntax>([$id] $definition
)*
  </wikiSyntax>
</template>
The syntax looks like regular expressions, but we only allow ()*, ()+
and ()?. Then you will notice the variables/placeholders: the extractor
determines what's on the actual page in their place.
The engine finds a set of bindings:
definition -> "eine Farbe"
definition -> "umweltfreundlich"
and then generates triples according to the config:
wiktionary:green-english-adjective rdfs:comment "eine Farbe" .
wiktionary:green-english-adjective rdfs:comment "umweltfreundlich" .
The properties used (rdfs:comment is just a made-up example) and the
namespaces are open to discussion.
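As a toy illustration (a Python stand-in, not our actual Scala engine),
the matching-and-binding step for this ET can be sketched by translating
the repetition ()* into a repeated regex match with named groups for
the placeholders:

```python
import re

# The page body from the example above (the two numbered senses).
page = "[1] eine Farbe\n[2] umweltfreundlich\n"

# The ET "([$id] $definition\n)*" translated into a regex; the named
# groups stand in for the $id and $definition placeholders.
et = re.compile(r"\[(?P<id>\d+)\] (?P<definition>.+)\n")

def extract(page, subject, prop):
    """Bind $definition for every repetition and emit one triple each."""
    return [(subject, prop, m.group("definition"))
            for m in et.finditer(page)]

triples = extract(page, "wiktionary:green-english-adjective", "rdfs:comment")
# one triple per binding of $definition
```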
Our prototype recognizes the EL and thus gives information about the
languages and PoS usages of all words in the Wiktionary, and it has ETs
for definitions, hyphenation and example sentences. The next steps will
be either expanding to more languages or first going deeper within the
German and English Wiktionary: extracting synonyms (to get a
community-built WordNet) and translations.
So what do you think? What are important things to keep in mind, wishes,
comments etc?
Regards,
Jonas
Hi Cedric,
I forgot to mention two existing tools (duh):
Russian and English:
http://code.google.com/p/wikokit/
German and English, well researched, available for research purposes:
http://www.ukp.tu-darmstadt.de/software/jwktl/
Regards,
Jonas
> Cedric De Vroey <cedric.devroey(a)gmail.com> wrote:
> Hi Guys,
>
> I'd like to integrate and cache wiktionary in an application I'm developing
> but I'm having this problem: How can I retrieve content from wiktionary in
> a structured format or translate it to a structured format (with structured
> format like XML, CSV, Json,...)? I have already looked into the
> Special:Export page but that didn't really helped me cause the actual
> content is still all in one field. Are there any known best-practices to do
> this?
>
> Thanks!
> Cedric
>
> ______________________________________________________________
>
> Wiktionary-l mailing list
> Wiktionary-l(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wiktionary-l
Hi Cedric,
we are currently working on an extractor for Wiktionary,
reusing/extending the DBpedia framework [1]. In my opinion there is no
best practice yet; it's a work in progress, and it's not trivial: if
you want an extractor for many languages, a straightforward regex
approach will fail at the second or third language you want to include,
because of heterogeneous syntax and modeling. So we are building a
declarative parser that interprets a rather complex config file,
containing little "templates" that define which element of a Wiktionary
page should be interpreted and processed in which way. So far we have
made configs for German and English, and the data we extract is the
"entry layout" (the language, etymology and part of speech - everything
that's in the "outline" boxes); for each of these found "contexts", we
extract the definition sentence. Of course we aim to extract many more
properties later.
You can check out our current state from our Mercurial repo:
hg clone http://dbpedia.hg.sourceforge.net/hgroot/dbpedia/extraction_framework dbpedia
cd dbpedia
hg update wiktionary
cd core
mvn install
cd ../dump
mvn install
cd ../wiktionary
mkdir wiktionaryDump
Copy the enwiktionary-???-pages-articles.xml file from [2] into that
new folder; the language should be the one set in config.xml, and a
config-[language].xml needs to exist. Then run:
mvn scala:run
The extraction outputs RDF data (N-Triples format, which can be
transformed to XML easily). Unfortunately there is no comprehensive
documentation yet, and we are pre-beta. But we would like to get
feedback and/or requirements. In a few days I will send an official
announcement to our mailing list [3], containing more details and dump
files for en and de.
But depending on your use case, this could be over-complicated for your
needs, and you would depend on us... If you only need one language,
another idea we had (but have not implemented) could be practicable:
regex-replace everything that has a special semantics with XML nodes.
Then apply a set of rules that hierarchically order this flat sequence
of nodes. Finally, you can iterate over the XML tree and extract what
you want using XPath or an XML API.
An example (in pseudo-Scala, not compiling):
val page = """==English==
===Noun===
* Something that is...
===Verb===
* to be..."""
Now we apply some regexes like
var pageXMLFlat = new Regex("==(.*?)==").replaceAllIn(page, m =>
  "<section level='2' title='" + m.group(1) + "' />")
...
and we get
<section level="2" title="English" />
<section level="3" title="Noun" />
<indent/><text content="Something that is..."/><linebreak/>
<section level="3" title="Verb" />
<indent/><text content="to be..."/><linebreak/>
Then we try to bring in some hierarchy heuristic:
val nodes = XML.fromString(pageXMLFlat)
val stack = new Stack(nodes)
while (stack.size > 0) {
  val n = stack.pop
  val sub = stack.takeWhile(o => n.level != o.level) // well, you get the idea
  n addChildren sub
}
and we get something like
<section level="2" title="English">
<section level="3" title="Noun">
<line><indent/><text content="Something that is..."/></line>
</section>
<section level="3" title="Verb">
<line><indent/><text content="to be..."/></line>
</section>
</section>
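The stack heuristic can be made concrete; here is a runnable Python
sketch (illustrative only, not our actual code) that nests a flat list
of (level, title) sections by heading level:

```python
# Nest a flat sequence of (level, title) sections by heading level,
# mirroring the stack-based heuristic sketched above: each section
# becomes a child of the nearest preceding shallower section.
def nest(sections):
    root = {"level": 1, "title": None, "children": []}
    stack = [root]
    for level, title in sections:
        node = {"level": level, "title": title, "children": []}
        # pop back to the nearest section shallower than this one
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root["children"]

flat = [(2, "English"), (3, "Noun"), (3, "Verb")]
tree = nest(flat)
# "Noun" and "Verb" end up as children of "English"
```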
As long as the structure of the page is stable (and within one language
it mostly is), you can work with this XML... Depending on how deep you
go with the replacements (replacing even commas etc., e.g. for the list
of synonyms), you could get a pretty detailed representation of the
page. We would also be interested in that.
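For the last step, the standard library is enough; a small Python
sketch querying the nested XML from the example above with ElementTree:

```python
import xml.etree.ElementTree as ET

# The nested XML result from the example above.
doc = ET.fromstring("""
<section level="2" title="English">
  <section level="3" title="Noun">
    <line><indent/><text content="Something that is..."/></line>
  </section>
  <section level="3" title="Verb">
    <line><indent/><text content="to be..."/></line>
  </section>
</section>
""")

# Pull all definition texts under the Noun section.
noun = doc.find(".//section[@title='Noun']")
noun_defs = [t.get("content") for t in noun.findall(".//text")]
# -> ['Something that is...']
```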
Regards,
Jonas
[1] http://dbpedia.org/About
[2] http://dumps.wikimedia.org/backup-index.html
[3] open-linguistics(a)lists.okfn.org
Hi Guys,
I'd like to integrate and cache Wiktionary in an application I'm
developing, but I'm having this problem: how can I retrieve content
from Wiktionary in a structured format, or translate it to a structured
format (structured meaning XML, CSV, JSON, ...)? I have already looked
into the Special:Export page, but that didn't really help me, because
the actual content is still all in one field. Are there any known best
practices for this?
Thanks!
Cedric