Hi Cedric,
We are currently working on an extractor for Wiktionary that reuses and
extends the DBpedia framework [1]. In my opinion there is no best
practice yet; it is a work in progress, and it is not trivial. If you
want an extractor for many languages, a straightforward regex approach
will fail at the second or third language you want to include, because
of heterogeneous syntax and modeling. So we are trying to build a
declarative parser that interprets a rather complex config file
containing little "templates", which define which elements of a
Wiktionary page should be interpreted and how they should be processed.
So far we have made configs for German and English. The data we extract
is the "entry layout" (the language, etymology and part of speech:
everything that's in the "outline" boxes), and for each of these
"contexts" we extract the definition sentence. Of course we aim to
extract many more properties later.
You can check out our current state from our Mercurial repository:
hg clone http://dbpedia.hg.sourceforge.net/hgroot/dbpedia/extraction_framework dbpedia
cd dbpedia
hg update wiktionary
cd core
mvn install
cd ../dump
mvn install
cd ../wiktionary
mkdir wiktionaryDump
Copy the enwiktionary-???-pages-articles.xml file from [2] into that new
folder. The language of the dump should be the one set in config.xml,
and a matching config-[language].xml needs to exist. Then run:
mvn scala:run
The extraction outputs RDF data (N-Triples format, which can easily be
transformed to XML). Unfortunately there is no comprehensive
documentation yet and we are pre-beta, but we would like to get feedback
and/or requirements. In a few days I will send an official announcement
to our mailing list [3], containing more details and dump files for en
and de.
But depending on your use case, this could be over-complicated for your
needs, and you would depend on us... If you only need one language,
another idea we had (but have not implemented) could be practicable:
Regex-replace everything that has a special semantic meaning with XML
nodes. Then apply a set of rules that order this flat sequence of nodes
hierarchically. Finally, you can iterate over the XML tree and extract
what you want using XPath or an XML API.
An example (in pseudo-Scala, not compiling):
val page = """==English==
===Noun===
* Something that is...
===Verb===
* to be..."""
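Now we apply some regexes like
import scala.util.matching.Regex
// a heading line: the number of '=' signs gives the section level
var pageXMLFlat = new Regex("(?m)^(=+)(.+?)=+$").replaceAllIn(page, m =>
  s"""<section level="${m.group(1).length}" title="${m.group(2)}" />""")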
...
And we get:
<section level="2" title="English" />
<section level="3" title="Noun" />
<indent/><text content="Something that is..."/><linebreak/>
<section level="3" title="Verb" />
<indent/><text content="to be..."/><linebreak/>
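To make the flattening step a bit more concrete, here is a minimal
sketch in plain Scala. It is not taken from our framework; the names
FlatTokens, esc, Heading and Item are made up for illustration, and it
only knows about headings and "*" list items. It works line by line
instead of running several replaceAllIn passes, which keeps the rules
from interfering with each other.

```scala
import scala.util.matching.Regex

object FlatTokens {
  // Hypothetical helper: escape characters that are unsafe in an XML attribute.
  def esc(s: String): String =
    s.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;")

  // A heading line such as "===Noun===": the run of '=' gives the level.
  val Heading: Regex = """^(=+)\s*(.+?)\s*=+$""".r
  // A list item such as "* Something that is...".
  val Item: Regex = """^\*\s*(.*)$""".r

  // Turn wikitext into a flat sequence of XML-like node strings, one per line.
  def flatten(page: String): List[String] =
    page.linesIterator.map(_.trim).filter(_.nonEmpty).map {
      case Heading(eq, title) =>
        s"""<section level="${eq.length}" title="${esc(title)}" />"""
      case Item(text) =>
        s"""<indent/><text content="${esc(text)}"/><linebreak/>"""
      case other =>
        // Anything unrecognized is kept as plain text.
        s"""<text content="${esc(other)}"/><linebreak/>"""
    }.toList
}
```

Calling FlatTokens.flatten(page).foreach(println) on the example page
prints one node per input line. Every additional piece of syntax
(templates, links, translation tables) would need its own rule, and that
is where the per-language effort goes.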
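Then we try to bring in some hierarchy with a heuristic:
// still pseudo-code: assumes every node has a level (non-section nodes
// count as deeper than any section) and a mutable addChildren
val nodes = XML.loadString("<page>" + pageXMLFlat + "</page>").child
val stack = new Stack(nodes)
while (stack.nonEmpty) {
  val n = stack.pop
  // everything up to the next node on the same or a higher level belongs to n
  val sub = stack.takeWhile(o => o.level > n.level) // well, you get the idea
  n addChildren sub
}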
And we get something like:
<section level="2" title="English">
  <section level="3" title="Noun">
    <line><indent/><text content="Something that is..."/></line>
  </section>
  <section level="3" title="Verb">
    <line><indent/><text content="to be..."/></line>
  </section>
</section>
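The hierarchy heuristic and the final extraction could look roughly
like the following sketch. Again, this is an assumption about how one
might do it, not code from our repository: it uses small case classes
instead of a real XML library, and Nesting, Flat, Section, nest and
definitions are illustrative names. The rule it applies is that a
section owns the text lines directly after it and every following
section with a strictly greater level.

```scala
object Nesting {
  // Flat input: headings carry a level, text lines do not.
  sealed trait Flat
  case class Heading(level: Int, title: String) extends Flat
  case class Line(text: String) extends Flat

  // Nested output.
  case class Section(level: Int, title: String,
                     lines: List[String], children: List[Section])

  // Recursive descent over the flat list. Text before the first heading is dropped.
  def nest(nodes: List[Flat]): List[Section] = {
    def sections(rest: List[Flat], minLevel: Int): (List[Section], List[Flat]) =
      rest match {
        case Heading(level, title) :: tail if level >= minLevel =>
          val (lines, afterLines) = tail.span(_.isInstanceOf[Line])
          val (children, afterChildren) = sections(afterLines, level + 1)
          val (siblings, remaining) = sections(afterChildren, minLevel)
          val sec = Section(level, title, lines.collect { case Line(t) => t }, children)
          (sec :: siblings, remaining)
        case _ => (Nil, rest)
      }
    sections(nodes.dropWhile(_.isInstanceOf[Line]), 1)._1
  }

  // Walk the tree and pair every text line with the path of section titles above it.
  def definitions(secs: List[Section],
                  path: List[String] = Nil): List[(List[String], String)] =
    secs.flatMap { s =>
      val here = path :+ s.title
      s.lines.map(l => (here, l)) ++ definitions(s.children, here)
    }
}
```

For the example page, definitions(nest(...)) yields the pair
(List(English, Noun), "Something that is...") followed by
(List(English, Verb), "to be..."), which is the "context plus definition
sentence" shape described above. With a real XML tree the same walk
would be an XPath query instead.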
As long as the structure of the page is stable (within one language it
mostly is), you can work with this XML. Depending on how deep you go
with the replacements (even replacing commas etc., e.g. for the list of
synonyms), you could get a pretty detailed representation of the page.
We would also be interested in that.
Regards,
Jonas
[1] http://dbpedia.org/About
[2] http://dumps.wikimedia.org/backup-index.html
[3] open-linguistics(a)lists.okfn.org
Cedric De Vroey <cedric.devroey(a)gmail.com> wrote:
Hi Guys,
I'd like to integrate and cache Wiktionary in an application I'm
developing, but I'm having this problem: how can I retrieve content from
Wiktionary in a structured format, or translate it to a structured
format (such as XML, CSV, JSON, ...)? I have already looked into the
Special:Export page, but that didn't really help me because the actual
content is still all in one field. Are there any known best practices
for doing this?
Thanks!
Cedric
______________________________________________________________
Wiktionary-l mailing list
Wiktionary-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiktionary-l