I am very familiar with how English Wiktionary articles can be formatted (if not with the current templates for every language), and I have enjoyed playing with DBpedia in the past. Over the years I have also parsed Wiktionary data in various ways and in various languages.
But where can I find the simplest way to run the DBpedia tool that parses an English Wiktionary page? I'd like to learn it, which I believe will also serve as an excuse to learn some Scala.
I did look at the documentation I could find, but it's quite dense and seems to require deep knowledge of DBpedia concepts and jargon. Is there an easy way in?
Andrew Dunbar (hippietrail)
On 24 May 2012 05:54, Lars Aronsson lars@aronsson.se wrote:
On 2012-05-23 19:10, Christoph Lauer wrote:
The template I wrote was for the English Wiktionary. I'm not sure what you mean by source format; the entry layouts follow the XML standard as described here: http://wiktionary.dbpedia.org/ (just to make sure we're not talking at cross purposes ;-) ).
It is indeed confusing that the DBpedia webpage you link to points to this mailing list. It would be really helpful if Jonas Brekle would edit that page to include an introduction on what Wiktionary is (www.wiktionary.org and associated wiki sites in many languages, a project of the Wikimedia Foundation), and explain that his DBpedia project (wiktionary.dbpedia.org) is something else.
The formats delivered by Wiktionary are the live wiki sites and the XML database dumps that you get from http://dumps.wikimedia.org/backup-index.html
Somebody (Jonas?) at DBpedia probably uses the XML dump (?) and transforms that into something that is your source format. I'm not familiar with that transform. I only know Wiktionary.
Wiktionary, like any wiki, is created by many individuals for the instant reward of seeing the result. The sometimes inconsistent use of different wiki templates does not matter, as long as we only care for the human-readable HTML that the wiki shows. For example, instead of the line
# {{sv-adj-form-abs-indef-n|ovedersäglig}}
I could have written in plain wiki text
# ''absolute indefinite neuter form of'' '''[[ovedersäglig]]''' [[Category:Swedish adjective forms]]
which produces exactly the same HTML output, even though it would be near impossible to parse for DBpedia.
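To illustrate why the templated line is so much easier for a machine than the plain wiki text, here is a minimal Python sketch. It is not how the DBpedia extractor works (I don't know its internals); the regular expression and the function name find_templates are my own assumptions, and the pattern only handles simple, non-nested template calls.

```python
import re

# Assumption: matches one non-nested template call such as {{name|arg1|arg2}}.
TEMPLATE_RE = re.compile(r"\{\{([^{}|]+)((?:\|[^{}|]*)*)\}\}")

def find_templates(wikitext):
    """Return a list of (template_name, [arguments]) found in wikitext."""
    results = []
    for match in TEMPLATE_RE.finditer(wikitext):
        name = match.group(1).strip()
        raw_args = match.group(2)
        # raw_args looks like "|a|b", so the first split element is empty.
        args = raw_args.split("|")[1:] if raw_args else []
        results.append((name, [arg.strip() for arg in args]))
    return results

templated = "# {{sv-adj-form-abs-indef-n|ovedersäglig}}"
plain = ("# ''absolute indefinite neuter form of'' '''[[ovedersäglig]]''' "
         "[[Category:Swedish adjective forms]]")

print(find_templates(templated))  # template name plus the base form
print(find_templates(plain))      # nothing structured to extract
```

The templated line yields a name and an argument directly, while the plain line gives the parser nothing to hold on to except free text.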
If you (Jonas) want to extract useful structured data, you need to show that result to the people who edit the wiki, so they can understand where they used the wrong wiki templates or formats. If you parse the XML dumps and find ==Swedish== without any of the proper Swedish form-of templates or declension/conjugation templates, something is probably wrong, and needs fixing.
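The kind of check I have in mind could be sketched roughly as follows in Python. This is only a heuristic under stated assumptions: the function names are hypothetical, and I assume that the proper Swedish templates can be recognised by the prefix "sv-" (as in sv-adj-form-abs-indef-n), which will not hold for every template.

```python
import re

def language_section(wikitext, language):
    """Return the text under the level-2 heading ==language==, or None."""
    lines = []
    found = False
    inside = False
    for line in wikitext.splitlines():
        # Level-2 headings only; ===Adjective=== and deeper do not match.
        heading = re.fullmatch(r"==([^=].*?)==\s*", line)
        if heading:
            inside = heading.group(1).strip() == language
            found = found or inside
            continue
        if inside:
            lines.append(line)
    return "\n".join(lines) if found else None

def needs_review(wikitext, language, template_prefix):
    """True if the language section exists but uses no template with the prefix."""
    section = language_section(wikitext, language)
    if section is None:
        return False  # no such section, nothing to flag
    return ("{{" + template_prefix) not in section

good = "==Swedish==\n===Adjective===\n# {{sv-adj-form-abs-indef-n|ovedersäglig}}\n"
suspect = ("==Swedish==\n===Adjective===\n"
           "# ''absolute indefinite neuter form of'' '''[[ovedersäglig]]'''\n")

print(needs_review(good, "Swedish", "sv-"))
print(needs_review(suspect, "Swedish", "sv-"))
```

A list of the flagged pages, shown to the editors of the wiki, would be the useful result.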
Interesting that the English subcategory is practically empty whereas the http://wiktionary.dbpedia.org/page/took. There's no reference to the base form, so I would like to add it. That's what it's all about ;-)
The English Wiktionary's entry "took" contains the line # {{simple past of|take}} where lang=en is the default parameter.
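As a rough Python sketch of how the base form could be read from that line: the function name parse_simple_past and the returned dictionary layout are hypothetical, and the only facts taken from the wiki are the template name and that lang defaults to en.

```python
import re

# Assumption: the template call is not nested inside another template.
SIMPLE_PAST_RE = re.compile(r"\{\{simple past of\|([^{}]*)\}\}")

def parse_simple_past(line, default_lang="en"):
    """Return the base form and language of a 'simple past of' line, or None."""
    match = SIMPLE_PAST_RE.search(line)
    if match is None:
        return None
    base = None
    lang = default_lang  # lang=en applies when no lang parameter is given
    for part in match.group(1).split("|"):
        key, sep, value = part.partition("=")
        if sep and key.strip() == "lang":
            lang = value.strip()
        elif not sep and base is None:
            base = part.strip()  # first positional argument is the base form
    return {"form": "simple past", "base": base, "lang": lang}

print(parse_simple_past("# {{simple past of|take}}"))
```

That gives the link from "took" back to "take" without any guessing from free text.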
-- Lars Aronsson (lars@aronsson.se) Aronsson Datateknik - http://aronsson.se
Wiktionary-l mailing list Wiktionary-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiktionary-l