Hello, I'm working on the Entry Layouts, which extract Wiktionary data into the DBpedia framework. The first thing I'm interested in is the link to the base form of an inflected verb/adjective. Although the idea is the same for adjectives and verbs, it would be worth discussing whether we need separate templates and separate DBpedia terms for every form (e.g. comparative/superlative forms of adjectives; past participle, simple past, etc. for verbs). I wrote a template for the superlative form; please tell me if I got it right. I think it would belong in the "pos" block (<block name="pos" property="http://wiktionary.dbpedia.org/terms/hasPoSUsage">).
<template name="lemma-superlative">
  <resultTemplates>
    <resultTemplate>
      <triples>
        <triple s="$block" p="http://wiktionary.dbpedia.org/terms/hasLemma" o="http://wiktionary.dbpedia.org/resource/uri($lemma)" oType="URI"/>
      </triples>
    </resultTemplate>
  </resultTemplates>
  <wikiTemplate># {{superlative of|$lemma}}</wikiTemplate>
</template>
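To make the intent of the template concrete, here is a minimal Python sketch of how a wikiTemplate pattern with a `$lemma` placeholder could be matched against one line of wikitext and turned into a triple. This is not the DBpedia extraction framework's code: the helper names (`pattern_to_regex`, `lemma_triple`) are made up for illustration, the subject is passed in as a plain string, and `uri()` from the layout is only approximated by percent-encoding.

```python
import re
from urllib.parse import quote

HAS_LEMMA = "http://wiktionary.dbpedia.org/terms/hasLemma"
RESOURCE = "http://wiktionary.dbpedia.org/resource/"


def pattern_to_regex(wiki_template):
    """Turn a pattern such as '# {{superlative of|$lemma}}' into a regex.

    Each $name placeholder becomes a named group; all other text is literal.
    """
    parts = re.split(r"\$(\w+)", wiki_template)
    pieces = []
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pieces.append(re.escape(part))            # literal text
        else:
            pieces.append("(?P<%s>[^|{}]+)" % part)   # placeholder name
    return re.compile("^" + "".join(pieces) + "$")


def lemma_triple(block, wiki_template, line):
    """Return (subject, predicate, object) or None if the line does not match."""
    m = pattern_to_regex(wiki_template).match(line.strip())
    if m is None:
        return None
    lemma = m.group("lemma").strip()
    # Assumption: uri() behaves like percent-encoding with '_' for spaces.
    return (block, HAS_LEMMA, RESOURCE + quote(lemma.replace(" ", "_")))
```

In this sketch a line such as `# {{superlative of|big}}` yields a hasLemma triple pointing at the resource for "big", while a line with extra parameters or free text yields nothing.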
On 2012-05-23 17:36, Christoph Lauer wrote:
I'm working on the Entry Layouts, which extract Wiktionary data into the DBpedia framework. The first thing I'm interested in is the link to the base form of an inflected verb/adjective.
Which language of Wiktionary, and what is your source format? In the English Wiktionary, the category tree under http://en.wiktionary.org/wiki/Category:Form-of_templates_by_language will guide you to wiki templates used to express that an entry is an inflected form of a base word.
For example, two levels down, you will find http://en.wiktionary.org/wiki/Category:Swedish_form-of_templates and http://en.wiktionary.org/wiki/Template:sv-adj-form-abs-indef-n which is used in the entry http://en.wiktionary.org/wiki/oveders%C3%A4gligt to specify that this word is a form of a Swedish adjective. The base word is the first and only parameter.
In the statistics for the English Wiktionary, http://en.wiktionary.org/wiki/Wiktionary:Statistics, you can see how many entries are "form-of definitions" for each language; for example, there are 79,966 Swedish form-of definitions in the English Wiktionary.
For the analysis, you may want to consult the person who updates that statistics page, Conrad Irwin, http://en.wiktionary.org/wiki/User:Conrad.Irwin
On average, each Swedish/Danish/Norwegian base word has 4-5 form variants, which is higher than Dutch and German, but lower than Finnish.
Within the English Wiktionary, each language has its own small subcommunity that might organize things a little differently. For example, there is no category for Danish form-of templates; I don't know why. The Dutch have only one template for adjective forms, using a parameter to say which form it is, while the Swedish use one template for each form, taking only one parameter.
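For an extractor, the two conventions mean that the form label comes either from the template name or from a parameter. A rough Python sketch of handling both follows; the Swedish prefix is taken from the template named above, but `xx-adj-form-of` and its parameter order are purely hypothetical stand-ins for the single-template style, not actual Dutch template details.

```python
def base_and_form(name, args):
    """Return (base word, form label) for two form-of template styles."""
    per_form_prefix = "sv-adj-form-"       # style 1: the form is part of the name
    if name.startswith(per_form_prefix):
        return args[0], name[len(per_form_prefix):]
    if name == "xx-adj-form-of":           # style 2: hypothetical template name
        form, base = args[0], args[1]      # hypothetical parameter order
        return base, form
    return None                            # not a known form-of template
```

With this, `sv-adj-form-abs-indef-n` with the single argument `ovedersäglig` gives the base word plus the label `abs-indef-n`.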
On 23.05.2012 18:29, Lars Aronsson wrote:
Which language of Wiktionary, and what is your source format? In the English Wiktionary, the category tree under http://en.wiktionary.org/wiki/Category:Form-of_templates_by_language will guide you to wiki templates used to express that an entry is an inflected form of a base word.
The template I wrote was for the English Wiktionary. I'm not sure what you mean by source format; the entry layouts follow the XML standard as described here: http://wiktionary.dbpedia.org/ (just to make sure we're not talking at cross purposes ;-) ).
For example, two levels down, you will find http://en.wiktionary.org/wiki/Category:Swedish_form-of_templates and http://en.wiktionary.org/wiki/Template:sv-adj-form-abs-indef-n which is used in the entry http://en.wiktionary.org/wiki/oveders%C3%A4gligt to specify that this word is a form of a Swedish adjective. The base word is the first and only parameter.
Interesting that the English subcategory is practically empty whereas the Swedish subcategory has lots of information about templates. In the given word 'ovedersägligt' you can see that the wiki code for the 'Adjective' section is
===Adjective===
{{head|sv|adjective form}}
# {{sv-adj-form-abs-indef-n|ovedersäglig}}
With the template I want to catch the last line of that, to extract the link to the base form 'ovedersäglig'. DBpedia doesn't have a Swedish database (yet), but if you take for example the English word 'took', you can see the entry under http://wiktionary.dbpedia.org/page/took. There's no reference to the base form, so I would like to add it. That's what it's all about ;-)
Hi Christoph,
Unfortunately this is a bit more complicated: for example, in http://en.wiktionary.org/w/index.php?title=biggest&action=edit the "superlative of" template is within the definition list. It would be consumed and interpreted as the definition text. I plan to add an ability to recognize this too (a general node listener interface, so you can register for certain nodes), but not very soon. Your parse template seems good, but would not be triggered in this case. All in all, word forms were just not in our focus yet.
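The listener idea could be sketched roughly like this in Python. It only illustrates the concept and is not the planned interface: the registry, the `on_template` decorator and `scan_definitions` are hypothetical names, and a real implementation would work on parsed nodes rather than on raw lines with a regular expression.

```python
import re
from collections import defaultdict

# Matches a template call without nested templates, e.g. {{superlative of|big}}
TEMPLATE = re.compile(r"\{\{([^{}|]+)((?:\|[^{}|]*)*)\}\}")

listeners = defaultdict(list)  # template name -> list of callbacks


def on_template(name):
    """Register the decorated function as a listener for one template name."""
    def register(func):
        listeners[name].append(func)
        return func
    return register


def scan_definitions(wikitext):
    """Call registered listeners for templates found in '#' definition lines."""
    results = []
    for line in wikitext.splitlines():
        if not line.startswith("#"):
            continue  # only definition list items are of interest here
        for m in TEMPLATE.finditer(line):
            name = m.group(1).strip()
            args = m.group(2).split("|")[1:]  # drop the empty item before the first '|'
            for callback in listeners.get(name, []):
                results.append(callback(args))
    return results


@on_template("superlative of")
def superlative(args):
    # The first positional argument is taken to be the base form.
    return ("hasLemma", args[0])
```

The point of the design is that the definition text is still scanned as a whole, while individual templates inside it can be picked out by whoever registered for them.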
Regards, Jonas
Wiktionary-l mailing list Wiktionary-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiktionary-l
On 2012-05-23 19:10, Christoph Lauer wrote:
The template I wrote was for the English Wiktionary. I'm not sure what you mean by source format; the entry layouts follow the XML standard as described here: http://wiktionary.dbpedia.org/ (just to make sure we're not talking at cross purposes ;-) ).
It is indeed confusing that the DBpedia webpage you link to points to this mailing list. It would be really helpful if Jonas Brekle would edit that page to include an introduction on what Wiktionary is (www.wiktionary.org and associated wiki sites in many languages, a project of the Wikimedia Foundation), and explain that his DBpedia project (wiktionary.dbpedia.org) is something else.
The formats delivered by Wiktionary are the live wiki sites and the XML database dumps that you get from http://dumps.wikimedia.org/backup-index.html
Somebody (Jonas?) at DBpedia probably uses the XML dump (?) and transforms that into something that is your source format. I'm not familiar with that transform. I only know Wiktionary.
Wiktionary, like any wiki, is created by many individuals for the instant reward of seeing the result. The sometimes inconsistent use of different wiki templates does not matter, as long as we only care for the human-readable HTML that the wiki shows. For example, instead of the line
# {{sv-adj-form-abs-indef-n|ovedersäglig}}
I could have written in plain wiki text
# ''absolute indefinite neuter form of'' '''[[ovedersäglig]]''' [[Category:Swedish adjective forms]]
which produces exactly the same HTML output, even though it would be near impossible to parse for DBpedia.
If you (Jonas) want to extract useful structured data, you need to show that result to the people who edit the wiki, so they can understand where they used the wrong wiki templates or formats. If you parse the XML dumps and find ==Swedish== without any of the proper Swedish form-of templates or declension/conjugation templates, something is probably wrong, and needs fixing.
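A check of this kind might be sketched as follows in Python. The section splitting is deliberately naive, and the rule used here (a Swedish section should contain at least one template whose name starts with `sv-`) is an assumption made for the example, not an actual Wiktionary convention; a real check would use whatever list of templates the Swedish editors agree on.

```python
import re

LANGUAGE_HEADING = re.compile(r"^==([^=]+)==[ \t]*$", re.MULTILINE)
TEMPLATE_NAME = re.compile(r"\{\{\s*([^{}|]+?)\s*[|}]")


def language_sections(wikitext):
    """Split a page into {language: section text} on level-2 headings."""
    matches = list(LANGUAGE_HEADING.finditer(wikitext))
    sections = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(wikitext)
        sections[m.group(1).strip()] = wikitext[m.end():end]
    return sections


def missing_swedish_templates(wikitext, prefix="sv-"):
    """True if a Swedish section exists but uses no template named prefix*."""
    swedish = language_sections(wikitext).get("Swedish")
    if swedish is None:
        return False  # no Swedish section, nothing to check
    names = TEMPLATE_NAME.findall(swedish)
    return not any(name.startswith(prefix) for name in names)
```

Pages flagged this way could then be listed for the editors, which is the feedback loop described above.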
[...] if you take for example the English word 'took', you can see the entry under http://wiktionary.dbpedia.org/page/took. There's no reference to the base form, so I would like to add it. That's what it's all about ;-)
The English Wiktionary's entry "took" contains the line # {{simple past of|take}} where lang=en is the default parameter.
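An extractor has to fill in that default explicitly. A minimal Python sketch, assuming a template call without nested templates or links in its arguments; `parse_template` is a hypothetical helper, not part of any existing tool:

```python
def parse_template(call, default_lang="en"):
    """Split '{{name|a|key=value}}' into name, positional and named arguments."""
    if not (call.startswith("{{") and call.endswith("}}")):
        raise ValueError("not a template call: %r" % call)
    name, *raw_args = call[2:-2].split("|")
    positional, named = [], {}
    for arg in raw_args:
        key, sep, value = arg.partition("=")
        if sep:
            named[key.strip()] = value.strip()   # named argument, e.g. lang=sv
        else:
            positional.append(arg.strip())       # positional argument
    named.setdefault("lang", default_lang)       # lang=en when not given
    return name.strip(), positional, named
```

For `{{simple past of|take}}` this gives the template name, the base form `take` as the only positional argument, and `lang` set to `en`.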
I am very familiar with the ways the English Wiktionary articles can be formatted (if not the current specific templates for every language), and I have enjoyed playing with DBpedia in the past. I've also done various kinds of parsing of Wiktionary data in various languages over the years.
But where can I find the simplest way to run the DBpedia tool that parses an English Wiktionary page? I'd like to try to learn it, which will also serve as an excuse to learn some Scala, I believe.
I did look at the documentation I could find, but it's quite dense and seems to require quite a bit of deep knowledge of DBpedia concepts and jargon. Is there an easy way in?
Andrew Dunbar (hippietrail)
Dear Lars,
On 05/24/2012 03:54 AM, Lars Aronsson wrote:
It is indeed confusing that the DBpedia webpage you link to points to this mailing list. It would be really helpful if Jonas Brekle would edit that page to include an introduction on what Wiktionary is (www.wiktionary.org and associated wiki sites in many languages, a project of the Wikimedia Foundation), and explain that his DBpedia project (wiktionary.dbpedia.org) is something else.
I added a line to the description: http://wiki.dbpedia.org/Wiktionary
Some time ago, there was a discussion about the purpose of this list. I know that a large part of the editors in Wiktionary are only concerned with the human-readable appearance. But there are quite a few other user groups that are interested in extracting and using the data. So basically we need a place to coordinate data extraction, templates and data consumption. I think this list is ideal for getting everybody together, i.e. Christian from JWKTL, Andrew from Wikokit, and Jonas and I from DBpedia Wiktionary are on here. Then we have people like Amgine who develop apps and would benefit from more structure, and also (now that we have focused on this list) new people like Christoph, who are interested in getting data out of Wiktionary. Personally, it is my opinion that a lot more people would contribute additional data and effort to Wiktionary if they were able to get it out again.
I firmly believe that it would be a setback for a lot of people if we started to divide the communities again. The interest in Wiktionary data is immense, and vast resources are burnt just because hundreds of people and companies are building parsers on their own (I know a company employing 2 students for 20 h/week, just for the parsing).
Jonas and I are trying to make Wiktionary-DBpedia the center of http://linguistics.okfn.org/resources/llod/ just as DBpedia is the center of this: http://lod-cloud.net/
The data can also be used to fix things in Wiktionary. Do you have a particular problem that causes a lot of distress and work amongst the editors, e.g. translation link consistency or update/maintenance procedures? We could try to create apps that help editors, but we would need a problem description.
All the best, Sebastian
On 2012-05-24 18:33, Sebastian Hellmann wrote:
I firmly believe that it would be a setback for a lot of people if we started to divide the communities again.
I don't want to divide the community. The problem I see is people look at DBpedia's RDF output, then come to this list (where DBpedia's page directs them) and complain that Wiktionary looks strange, because they think the RDF output is Wiktionary. I added an introductory paragraph that hopefully will reduce this risk.
The Wiktionary community is far smaller and more specialized than the Wikipedia community, and by presenting statistics and arguments, you should be able to influence how Wiktionary works, to facilitate your RDF extraction. For example, if you find some Swedish language templates deviate from a common pattern, just say so, and they can be fixed.
Thanks for clarifying this and helping to improve the wiki page. We will look into the Swedish pattern. It occurred to me that we might use the framework not only to extract facts, but also to find errors: not only by noticing where lexical data is missing, but also by running extractions whose sole purpose is to produce debugging messages.
Sebastian