Hi All,
Our data (using a 25-language dataset) agrees with Denny's. 99% of all connected components of the interlanguage link graph have only one article per language edition. This is something we looked into in some detail in our paper at ACM's CHI conference this year (http://www.brenthecht.com/papers/bhecht_CHI2012_omnipedia.pdf).
However, it is important to point out that the 1% tends to contain articles that are of great general interest. English articles that occur in these situations include "author", "art", "indigenous people", "education", "privacy", "liberal arts", "computer science", "agriculture", "socialism", "army", etc. To a certain extent, this is to be expected: where there is more global interest in a topic, there is going to be more ambiguity.
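For anyone who wants to reproduce this kind of count, here is a minimal sketch of the check described above: group articles into connected components of the interlanguage link graph, then flag every component that holds more than one article from the same language edition. It is not the code used for the paper. The function names and the toy links are made up for illustration, and links are treated as undirected, which is an assumption.

```python
from collections import Counter, defaultdict


def connected_components(links):
    """Group articles into components, treating every link as undirected."""
    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())


def is_ambiguous(component):
    """True if some language edition contributes more than one article."""
    per_language = Counter(language for language, _title in component)
    return any(count > 1 for count in per_language.values())


# Toy data, invented for illustration (not taken from the 25-language dataset).
# Each article is a (language, title) pair.
links = [
    (("en", "Renault Kangoo"), ("de", "Renault Kangoo")),
    (("de", "Nissan Kubistar"), ("en", "Renault Kangoo")),
    (("en", "Privacy"), ("de", "Privatsphäre")),
]

components = connected_components(links)
ambiguous = [c for c in components if is_ambiguous(c)]
share_clean = 1 - len(ambiguous) / len(components)
```

On real data, `share_clean` would be the figure that comes out at roughly 99%, and `ambiguous` would hold the remaining components worth inspecting by hand.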
Just my two cents.
- Brent
Brent Hecht
Ph.D. Candidate in Computer Science
CollabLab: The Collaborative Technology Laboratory
Northwestern University
w: http://www.brenthecht.com
e: brent@u.northwestern.edu
On Apr 5, 2012, at 4:50 PM, Denny Vrandečić wrote:
Regarding definitions:
Note that I said "Label + Description is identifying", not merely the label. I assume this to be true because even for your example of "Germany", the disambiguation page works with rather short descriptions of each disambiguated page [1]. So even the fuzzy concept that you gave as an example seems to be sufficiently identifiable for the sake and mission of the Wikipedia community, which gives me reason to believe that the community can sort this out. I mean, they basically already have!
Regarding the Kangoo / Kubistar example:
In Wikidata they would be represented as two pages, one for the Kubistar (which would link to the Danish and German page for the Kubistar), and one for the Kangoo (which would link to the 20 language versions of the Kangoo article, including a Danish and a German one). This is a rather simple example, which would be easily expressed with the exact matches that we suggest.
In Wikidata, the Wikipedia links are planned to be inverse functional - i.e., every Wikipedia article in a specific language can only be linked to from one single Wikidata article. Two Wikidata pages cannot claim the same Wikipedia article in a single language as their defining article.
I.e. in the Kubistar/Kangoo example there would be two Wikidata pages. One about the Kubistar, linking to de:Nissan_Kubistar and da:Nissan_Kubistar, and one about the Kangoo, linking to the 20 different Kangoo articles. The Wikidata page for Kubistar could not link to any of those Kangoo articles.
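To make the constraint concrete, here is a minimal sketch of an inverse functional link table. It is only an illustration, not the planned Wikidata implementation: the class name `SitelinkRegistry`, its methods, the item ids and the title `Renault_Kangoo` for the German article are all hypothetical.

```python
class SitelinkRegistry:
    """Maps each (language, title) article to at most one item."""

    def __init__(self):
        self._owner = {}  # (language, title) -> item id

    def link(self, item_id, language, title):
        """Claim an article for an item; refuse if another item owns it."""
        key = (language, title)
        owner = self._owner.setdefault(key, item_id)
        if owner != item_id:
            raise ValueError(
                f"{language}:{title} is already claimed by {owner}"
            )

    def links_of(self, item_id):
        """Return the articles claimed by an item, sorted for stable output."""
        return sorted(k for k, v in self._owner.items() if v == item_id)


registry = SitelinkRegistry()
registry.link("kubistar", "de", "Nissan_Kubistar")
registry.link("kubistar", "da", "Nissan_Kubistar")
registry.link("kangoo", "de", "Renault_Kangoo")  # hypothetical title

# The Kubistar item cannot also claim an article the Kangoo item owns.
try:
    registry.link("kubistar", "de", "Renault_Kangoo")
    rejected = False
except ValueError:
    rejected = True
```

Storing the mapping from article to item, rather than from item to articles, is what makes the rule cheap to enforce: a second claim on the same article is a key collision.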
Please do not misunderstand: I am not categorically against inexact matches or broader or narrower ones (or else I wouldn't be discussing this). But I haven't yet seen examples that convince me that the additional complexity of broader/narrower or inexact matches is required. As I said before, if we can model more than 99% of all language links with the suggested simple solution, I am reluctant to make it more complicated for the remaining <1%.
Cheers, Denny
P.S.: oh, yes, indeed! Thank you for this excellent and interesting discussion, it really does shed light on some of the aspects of the current draft of the data model, and will eventually improve it and sharpen the understanding of the model.
[1] https://en.wikipedia.org/wiki/Germany_(disambiguation)
2012/4/5 Gregor Hagedorn <g.m.hagedorn@gmail.com>
On 5 April 2012 18:30, Denny Vrandečić <denny.vrandecic@wikimedia.de> wrote:
The label and the description together are meant to be identifying.
I.e. "Georgia - A country in central Asia", or "Frankfurt - A city in Hesse, Germany", etc.
Additionally, the Wikipedia links provide quite some guidance to it.
I believe it will be difficult to craft labels that work as definitions. A label is a hint, and may often be sufficiently precise for the majority of purposes. If we speak of "Germany" it is very hard to express in a simple string the different historical, geographical, political delimitations that this term may carry.
In my own field of work even technical terms are often difficult to resolve to a definition. In biology, the width of taxon delimitations changes over time and with new research, and even technical terms in morphology often have quite different meanings, depending on the "school" that is being followed.
Or to cite a car example again: The label "Renault Kangoo" is unspecific as to the version/revision/release of it, so technical data that vary between these versions cannot be added to it. However, the car described at de.wikipedia.org/wiki/Nissan_Kubistar is in most Wikipedias also subsumed under "Renault Kangoo". So it is a valid assumption that labeling something "Renault Kangoo" refers to both of these identical models sold under different names. But then, the "Nissan Kubistar" is only equivalent to the first version/revision/release of the "Renault Kangoo"...
This is not unsolvable, but if you want to import or add data to an element, it will be very hard to judge the correct concept from a short label. I was hoping that linking this to Wikipedia articles would help, but this will be hard if a Wikidata page is linked to 40 Wikipedias, of which any given Wikidata editor can read only a handful, and with no support to distinguish between exactMatch and closeMatch.
My suggestion is to allow a differentiation of exactMatch and closeMatch and instruct editors to use at least one exact match, and to consider this or these the defining Wikipedia pages, whereas others are added as close matches.
Of course, the label will remain useful to stumble upon changes in the definition or width of a concept over time, and to correct those after consulting the revision number to which the original links were formed (not present, but perhaps achievable by some timestamping and comparison?)
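As a rough illustration of what I mean, here is a minimal sketch of an element that keeps exact and close matches apart and requires at least one exact match. The class `Item`, its fields and the example links are hypothetical; nothing like this exists in the current draft of the data model.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    label: str
    # Both sets hold (language, title) pairs.
    exact_matches: set = field(default_factory=set)
    close_matches: set = field(default_factory=set)

    def defining_pages(self):
        """The exact matches are the pages that define the concept."""
        return sorted(self.exact_matches)

    def is_valid(self):
        """Needs at least one exact match; a page cannot be both kinds."""
        has_exact = len(self.exact_matches) >= 1
        overlap = self.exact_matches & self.close_matches
        return has_exact and not overlap


# Illustrative links only: the Kubistar article is treated as a close
# match because it corresponds to just one release of the Kangoo.
kangoo = Item(
    label="Renault Kangoo",
    exact_matches={("en", "Renault_Kangoo")},
    close_matches={("de", "Nissan_Kubistar")},
)
unanchored = Item(label="Renault Kangoo")
```

Here `kangoo.is_valid()` holds, while `unanchored.is_valid()` does not, because it has no defining page an editor could consult.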
Gregor
Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
-- Project director Wikidata Wikimedia Deutschland e.V. | Eisenacher Straße 2 | 10777 Berlin Tel. +49-30-219 158 26-0 | http://wikimedia.de
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V. Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter der Nummer 23855 B. Als gemeinnützig anerkannt durch das Finanzamt für Körperschaften I Berlin, Steuernummer 27/681/51985.