Thanks, Brent! You are exactly the person I was hoping to get some numbers from :)

I am extremely curious what kind of statements people will make on the Wikidata pages about "art", "privacy", "agriculture", "army", etc. I am looking forward to seeing what the community will add there. That'll be fun to watch :)

(Usually, such things tend to be retroactively obvious, but extremely hard to predict :) )

Cheers,
Denny

2012/4/6 Brent Hecht <brent@u.northwestern.edu>
Hi All,

Our data (using a 25-language dataset) agrees with Denny's. 99% of all connected components of the interlanguage link graph have only one article per language edition. This is something we looked into in some detail in our paper at ACM's CHI conference this year (http://www.brenthecht.com/papers/bhecht_CHI2012_omnipedia.pdf).
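To make that measurement concrete, here is a minimal sketch of how one might compute it. This is not the code used in the paper; the article titles and links are hypothetical toy data, and the names find_components and is_ambiguous are assumptions made for illustration. It treats interlanguage links as undirected edges, finds the connected components, and flags any component that holds more than one article from the same language edition.

```python
from collections import defaultdict


def find_components(links):
    """Group articles into connected components, treating links as undirected."""
    neighbors = defaultdict(set)
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)

    seen = set()
    components = []
    for start in neighbors:
        if start in seen:
            continue
        # Depth-first traversal from an article we have not visited yet.
        stack = [start]
        seen.add(start)
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        components.append(component)
    return components


def is_ambiguous(component):
    """True if some language edition contributes more than one article."""
    langs = [article.split(":", 1)[0] for article in component]
    return len(langs) != len(set(langs))


# Hypothetical toy links in "lang:Title" form (not real interlanguage data).
links = [
    ("en:Frankfurt", "de:Frankfurt_am_Main"),
    ("de:Frankfurt_am_Main", "fr:Francfort-sur-le-Main"),
    ("en:Renault_Kangoo", "de:Renault_Kangoo"),
    ("de:Nissan_Kubistar", "en:Renault_Kangoo"),
]

components = find_components(links)
ambiguous = [c for c in components if is_ambiguous(c)]
share_clean = 1 - len(ambiguous) / len(components)
```

In this toy graph the Kangoo component contains two German articles, so it is the kind of component that falls into the remaining 1%.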

However, it is important to point out that the 1% tends to contain articles that are of great general interest. Some English articles that occur in these situations include "author", "art", "indigenous people", "education", "privacy", "liberal arts", "computer science", "agriculture", "socialism", "army", etc. To a certain extent, this is to be expected. Where there is more global interest in a topic, there is going to be more ambiguity.

Just my two cents.

- Brent


Brent Hecht
Ph.D. Candidate in Computer Science
CollabLab: The Collaborative Technology Laboratory
Northwestern University
w: http://www.brenthecht.com
e: brent@u.northwestern.edu


On Apr 5, 2012, at 4:50 PM, Denny Vrandečić wrote:

> Regarding definitions:
>
> Note that I said "Label + Description is identifying", not merely the label. I assume this to be true because even for your example of "Germany", the disambiguation page works with rather short descriptions of each disambiguated page [1]. So even the fuzzy concept that you gave as an example seems to be sufficiently identifiable for the sake and mission of the Wikipedia community, which gives me reason to believe that the community can sort this out. I mean, they basically already have!
>
> Regarding the Kangoo / Kubistar example:
>
> In Wikidata they would be represented as two pages, one for the Kubistar (which would link to the Danish and German page for the Kubistar), and one for the Kangoo (which would link to the 20 language versions of the Kangoo article, including a Danish and a German one). This is a rather simple example, which would be easily expressed with the exact matches that we suggest.
>
> In Wikidata, the Wikipedia links are planned to be inverse functional - i.e., every Wikipedia article in a specific language can only be linked to from one single Wikidata article. Two Wikidata pages cannot claim the same Wikipedia article in a single language as their defining article.
>
> I.e. in the Kubistar/Kangoo example there would be two Wikidata pages. One about the Kubistar, linking to de:Nissan_Kubistar and da:Nissan_Kubistar, and one about the Kangoo, linking to the 20 different Kangoo articles. The Wikidata page for Kubistar could not link to any of those Kangoo articles.
>
> Please do not misunderstand, I am not categorically against inexact matches or broader or narrower matches (or else I wouldn't be discussing this). But I haven't seen examples yet that convince me that the additional complexity of broader/narrower or inexact matches is required. As I said before, if we can model more than 99% of all language links with the suggested simple solution, I am reluctant to make it more complicated for the remaining <1%.
>
> Cheers,
> Denny
>
> P.S.: oh, yes, indeed! Thank you for this excellent and interesting discussion, it really does shed light on some of the aspects of the current draft of the data model, and will eventually improve it and sharpen the understanding of the model.
>
> [1] https://en.wikipedia.org/wiki/Germany_(disambiguation)
>
>
>
> 2012/4/5 Gregor Hagedorn <g.m.hagedorn@gmail.com>
> On 5 April 2012 18:30, Denny Vrandečić <denny.vrandecic@wikimedia.de> wrote:
> > The label and the description together are meant to be identifying.
> >
> > I.e. "Georgia - A country in central Asia", or "Frankfurt - A city in Hesse,
> > Germany", etc.
> >
> > Additionally, the Wikipedia links provide quite some guidance to it.
>
> I believe it will be difficult to craft labels that work as
> definitions. A label is a hint, and may often be sufficiently precise
> for the majority of purposes. If we speak of "Germany" it is very hard
> to express in a simple string the different historical, geographical,
> political delimitations that this term may carry.
>
> In my own field of work even technical terms are often difficult to
> resolve to a definition. In biology, the width of taxon delimitations
> changes over time and with new research, and even technical terms in
> morphology often have quite different meanings, depending on the
> "school" that is being followed.
>
> Or to cite a car example again: The label "Renault Kangoo" is
> unspecific as to the version/revision/release of it, so technical data
> that vary between these versions cannot be added to it. However, the
> de.wikipedia.org/wiki/Nissan_Kubistar is in most Wikipedias also
> subsumed under "Renault Kangoo". So it is a valid assumption that when
> labeling something "Renault Kangoo" it refers to both of these
> identical models sold under different names. But then, the "Nissan
> Kubistar" is only equivalent to the first version/revision/release of
> the "Renault Kangoo"...
>
> This is not unsolvable, but if you want to import or add data to an
> element, it will be very hard to judge the correct concept from a
> short label. I was hoping that linking this to Wikipedia articles
> would help, but this will be hard if a Wikidata page is linked to 40
> Wikipedias, of which any given Wikidata editor can read only a
> handful, and there is no support for distinguishing between
> exactMatch and closeMatch.
>
> My suggestion is to allow a differentiation of exactMatch and
> closeMatch and to instruct editors to use at least one exact match
> and consider this or these the defining Wikipedia pages, whereas
> others are added as close matches.
>
> Of course, the label will remain useful for stumbling upon changes in
> the definition or width of a concept over time, and for correcting
> those after consulting the revision number at which the original link
> was formed (not present, but perhaps achievable by some timestamping
> and comparison?)
>
> Gregor
>
> _______________________________________________
> Wikidata-l mailing list
> Wikidata-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata-l
>
>
>
> --
> Project director Wikidata
> Wikimedia Deutschland e.V. | Eisenacher Straße 2 | 10777 Berlin
> Tel. +49-30-219 158 26-0 | http://wikimedia.de
>
> Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V. Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter der Nummer 23855 B. Als gemeinnützig anerkannt durch das Finanzamt für Körperschaften I Berlin, Steuernummer 27/681/51985.
>