Thanks, Brent! You are exactly the person I was hoping to get some numbers from :)
I am extremely curious what kind of statements people will make on the
Wikidata pages about "art", "privacy", "agriculture", "army", etc. I am
looking forward to seeing what the community will add there. That'll be fun
to watch :)
(Usually, such things tend to be retroactively obvious, but extremely hard
to predict :) )
Cheers,
Denny
2012/4/6 Brent Hecht <brent(a)u.northwestern.edu>
Hi All,
Our data (using a 25-language dataset) agrees with Denny's. 99% of all
connected components of the interlanguage link graph have only one article
per language edition. This is something we looked into in some detail in
our paper at ACM's CHI conference this year (
http://www.brenthecht.com/papers/bhecht_CHI2012_omnipedia.pdf).
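For anyone who wants to try this kind of count on their own data, here is a minimal Python sketch of the check. It is not the code behind the paper: the (language, title) node representation, the function names, and the toy links are all assumptions made for illustration, and links are treated as undirected.

```python
from collections import defaultdict

def connected_components(edges):
    """Group the nodes of an undirected graph into components (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())

def is_unambiguous(component):
    """True if no language edition appears more than once in the component."""
    languages = [lang for lang, _title in component]
    return len(languages) == len(set(languages))

# Hypothetical toy interlanguage links; each node is (language, title).
links = [
    (("en", "Berlin"), ("de", "Berlin")),
    (("de", "Berlin"), ("da", "Berlin")),
    (("en", "Art"), ("de", "Kunst")),
    (("de", "Kunst"), ("en", "The arts")),  # two English articles, one component
]

components = connected_components(links)
clean = [c for c in components if is_unambiguous(c)]
share = len(clean) / len(components)  # fraction of unambiguous components
```

Articles without any interlanguage link never appear in `links`, so a real count would have to add them as single-node components first.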
However, it is important to point out that the 1% tends to contain
articles that are of great general interest. Some English articles that
occur in these situations include "author", "art", "indigenous people",
"education", "privacy", "liberal arts", "computer science", "agriculture",
"socialism", "army", etc. To a certain extent, this is to be expected.
Where there is more global interest in a topic, there is going to be more
ambiguity.
Just my two cents.
- Brent
Brent Hecht
Ph.D. Candidate in Computer Science
CollabLab: The Collaborative Technology Laboratory
Northwestern University
w: http://www.brenthecht.com
e: brent(a)u.northwestern.edu
On Apr 5, 2012, at 4:50 PM, Denny Vrandečić wrote:
Regarding definitions:
Note that I said "Label + Description is identifying", not merely the
label. I assume this to be true because even for your example of "Germany",
the disambiguation page works with rather short descriptions of each
disambiguated page [1]. So even the fuzzy concept that you gave as an example
seems to be sufficiently identifiable for the sake and mission of the
Wikipedia community, which gives me reason to believe that the community
can sort this out. I mean, they basically already have!
Regarding the Kangoo / Kubistar example:
In Wikidata they would be represented as two pages, one for the Kubistar
(which would link to the Danish and German pages for the Kubistar), and one
for the Kangoo (which would link to the 20 language versions of the Kangoo
article, including a Danish and a German one). This is a rather simple
example, which would be easily expressed with the exact matches that we
suggest.
In Wikidata, the Wikipedia links are planned to be inverse functional -
i.e., every Wikipedia article in a specific language can only be linked to
from one single Wikidata article. Two Wikidata pages cannot claim the same
Wikipedia article in a single language as their defining article.
I.e. in the Kubistar/Kangoo example there would be two Wikidata pages.
One about the Kubistar, linking to de:Nissan_Kubistar and
da:Nissan_Kubistar, and one about the Kangoo, linking to the 20 different
Kangoo articles. The Wikidata page for Kubistar could not link to any of
those Kangoo articles.
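The inverse functional constraint can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the planned Wikidata implementation: the `SitelinkRegistry` class, the item ids "kubistar" and "kangoo", and the "Renault_Kangoo" titles are hypothetical names chosen for the example.

```python
class SitelinkConflict(Exception):
    """Raised when an article is already claimed by a different item."""

class SitelinkRegistry:
    def __init__(self):
        # (language, title) -> item id; each article has at most one owner.
        self._owner = {}

    def link(self, item_id, language, title):
        key = (language, title)
        owner = self._owner.get(key)
        if owner is not None and owner != item_id:
            raise SitelinkConflict(
                f"{language}:{title} is already linked from {owner}"
            )
        self._owner[key] = item_id

    def links_of(self, item_id):
        """All (language, title) pairs claimed by the given item, sorted."""
        return sorted(k for k, v in self._owner.items() if v == item_id)

registry = SitelinkRegistry()
registry.link("kubistar", "de", "Nissan_Kubistar")
registry.link("kubistar", "da", "Nissan_Kubistar")
registry.link("kangoo", "de", "Renault_Kangoo")  # hypothetical title
registry.link("kangoo", "da", "Renault_Kangoo")  # hypothetical title

# The Kubistar item may not claim an article the Kangoo item already owns.
try:
    registry.link("kubistar", "de", "Renault_Kangoo")
    conflict = False
except SitelinkConflict:
    conflict = True
```

Keying the lookup on the article rather than on the item is what makes the rule cheap to enforce: a second claim on the same article is rejected at insertion time.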
Please do not misunderstand, I am not categorically against non-exact
matches or broader or narrower ones (or else I wouldn't be discussing this).
But I haven't seen examples yet that convince me that the additional
complexity of broader/narrower or inexact matches is required. As I said
before, if we can model more than 99% of all language links with the
suggested simple solution, I am reluctant to make it more complicated for
the remaining <1%.
Cheers,
Denny
P.S.: oh, yes, indeed! Thank you for this excellent and interesting
discussion, it really does shed light on some of the aspects of the current
draft of the data model, and will eventually improve it and sharpen the
understanding of the model.
[1] https://en.wikipedia.org/wiki/Germany_(disambiguation)
2012/4/5 Gregor Hagedorn <g.m.hagedorn(a)gmail.com>
On 5 April 2012 18:30, Denny Vrandečić <denny.vrandecic(a)wikimedia.de> wrote:
> The label and the description together are meant to be identifying.
>
> I.e. "Georgia - A country in central Asia", or "Frankfurt - A city in
> Hesse, Germany", etc.
Additionally, the Wikipedia links provide quite some guidance to it.
I believe it will be difficult to craft labels that work as
definitions. A label is hinting, and may often be sufficiently precise
for the majority of purposes. If we speak of "Germany" it is very hard
to express in a simple string the different historical, geographical,
political delimitations that this term may carry.
In my own field of work even technical terms are often difficult to
resolve to a definition. In biology, the width of taxon delimitations
changes over time and with new research, and even technical terms in
morphology often have quite different meanings, depending on the
"school" that is being followed.
Or to cite a car example again: The label "Renault Kangoo" is
unspecific as to the version/revision/release of it, so technical data
that vary between these versions cannot be added to it. However, the
Nissan Kubistar (de.wikipedia.org/wiki/Nissan_Kubistar) is in most
Wikipedias also subsumed under "Renault Kangoo". So it is a valid
assumption that when
labeling something "Renault Kangoo" it refers to both of these
identical models sold under different names. But then, the "Nissan
Kubistar" is only equivalent to the first version/revision/release of
the "Renault Kangoo"...
This is not unsolvable, but if you want to import or add data to an
element, it will be very hard to judge from a short label the correct
concept. I was hoping that linking this to Wikipedia articles would
help, but this will be hard if a Wikidata page is linked to 40
Wikipedias, of which any given Wikidata editor can read only a handful,
and there is no support for distinguishing between exactMatch and
closeMatch.
My suggestion is to allow a differentiation of exactMatch and
closeMatch, to instruct editors to use at least one exact match, and
to consider this or these the defining Wikipedia pages, whereas others
are added as close matches.
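As a rough illustration of this suggestion, here is a minimal Python sketch. It is one possible reading, not a proposed data model: the `Item` class, its field names, and the "en:Renault_Kangoo" title are hypothetical and introduced only for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Item:
    label: str
    exact_matches: set = field(default_factory=set)  # defining Wikipedia pages
    close_matches: set = field(default_factory=set)  # related but not defining

    def defining_pages(self):
        """The pages an editor should read to learn what the item means."""
        return sorted(self.exact_matches)

    def is_valid(self):
        """At least one exact match, and no page listed as both kinds."""
        has_exact = bool(self.exact_matches)
        overlap = self.exact_matches & self.close_matches
        return has_exact and not overlap

kubistar = Item(
    label="Nissan Kubistar",
    exact_matches={"de:Nissan_Kubistar", "da:Nissan_Kubistar"},
    close_matches={"en:Renault_Kangoo"},  # hypothetical title
)

unanchored = Item(label="Nissan Kubistar", close_matches={"en:Renault_Kangoo"})
```

Here `kubistar.is_valid()` holds because two exact matches define the item, while `unanchored` fails the check since it has only a close match.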
Of course, the label will remain useful for stumbling upon changes in
the definition or width of a concept over time, and for correcting
those after consulting the revision against which the original link
was formed (not present, but perhaps achievable by some timestamping
and comparison?)
Gregor
_______________________________________________
Wikidata-l mailing list
Wikidata-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l
--
Project director Wikidata
Wikimedia Deutschland e.V. | Eisenacher Straße 2 | 10777 Berlin
Tel. +49-30-219 158 26-0 | http://wikimedia.de
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
Registered in the register of associations of the Amtsgericht
Berlin-Charlottenburg under number 23855 B. Recognized as charitable by the
Finanzamt für Körperschaften I Berlin, tax number 27/681/51985.