(putting it back on list as I think it helps to reach a common understanding of what we try and what we should achieve)
2012/4/6 Gregor Hagedorn g.m.hagedorn@gmail.com
off list, because I am afraid this is getting off-topic, but you can put it back to list if you like:
In Wikidata they would be represented as two pages, one for the Kubistar (which would link to the Danish and German page for the Kubistar), and one for the Kangoo (which would link to the 20 language versions of the Kangoo article, including a Danish and a German one). This is a rather simple example, which would be easily expressed with the exact matches that we suggest.
So far I fail to understand: where would the actual data for the model live? (The Kangoo page is a summary page for models with different specs in the infoboxes.)
I assume on separate Wikidata pages that have no relation to the Kangoo page?
Correct. Information about different models would be given on different Kangoo pages. If there is no Wikipedia page for that model in a given Wikipedia, no Wikipedia link should be given. (This does not mean that there can be no information displayed about them in the given Wikipedia: any Wikipedia article will be able to display any information about any item in Wikidata in phase 3).
I am concerned with reverse discovery of information by editors coming from wikipedia to wikidata. I realize this is a separate topic, but one in the back of my mind (or rather in my use case scenario) and why I am arguing to allow such relations.
There will be plenty of links connecting Wikidata items with each other. I don't think that this kind of information discovery will be a major hurdle.
The Kubistar example is interesting to me, because Wikidata would then suggest that the English Wikipedia has no information on the Kubistar, whereas it really is available on the Kangoo page.
This is in fact one thing that the interlanguage links have been used for, but with limited success because of situations like the one above.
The assumption that because there is no article about X in a given Wikipedia, there is no information about X in that Wikipedia, is not correct, and should not be made.
from one single Wikidata article. Two Wikidata pages cannot claim the same Wikipedia article in a single language as their defining article.
Here you speak of a defining article, whereas currently it is a set of more or less roughly related Wikipedia pages in different languages.
Our assumption is that, in general, if one Wikipedia page is identifying a topic, the ones connected through interwiki links are identifying the same topic. The rules for interwiki links in the German Wikipedia mandate that [1]; the ones in the English Wikipedia are only a little bit less strict about that [2]. For the few exceptions where that is not the case, interwiki links can be set and overwritten locally.
[1] https://de.wikipedia.org/wiki/Hilfe:Internationalisierung
[2] https://en.wikipedia.org/wiki/Help:Interlanguage_links
late in the night, like for you as well...
Gregor
Many thanks for this discussion and the clarifications. I believe I understand much better. Below I summarize my understanding; much of it matches the existing documentation, but if someone in the Wikidata team finds the time to check whether the present documentation can be improved, or whether it was just my personal misunderstanding, the summary below may be helpful. In retrospect, I believe much of my misunderstanding is based on the present focus on phase 1, interlanguage links.
----
1. A Wikidata page is a primary object, a "Wikiconcept" at any desired level of granularity. Except for redirects or disambiguations, each Wikipedia page in the article namespace will initially have at least one Wikidata concept (page), because all concepts defined in Wikipedia are considered relevant to Wikidata. However, many more Wikidata concepts (pages) need to be created beyond the Wikipedia concepts. The concept granularity required for Wikidata is much finer than the typical (aggregated) granularity used on Wikipedia.
2. One or several Wikipedia language version pages may be linked to this Wikidata concept (Wikidata 1 to Wikipedia 0..n). Importantly (and something that was either unclear or I overlooked it): ALL such Wikipedia language version pages are considered to be completely and equally defining the Wikidata concept (page).
3. The initial setup of Wikidata concepts and Wikipedia links will be from the interlanguage links in the Wikipedias. However, these links are _not_ meant to represent interlanguage links per se with their need to express both exact and close matches. The import of interlanguage links is rather designed as a _seeding_ mechanism for Wikidata.
4. The cases where concepts overlap across languages (e.g. different words for waterbodies classified by size), where data require more precise concept definitions than textual descriptions, or where linking is plainly imprecise need to be removed from Wikidata by a to-be-founded editing community.
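To make points 1 to 4 concrete, here is a minimal sketch of how I picture the model. It is not the Wikidata implementation. All names (`Item`, `Registry`, `seed`, the `item-N` identifiers) are hypothetical, and the restriction to one linked page per language per concept is my own assumption. The sketch shows a concept with 0..n Wikipedia links, the rule that no Wikipedia page can be claimed by two concepts, and seeding from groups of interlanguage links.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Item:
    """A hypothetical Wikidata concept (illustrative, not the real schema)."""
    item_id: str
    # language code -> Wikipedia page title; may be empty (0..n links)
    sitelinks: dict[str, str] = field(default_factory=dict)


class Registry:
    """Holds concepts and enforces that each Wikipedia page has one owner."""

    def __init__(self) -> None:
        self.items: dict[str, Item] = {}
        # (language, title) -> item_id of the concept that claims the page
        self.claimed: dict[tuple[str, str], str] = {}

    def add_item(self, item_id: str) -> Item:
        item = Item(item_id)
        self.items[item_id] = item
        return item

    def link(self, item_id: str, lang: str, title: str) -> None:
        item = self.items[item_id]
        owner = self.claimed.get((lang, title))
        if owner is not None and owner != item_id:
            # Two concepts cannot claim the same page in one language.
            raise ValueError(f"{lang}:{title} is already claimed by {owner}")
        if item.sitelinks.get(lang, title) != title:
            # Assumption: at most one linked page per language per concept.
            raise ValueError(f"{item_id} already has a page in '{lang}'")
        item.sitelinks[lang] = title
        self.claimed[(lang, title)] = item_id


def seed(registry: Registry, groups: list[list[tuple[str, str]]]) -> None:
    """Create one concept per group of interlanguage-linked pages."""
    for n, group in enumerate(groups, start=1):
        item = registry.add_item(f"item-{n}")
        for lang, title in group:
            registry.link(item.item_id, lang, title)
```

Seeding this with one group for the Kubistar (Danish and German pages only) and one for the Kangoo would give two concepts. An attempt to link the English Kangoo page to the Kubistar concept as well would be rejected, which is the case I was asking about earlier.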
Some conclusions from my point of view:
a) I find it unlikely that 2 "interlanguage-concept-relation-editing-communities" will be willing to work in parallel. I believe the Wikidata project must find a point in time, where the editing of interlanguage links is switched over to Wikidata. Keeping it in parallel without an active community will create problems due to concept drift in Wikipedia pages.
b) Early on in the project, after a testing period but before the Wikidata concepts start being linked to permanent data, provide the means of editing the entirety of interlanguage links, including those expressing a "closeMatch". This may need further analysis.
c) The Wikipedia page _version_, which is considered to be the defining one, should be recorded as part of the wikidata model.
d) To consider: Perhaps avoid calling the Wikidata concept a "page" (even if at some level it will be implemented partly through a MediaWiki page). Avoiding the word "page" may make it easier to see the independence of the Wikidata concept.
----
Gregor
This is a good discussion. The true dynamics won't be known until you've got live users on the system, but based on what I've seen with existing Wikipedia edits, the dynamics will be even more complex than predicted so far (which is already pretty complex!).
Some other things to consider:
- the focus of Wikipedia articles drifts over time (with good feedback loops built in to the system, this should hopefully be self-correcting)
- label/description disagreement occurs - title says one thing, first few sentences (which is often all people scan when working quickly) say something different, the article taken as a whole is about a third thing
- you'll see different behavior depending on whether you track by article number (internal ID) or article title
- the granularity of Wikipedia articles depends on the length of the text, not just semantics. Concepts with lots of text get split across multiple articles (e.g. WW II), while concepts which don't have much written about them risk getting combined into composite articles about multiple concepts.
- redirects are used for: aliases, misspellings, "see instead" references to semantically different articles, and probably other things that I'm not aware of. This can complicate doing something meaningful with them.
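The point about tracking by internal ID versus title can be illustrated with a small sketch. Everything in it is hypothetical (the `Page` class, the ID value, the helper names), and it deliberately ignores the redirect that a rename normally leaves behind; it only shows why the two tracking strategies diverge once a title changes.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Page:
    page_id: int  # internal ID; assumed stable across renames
    title: str    # human-readable title; changes when the page is renamed


def resolve_by_title(pages: list[Page], title: str) -> Page | None:
    """Find a page by the title that was stored when the link was made."""
    return next((p for p in pages if p.title == title), None)


def resolve_by_id(pages: list[Page], page_id: int) -> Page | None:
    """Find a page by the internal ID that was stored with the link."""
    return next((p for p in pages if p.page_id == page_id), None)


def rename(page: Page, new_title: str) -> None:
    """Rename a page in place; no redirect is modelled in this sketch."""
    page.title = new_title
```

A link stored as the title "Kangoo" stops resolving after the page is renamed to "Renault Kangoo", while a link stored as the page's ID still resolves but now points at whatever the renamed article has become.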
Another source for data on the current articles and their behavior is Freebase. Wikipedia-based topics which have been split or combined retain an audit trail that lets you figure out what happened. It only covers the last 5 years and only the English Wikipedia, but within those limitations it could provide some interesting insights. I'm happy to help anyone who wants to work with this data.
Tom