Apologies if this has already been discussed.
http://meta.wikimedia.org/wiki/Wikidata/Notes/Data_model#The_Metamodel defines: Wikipedialink = (Title, LanguageId, Badge?)
In my experience the scope or extent of entities in different Wikipedias sometimes differs. One Wikipedia considers an entity a valid lemma, whereas another Wikipedia subsumes it in a larger lemma.
Would it help in expressing these situations to change the model to Wikipedialink = (Title, LanguageId, Relation, Badge?), where Relation can be "broader match", "close match", "exact match", "narrower match", etc.? I believe this could prevent problems later on.
The default (and initial import of interlanguage links) could easily be "close match" - to be refined only where required.
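To make the proposed tuple concrete, here is a minimal sketch in Python. The class names, field names and string values are assumptions made for illustration; only the field list and the four relation values come from the proposal above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Relation(Enum):
    """Hypothetical encoding of the four relation values listed above."""
    EXACT_MATCH = "exact match"
    CLOSE_MATCH = "close match"
    BROADER_MATCH = "broader match"
    NARROWER_MATCH = "narrower match"


@dataclass(frozen=True)
class WikipediaLink:
    """Wikipedialink = (Title, LanguageId, Relation, Badge?)."""
    title: str
    language_id: str
    # Suggested default for the initial import of interlanguage links.
    relation: Relation = Relation.CLOSE_MATCH
    # The badge is optional in the model, hence Optional with a None default.
    badge: Optional[str] = None


# An imported link gets the default; it is refined only where required.
imported = WikipediaLink("Some article", "de")
refined = WikipediaLink("Some article", "de", Relation.BROADER_MATCH)
```

With a default in place, the bulk import needs no per-link decision, and only the contested links have to be touched by hand.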
Gregor
I wholeheartedly support this. Having done a good deal of inter-language linking on WP, I was often forced to make a call on whether two similar articles with different scope should be linked.
A little disclaimer regarding http://meta.wikimedia.org/wiki/Wikidata/Notes/Data_model:
This page does not reflect the most current status of our discussions/designs. A much refined and more consistent description of the data model will soon be published. Please stay tuned ...
Markus
On 01/04/12 09:35, Ivan Cherevko wrote:
I wholeheartedly support this. Having done a good deal of inter-language linking on WP, I was often forced to make a call on whether two similar articles with different scope should be linked.
-- Ivan Cherevko | +38 050 382 97 88
Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
On 01/04/12 00:31, Gregor Hagedorn wrote:
Apologies if this has already been discussed.
http://meta.wikimedia.org/wiki/Wikidata/Notes/Data_model#The_Metamodel defines: Wikipedialink = (Title, LanguageId, Badge?)
In my experience the scope or extent of entities in different Wikipedias sometimes differs. One Wikipedia considers an entity a valid lemma, whereas another Wikipedia subsumes it in a larger lemma.
Would it help in expressing these situations to change the model to Wikipedialink = (Title, LanguageId, Relation, Badge?), where Relation can be "broader match", "close match", "exact match", "narrower match", etc.? I believe this could prevent problems later on.
The default (and initial import of interlanguage links) could easily be "close match" - to be refined only where required.
This is a valid point. It is intended to address this as follows:
* Wikidata items (our "content pages") will be in *exact* correspondence to (zero or more) Wikipedia articles in different languages.
* Differences in scope will lead to different Wikidata items.
* Relationships such as "broader" or "narrower" can be expressed as relations between these items, if desired.
The advantage of this is that the possible relationships are not system-defined but can be selected and modified by the community.
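A minimal sketch of this design in Python, under stated assumptions: the names `Item`, `Statement` and `related`, the identifier scheme and the example titles are all hypothetical and not part of any published Wikidata model. It only illustrates that article links stay exact while "broader"/"narrower" become ordinary data between items.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Item:
    """An item in exact correspondence to zero or more Wikipedia articles."""
    item_id: str  # hypothetical identifier scheme
    # At most one article per language, hence a dict keyed by language id.
    sitelinks: Dict[str, str] = field(default_factory=dict)


# (subject item id, relation name, object item id). The relation name is
# free text here, because the vocabulary is chosen by the community.
Statement = Tuple[str, str, str]


def related(statements: List[Statement], item_id: str, relation: str) -> List[str]:
    """Return the ids of all items that item_id points to via relation."""
    return [obj for subj, rel, obj in statements
            if subj == item_id and rel == relation]


# Two articles with different scope lead to two different items ...
topic = Item("item-1", {"en": "Topic", "de": "Thema"})
subtopic = Item("item-2", {"fr": "Sous-thème"})

# ... and the difference in scope is stated between the items:
# here "item-2 has the broader concept item-1".
statements: List[Statement] = [(subtopic.item_id, "broader", topic.item_id)]
```

Because the relation name is plain data in this sketch, adding or renaming a relation type requires no change to the structure itself.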
In general, Wikidata will not be able to replace all interwiki links: it will remain possible to define additional links in each Wikipedia to cover cases where the relationship between articles is not exact. In the end, Wikidata does not intend to capture relationships between article texts (e.g., "has a paragraph about this subject" or "contains further information on this topic") but relationships between the entities that the articles are about (e.g., "is capital of" or "was composed by"). This naturally limits the scope of covering interwiki links.
Markus
On 1 April 2012 13:04, Markus Krötzsch markus.kroetzsch@cs.ox.ac.uk wrote:
This is a valid point. It is intended to address this as follows:
* Wikidata items (our "content pages") will be in *exact* correspondence to (zero or more) Wikipedia articles in different languages.
* Differences in scope will lead to different Wikidata items.
* Relationships such as "broader" or "narrower" can be expressed as relations between these items, if desired.
This is a technically valid solution. Socially, I fear it would lead to endless uncertainty about which mechanism to use. Few abstract entities will have exactly the same delimitation/width, but where should one switch from one method of linking (one Wikidata page with several more or less closely matching Wikipedia pages) to the other (several Wikidata pages, one for each Wikipedia page in each language)?
Also, importing data will be a nightmare, because the concepts used in imported data will have to be compared with all Wikipedias. One Wikipedia language version covers the post-WWII extent of Russia together with the current one, while another language version separates them. This may not have mattered before, so only one Wikidata page links to both language versions. However, at some point historical data are imported and suddenly Wikidata needs to be reorganized to have two pages. ... Just thinking aloud; this may be unavoidable...
However, my gut feeling is that if you plan to avoid relations between Wikidata and Wikipedia, it might be more comprehensible to always use only one method, i.e. to have only a 0 to 1 or 1 to 1 relation between a Wikidata page and a Wikipedia page, and to express everything else in relations between Wikidata pages. These relations are then easily traceable and updateable as the broadness or narrowness of a page in a given Wikipedia develops over time.
In general, Wikidata will not be able to replace all interwiki links: it will remain possible to define additional links in each Wikipedia to cover cases where the relationship between articles is not exact.
This worries me. It means that there will forever be conflicting systems for editing interwiki links. If everything can be achieved with Wikipedia, but only a subset with Wikidata, it spells danger for social adoption.
Scope is also called domain by some language folks. Basically, two entries can be textually identical but still describe completely different topics, for example "web" as in fabric and as in networking.
In Wikipedia, similar concepts often get a common article, often without the differences being stated explicitly.
Sometimes differences go unnoticed because of cultural differences. Those can be very difficult to resolve.
Jeblad
2012/4/1 Gregor Hagedorn g.m.hagedorn@gmail.com
In general, Wikidata will not be able to replace all interwiki links: it will remain possible to define additional links in each Wikipedia to cover cases where the relationship between articles is not exact.
Would it be possible to include this functionality in Wikidata? In my wiki many people don't like interwiki bots "littering the page histories", and a great advantage of Wikidata could be cleaner histories.
Hi Gregor,
the main reply to your concerns is that we are not ruling out that further things will be done once the basic system is in place. Most of the doors that you mention will remain open. The approach in Wikidata is to deliver results early and frequently to the community. Therefore, we have deliberately limited our scope in various ways, focussing on our core tasks. This will already cover a huge amount of hitherto unmanaged data, and it already bears many challenges that need to be addressed first. But we certainly hope that this is only the starting point for Wikidata.
Cheers,
Markus
I certainly appreciate your experience in system design, Markus! Still, I feel strongly about this so I poke another time. :-)
My analysis is that one of the two doors should be closed in the first iteration. This iteration could start with a clean system that is easily analyzed and has the potential of a better learning curve (it does not delegate to the user the difficult decision of which door to use). If it turns out that both options of relations are needed, I believe it is possible to add them in the next iteration.
My interpretation of the options:
Solution 1 (my preference for the sake of simplicity):
A Wikidata page reflects an independent entity that may be an exact, close, narrower or broader match with several Wikipedia language versions. That is, each Wikidata page has relations to 0-n Wikipedias, Wiktionaries, or Commons (NEW PROPOSAL to extend beyond Wikipedia alone), each of which is labeled as an exact/close/narrower/broader match (SKOS vocabulary).
In addition (NEW PROPOSAL), it allows expressing relations to other definitions outside of Wikipedias, Wiktionaries, and Commons, especially where the entity is complex to understand or to map to Wikipedia entities. This would be a natural extension of the Wikipedias, Wiktionaries, or Commons to the entire semantic web.
An option to be researched over time would be, whether a "defining link" qualifier must be present on one relation, pointing either to a Wikipedia permalink to a given version or to an external permalink.
The great advantage of this system to me is that it can deal efficiently with data where an import is desired, but where the exact mapping to Wikipedia pages is difficult to ascertain.
Solution 2 (clear-cut version of your present proposal, perhaps the cleanest solution):
A wikidata page reflects 0-1 Wikipedias, Wiktionaries, or Commons pages. Relations such as exact/close/narrower/broader between the language versions (the interlanguage links) are stated only between Wikidata pages.
Advantage to me: a clear-cut design. As Wikipedia pages develop and become broader or narrower in scope, only the relations between Wikidata objects need to be changed. However, property data of the Wikidata page may become false as the linked Wikipedia page in a given language changes.
Solution 3 (your present preference to express relations with two structurally separate means):
A Wikidata page reflects 0-n Wikipedias, Wiktionaries, or Commons pages. If two pages are an exact or close match, they are stored as multiple Wikipedia links to a single Wikidata page. A differentiation between close and exact match is not possible.
If a given language version of Wikipedia is sufficiently different from another one, it must be linked to a newly created, independent Wikidata page. Relations between Wikidata pages can be qualified as narrower/broader (but not as close or exact match; these are required to share the same Wikidata page).
Advantages as I see them: none.
Potential problems with solution 3:
1. The user is expected to use two widely different actions depending on whether two language versions are sufficiently closely matching or not. The actions are structurally different, and it is unlikely that this can be hidden by the user interface (because Wikidata page object creation and deletion are involved). The burden of deciding where to draw the line is left to the user community. Revisions of this decision within the community require creating or deleting Wikidata objects, and are therefore likely to be difficult to make transparent.
2. Scenario: If two Wikipedia language versions describe more or less the same abstract object, but one is later revised to be narrower and the other to be broader, a careful study of the revisions since the creation of the Wikidata page object is required to decide which Wikidata page remains linked to a Wikipedia page, and for which revision a new one must be created. Or whether perhaps two new ones must be created?
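To compare Solution 1 and Solution 2 structurally, the following Python sketch rewrites one Solution 1 item as Solution 2 data. It is an illustration under assumptions, not a proposed implementation: the function name, the id scheme and the example titles are hypothetical, and it assumes the original item keeps no page of its own.

```python
from typing import Dict, List, Tuple

# Solution 1: one item with several labelled links (language, title, relation).
LabelledLink = Tuple[str, str, str]

# Solution 2: each item has at most one page; relations live between items,
# stored as (subject item id, relation, object item id).
ItemRelation = Tuple[str, str, str]


def to_solution_2(
    item_id: str, links: List[LabelledLink]
) -> Tuple[Dict[str, Tuple[str, str]], List[ItemRelation]]:
    """Rewrite one Solution 1 item as Solution 2 items plus relations.

    The original item keeps no page of its own (the "0" in 0-1); every linked
    page gets its own item, which is related back to the original item.
    """
    pages: Dict[str, Tuple[str, str]] = {}
    relations: List[ItemRelation] = []
    for language, title, relation in links:
        page_item = f"{item_id}/{language}"  # hypothetical id scheme
        pages[page_item] = (language, title)
        relations.append((page_item, relation, item_id))
    return pages, relations


pages, relations = to_solution_2(
    "item-1",
    [("en", "Topic", "exactMatch"), ("de", "Thema", "broaderMatch")],
)
```

In this reading, both solutions carry the same information. When the scope of a page changes, only one relation entry changes in either form, and no Wikidata object has to be created or deleted.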
Apologies for pestering you with this...
Gregor
On 02/04/12 19:52, Gregor Hagedorn wrote:
I certainly appreciate your experience in system design, Markus! Still, I feel strongly about this so I poke another time. :-)
We should take care not to overrate this topic. There are hundreds of thousands of articles that have a unique, exact match between different language versions. Cities, countries, people, works of art, species, astronomic objects, chemical elements, car models, airports, ... I could continue forever -- all of these entities have a clear-cut agreed-upon identity that is not language dependent and that suffices for most purposes. I would go further and say that, if a concept has no such clear identity, then it is much less useful to store data about it. I am not at all concerned that different language versions of Wikipedia will use largely disjoint sets of Wikidata items due to small (but somehow essential) differences in meaning. It will happen, but usually for good reasons or as a temporary problem that can be addressed.
Please note that it does not matter if different communities have a different policy about what is written in an article about, say, a car model. Of course, two such articles will never be matching exactly, and always have a bit more or less information. However, for Wikidata it is only important that they are about the same car model. This will occur in a large number of cases.
Regards,
Markus
Just for documentation, in the hope that this is discussed and analyzed later, when there is time:
Note that you use car models as a case where there is no problem. I used the same example previously as an example of problems... :-)
One Wikipedia has one article per car model revision (making it easy to store data on it), another has one article per model name (with different data for the various runs, like 2000-2006, 2006-current). The data about the revisions will almost always differ. But by how much? If any feature differs, the weight must differ, but the weight even differs with customization, so it is probably necessary to ignore this and subsume it under a higher class. Is a different length also irrelevant? Similarly, models that are sold under different names in different countries may or may not be sufficiently identical.
This will surface once we store data on the entities, whereas it is irrelevant for the present interlanguage links. The community will have to make decisions, and the Wikidata structural and user interaction model will have to support back-and-forth changes while a community discussion is ongoing. The reason I ask for an analysis of the options (I gave three) is that this must be supported by Wikidata.
2012/4/2 Markus Krötzsch markus.kroetzsch@cs.ox.ac.uk
We should take care not to overrate this topic. There are hundreds of thousands of articles that have a unique, exact match between different language versions.
Fortunately. But problems and surprises in informatics are always in the minority of cases. :-)
Cities,
Settlements are drawn together and dissected all the time. Rome may be one city in one Wikipedia, and separate articles on ancient Rome and modern Rome in another. The newest problem in huwiki is a small part of Budapest that has two articles, one for it as part of the modern city, and one for the historical standalone settlement (same place, same name). Some wikis draw articles together for notability reasons, others dissect because of extent.
countries,
These are really interesting! Can you tell what Germany is in terms of history? Or just Prussia? Prussia is a country whose name leads to a disambiguation page in many wikis, and there is no guarantee that the standalone articles will match. Or what about Yugoslavia as a country? One wiki may decide to write one article about the countries of this name, while another will handle them in several articles.
works of art,
One wiki writes one article about Leonardo's paintings, the other one for each. This may be handled by linking to section titles, perhaps? While entities have a clear meaning, articles about them may not exactly match.
astronomic objects,
The same: one article about the moons of Uranus vs. one for each.
Etc.
-- Bináris
We are going to look into the actual data very soon. If my current assumption holds, that more than 99% of all links are simple 1:1 links, the requirement for a more complex system might be deferred to later.
If there is a big fraction of language links currently representing a more intricate structure, we have to rethink that assumption. But until then, I hope we can start with the simpler solution.
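One way such a check of the actual data could look is sketched below in Python. This is an assumption about the method, not the team's actual analysis: pages connected by interlanguage links are grouped into clusters with a union-find structure, and a cluster counts as a simple 1:1 case if no language occurs in it twice. The function names and the example links are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Page = Tuple[str, str]  # (language id, title)


def find(parent: Dict[Page, Page], page: Page) -> Page:
    """Return the representative of the page's cluster (with path halving)."""
    parent.setdefault(page, page)
    while parent[page] != page:
        parent[page] = parent[parent[page]]
        page = parent[page]
    return page


def cluster_pages(links: Iterable[Tuple[Page, Page]]) -> List[List[Page]]:
    """Group pages connected by interlanguage links, ignoring link direction."""
    parent: Dict[Page, Page] = {}
    for source, target in links:
        parent[find(parent, source)] = find(parent, target)
    clusters: Dict[Page, List[Page]] = defaultdict(list)
    for page in list(parent):
        clusters[find(parent, page)].append(page)
    return list(clusters.values())


def is_simple(cluster: List[Page]) -> bool:
    """A cluster is a clean 1:1 case if no language occurs in it twice."""
    languages = [language for language, _ in cluster]
    return len(languages) == len(set(languages))


# Made-up links: "Subtopic" (en) also points at the German "Thema" article,
# so the first cluster contains two English pages and is not simple.
links = [
    (("en", "Topic"), ("de", "Thema")),
    (("de", "Thema"), ("en", "Topic")),
    (("en", "Subtopic"), ("de", "Thema")),
    (("en", "Other"), ("fr", "Autre")),
]
clusters = cluster_pages(links)
simple_share = sum(is_simple(c) for c in clusters) / len(clusters)
```

The share of simple clusters is then the figure to compare against the 99% assumption. Note that this counts clusters, not individual links, which is a design choice of the sketch.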
Cheers, Denny
2012/4/3 Denny Vrandečić denny.vrandecic@wikimedia.de
We are going to look into the actual data very soon. If my current assumption holds, that more than 99% of all links are simple 1:1 links, the requirement for a more complex system might be deferred to later.
Deferring is OK if it helps with a quicker start and won't cause problems later, when it is the turn of the more sophisticated cases (but this is really for the developers to judge).
I am interested in developing bots using the API, and I would appreciate a quick start with a smaller body so that basic scripts could be written and tested.
Without discussing each point, let me just pick out one example:
astronomic objects,
The same: one article about the moons of Uranus vs. one for each.
Wikidata can (and probably will) store information about each moon of Uranus, e.g., its mass. It probably does not make sense to store the mass of "Moons of Uranus" if there is such an article. It does not help to know that the article "Moons of Uranus" also talks (among other things) about some moon that has a particular mass: you need to know what *exactly* you are talking about to exploit this data. An article on "Moons of Uranus" could still (eventually) embed Wikidata data to improve its display, but this data must refer to individual moons, not to the article as a whole.
There is no question that there are many difficult modelling cases (one article or many? how many? etc.), but these cases will not go away in any of the alternative proposals that have been made in this thread.
Markus
Wikidata can (and probably will) store information about each moon of Uranus, e.g., its mass. It probably does not make sense to store the mass of "Moons of Uranus" if there is such an article. It does not help to know that the article "Moons of Uranus" also talks (among other things) about some moon that has a particular mass: you need to know what *exactly* you are talking about to exploit this data. An article on "Moons of Uranus" could still (eventually) embed Wikidata data to improve its display, but this data must refer to individual moons, not to the article as a whole.
The problem I see is that you have no definition of which real object the data are tied to. We agree that the problem is not the interwiki links per se. It is what results from them. How do we tie data to a Wikidata page when we don't know what it is about?
The label and the description together are meant to be identifying.
E.g. "Georgia - A country in the Caucasus", or "Frankfurt - A city in Hesse, Germany", etc.
Additionally, the Wikipedia links provide quite some guidance to it.
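If label and description together are meant to be identifying, that can be checked mechanically. The Python sketch below is a hypothetical illustration: the function name, the item ids and the deliberately short descriptions are assumptions. It reports groups of items that share the same label and description and therefore cannot be told apart by them.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def ambiguous_items(items: Dict[str, Tuple[str, str]]) -> List[List[str]]:
    """Return groups of item ids that share the same (label, description)."""
    by_key: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for item_id, (label, description) in items.items():
        by_key[(label, description)].append(item_id)
    return [ids for ids in by_key.values() if len(ids) > 1]


# Hypothetical items. The two "Georgia" items are told apart by their
# descriptions; the two "Frankfurt" items are not, because the description
# is too coarse (think of Frankfurt am Main and Frankfurt (Oder)).
items = {
    "item-1": ("Georgia", "a country"),
    "item-2": ("Georgia", "a U.S. state"),
    "item-3": ("Frankfurt", "a city in Germany"),
    "item-4": ("Frankfurt", "a city in Germany"),
}
clashes = ambiguous_items(items)
```

A check like this only detects exact clashes. It says nothing about whether a unique description is precise enough for a human editor to pick the right item.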
Cheers, Denny
On 5 April 2012 18:30, Denny Vrandečić denny.vrandecic@wikimedia.de wrote:
The label and the description together are meant to be identifying.
I.e. "Georgia - A country in central Asia", or "Frankfurt - A city in Hesse, Germany", etc.
Additionally, the Wikipedia links provide quite some guidance to it.
I believe it will be difficult to craft labels that work as definitions. A label is a hint, and may often be sufficiently precise for the majority of purposes. If we speak of "Germany", it is very hard to express in a simple string the different historical, geographical, and political delimitations that this term may carry.
In my own field of work even technical terms are often difficult to resolve to a definition. In biology, the width of taxon delimitations changes over time and with new research, and even technical terms in morphology often have quite different meanings, depending on the "school" that is being followed.
Or to cite a car example again: The label "Renault Kangoo" is unspecific as to its version/revision/release, so technical data that vary between these versions cannot be added to it. However, the subject of de.wikipedia.org/wiki/Nissan_Kubistar is in most Wikipedias subsumed under "Renault Kangoo". So it is a valid assumption that something labeled "Renault Kangoo" refers to both of these identical models sold under different names. But then, the "Nissan Kubistar" is only equivalent to the first version/revision/release of the "Renault Kangoo"...
This is not unsolvable, but if you want to import or add data to an element, it will be very hard to judge the correct concept from a short label. I was hoping that linking to Wikipedia articles would help, but this will be hard if a Wikidata page is linked to 40 Wikipedias, of which any given Wikidata editor can read only a handful, and with no support for distinguishing between exactMatch and closeMatch.
My suggestion is to allow a differentiation between exactMatch and closeMatch, to instruct editors to use at least one exact match, and to consider this one (or these) the defining Wikipedia page(s), whereas others are added as close matches.
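To make the suggestion concrete, here is a minimal sketch in Python of what a link carrying a relation might look like. It is only an illustration under assumptions: the class and function names (`Relation`, `WikipediaLink`, `defining_links`) are hypothetical, the example links are made up, and nothing here reflects the actual Wikidata data model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Relation(Enum):
    """Hypothetical relation values, named after the proposal in this thread."""
    EXACT = "exactMatch"
    CLOSE = "closeMatch"
    BROADER = "broaderMatch"
    NARROWER = "narrowerMatch"


@dataclass(frozen=True)
class WikipediaLink:
    """Wikipedialink = (Title, LanguageId, Relation, Badge?) as proposed."""
    title: str
    language_id: str
    # Assumption: imported interlanguage links default to "close match".
    relation: Relation = Relation.CLOSE
    badge: Optional[str] = None


def defining_links(links: List[WikipediaLink]) -> List[WikipediaLink]:
    """Return the links marked as exact matches (the 'defining' pages)."""
    return [link for link in links if link.relation is Relation.EXACT]


def has_defining_link(links: List[WikipediaLink]) -> bool:
    """Check the suggested rule: at least one exact match per Wikidata page."""
    return bool(defining_links(links))


# Made-up example data for a "Renault Kangoo" page.
kangoo = [
    WikipediaLink("Renault Kangoo", "en", Relation.EXACT),
    WikipediaLink("Renault Kangoo", "fr"),  # imported, stays closeMatch
    WikipediaLink("Nissan Kubistar", "de", Relation.NARROWER),
]

print(has_defining_link(kangoo))
print([link.language_id for link in defining_links(kangoo)])
```

An editor who cannot read all linked languages would then only need to consult the pages returned by `defining_links` to learn what the page is about.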
Of course, the label will remain useful for stumbling upon changes in the definition or width of a concept over time, and for correcting those after consulting the revision number against which the original link was formed (not present, but perhaps achievable by some timestamping and comparison?)
Gregor
Regarding definitions:
Note that I said "Label + Description is identifying", not merely the label. I assume this to be true because even for your example of "Germany", the disambiguation page works with rather short descriptions of each disambiguated page [1]. So even the fuzzy concept that you gave as an example seems to be sufficiently identifiable for the sake and mission of the Wikipedia community, which gives me reason to believe that the community can sort this out. I mean, they basically already have!
Regarding the Kangoo / Kubistar example:
In Wikidata they would be represented as two pages, one for the Kubistar (which would link to the Danish and German page for the Kubistar), and one for the Kangoo (which would link to the 20 language versions of the Kangoo article, including a Danish and a German one). This is a rather simple example, which would be easily expressed with the exact matches that we suggest.
In Wikidata, the Wikipedia links are planned to be inverse functional - i.e., every Wikipedia article in a specific language can only be linked to from one single Wikidata article. Two Wikidata pages cannot claim the same Wikipedia article in a single language as their defining article.
I.e. in the Kubistar/Kangoo example there would be two Wikidata pages. One about the Kubistar, linking to de:Nissan_Kubistar and da:Nissan_Kubistar, and one about the Kangoo, linking to the 20 different Kangoo articles. The Wikidata page for Kubistar could not link to any of those Kangoo articles.
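The inverse functional constraint can be sketched as a simple check. This is a hedged illustration only: the function name `find_conflicts`, the item identifiers, and the abbreviated link lists are all hypothetical and are not taken from any Wikidata implementation.

```python
def find_conflicts(items):
    """Find Wikipedia articles claimed by more than one Wikidata page.

    `items` maps a (hypothetical) page id to a list of (language, title)
    pairs. The result maps each conflicting (language, title) pair to the
    sorted list of page ids that claim it. An empty result means the
    links are inverse functional.
    """
    claimed = {}
    for item_id, links in items.items():
        for link in links:
            claimed.setdefault(link, set()).add(item_id)
    return {link: sorted(ids) for link, ids in claimed.items() if len(ids) > 1}


# Two pages as described above (only a few of the Kangoo links are shown).
valid = {
    "kangoo": [("en", "Renault Kangoo"), ("de", "Renault Kangoo"),
               ("da", "Renault Kangoo")],
    "kubistar": [("de", "Nissan Kubistar"), ("da", "Nissan Kubistar")],
}

# The same pages, but the Kubistar page also claims a Kangoo article.
invalid = {
    "kangoo": valid["kangoo"],
    "kubistar": valid["kubistar"] + [("en", "Renault Kangoo")],
}

print(find_conflicts(valid))
print(find_conflicts(invalid))
```

The first call finds no conflicts; the second reports that the English Kangoo article is claimed by both pages, which the planned constraint would reject.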
Please do not misunderstand, I am not categorically against non-exact matches or broader or narrower ones (or else I wouldn't be discussing this). But I haven't yet seen examples that convince me that the additional complexity of broader/narrower or inexact matches is required. As I said before, if we can model more than 99% of all language links with the suggested simple solution, I am reluctant to make it more complicated for the remaining <1%.
Cheers, Denny
P.S.: oh, yes, indeed! Thank you for this excellent and interesting discussion, it really does shed light on some of the aspects of the current draft of the data model, and will eventually improve it and sharpen the understanding of the model.
[1] https://en.wikipedia.org/wiki/Germany_(disambiguation)
Hi All,
Our data (using a 25-language dataset) agrees with Denny's. 99% of all connected components of the interlanguage link graph have only one article per language edition. This is something we looked into in some detail in our paper at ACM's CHI conference this year (http://www.brenthecht.com/papers/bhecht_CHI2012_omnipedia.pdf).
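For readers unfamiliar with the measure, the following Python sketch shows one way such a check could be computed. It is not the method used in the paper: it assumes interlanguage links are treated as undirected edges, and the sample links are made up for illustration.

```python
from collections import defaultdict


def connected_components(links):
    """Group articles into connected components.

    `links` is an iterable of pairs of (language, title) nodes; each pair
    is treated as an undirected edge (an assumption of this sketch).
    """
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)

    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        # Depth-first traversal collecting everything reachable from start.
        seen.add(start)
        stack = [start]
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return components


def is_unambiguous(component):
    """True if the component has at most one article per language edition."""
    languages = [language for language, _title in component]
    return len(languages) == len(set(languages))


# Made-up sample: one clean component and one with two German articles.
sample_links = [
    (("en", "Frankfurt"), ("de", "Frankfurt am Main")),
    (("en", "Renault Kangoo"), ("de", "Renault Kangoo")),
    (("de", "Nissan Kubistar"), ("en", "Renault Kangoo")),
]

components = connected_components(sample_links)
share = sum(is_unambiguous(c) for c in components) / len(components)
print(f"{share:.0%} of components have one article per language")
```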
However, it is important to point out that the 1% tends to contain articles that are of great general interest. Some English articles that occur in these situations include "author", "art", "indigenous people", "education", "privacy", "liberal arts", "computer science", "agriculture", "socialism", "army", etc. To a certain extent, this is to be expected. Where there is more global interest in a topic, there is going to be more ambiguity.
Just my two cents.
- Brent
Brent Hecht Ph.D. Candidate in Computer Science CollabLab: The Collaborative Technology Laboratory Northwestern University w: http://www.brenthecht.com e: brent@u.northwestern.edu
Thanks, Brent! I was hoping to get some numbers exactly from you :)
I am extremely curious what kind of statements people will make in the Wikidata pages about "art", "privacy", "agriculture", "army", etc. I am looking forward to seeing what the community will add there. That'll be fun to watch :)
(Usually, such things tend to be retroactively obvious, but extremely hard to predict :) )
Cheers, Denny
-- Project director Wikidata Wikimedia Deutschland e.V. | Eisenacher Straße 2 | 10777 Berlin Tel. +49-30-219 158 26-0 | http://wikimedia.de
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter der Nummer 23855 B. Als gemeinnützig anerkannt durch das Finanzamt für Körperschaften I Berlin, Steuernummer 27/681/51985.
On 04/04/12 23:23, Gregor Hagedorn wrote:
The problem I see is that you have no definition to which real object the data are tied. We agree that the problem is not the interwiki links per se. It is what results from it. How do we tie data to a wikidata page when we don't know what it is about?
This is a hard question. The best answer I can come up with now (on the bus to Oxford) is as follows: the meaning of Wikidata items is subject to social agreement, based on shared experience, communication, and human-language documentation. The latter is provided in labels and descriptions, in Wikipedia articles that are connected to a Wikidata item, and also in Wikidata property pages that document properties.
I know that this may not be a satisfactory answer to your question of how we can *really* *know* what a Wikidata item is about. If you want to dig deeper into this issue, there is a lot of interesting literature, which can give you many more details than I can. What we are dealing with is the well-known philosophical problem of /grounding/. In essence, the state of discussion boils down to the following: there is no known way of connecting the symbols of a purely symbolic system (such as a computer program) to real-world objects in a formal way. Going deeper into the discussion reveals that there is also no agreed-upon way to clarify the meaning of "real" and "object" in the first place.
In spite of all this, humans somehow manage to understand each other, which brings us to the point of how amazing they all are :-) Wikidata is but a humble technical tool that provides an environment for articulating and (I hope) improving this understanding in a novel way. This cannot provide a formal grounding, but it might come as close to this ideal as we have gotten yet.
Regards,
Markus
I believe I fully agree with what you write. And I believe we might also agree that the present Wikipedia lemmata (pages) are a huge achievement towards these definitions. It is imperfect, frail, everything, but a huge achievement.
My perspective probably differs from yours in only one point: Of course it is possible to start from scratch and have a totally new community start defining the Wikidata pages in a consistent, well-defined manner, analysing the dimensions of misunderstanding that no single member even anticipates but which surface in a community and when working with the definitions over time. However, I think this is unlikely to happen. It amounts to calling for a big crowdsourcing effort that magically appears and does the work.
My conclusions:
1. I believe it is a good feature that Wikidata allows defining concepts outside of Wikipedia.
2. I believe the Wikidata design should take more care to expressly align itself with certain, well-defined Wikipedia pages, rather than requiring either of:
   a) a new community to redo all definitions and delimitations inside Wikidata
   b) all re-users of Wikidata content inside and outside of Wikipedia to read all linked Wikipedias in all languages and understand the commonality of the concept behind them.
   Distinguishing between closeMatch and exactMatch may do it, or alternatively a new "definingLink" relation may be called for (I am not sure which).
3. As expressed in a separate thread, I believe the links should be broader than Wikipedias, at least including Wiktionaries and Commons, but possibly much more.
4. Personally, I would include the concepts of narrower/broaderMatch in the link relation role labels rather than delegating this expressiveness to another part.
Thanks to all of you for this excellent discussion!
Gregor