The Gene Wiki team is experiencing a problem that may suggest some areas for improvement in the general Wikidata experience.
When our project was getting started, we had some fairly long public debates about how we should structure the data we wanted to load [1]. These resulted in a data model that, we think, remains pretty much true to the semantics of the data, at the cost of distributing information about closely related things (genes, proteins, orthologs) across multiple, interlinked items. As long as the semantic links between the different item classes are maintained, this works out great. However, we are consistently seeing people merge items that our model needs to keep distinct. Most commonly, people merge items about genes with items about the protein product of the gene (e.g. [2]). This happens nearly every day, especially on items related to the more popular Wikipedia articles (more examples in [3]).
Merges like this, as well as other semantics-breaking edits, make it very challenging to build downstream apps (like the Wikipedia infobox) that depend on having certain structures in place. My question to the list is: how can we best protect semantic models that span multiple entity types in Wikidata? Related to this, is there an opportunity for some consistent way of explaining these structures to the community when they exist?
I guess the immediate solutions are to (1) write another bot that watches for model-breaking edits and reverts them and (2) create an article on Wikidata somewhere that succinctly explains the model and links back to the discussions that went into its creation.
It seems that anyone who works beyond a single entity type is going to face the same kind of problems, so I'm posting this here in hopes that generalizable patterns (and perhaps even supporting code) can be realized by this community.
[1] https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Molecular_biology#Di...
[2] https://www.wikidata.org/w/index.php?title=Q417782&oldid=262745370
[3] https://s3.amazonaws.com/uploads.hipchat.com/25885/699742/rTrv5VgLm5yQg6z/me...
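One way the watching bot in (1) might detect a model-breaking merge is sketched below. It assumes that a merge leaves one item redirecting to the other, and that the bot already has a list of gene/protein pairs that must stay distinct plus a table of current redirects. The function names and the item identifiers are hypothetical placeholders, not part of any existing bot.

```python
from typing import Dict, Iterable, List, Tuple


def resolve(qid: str, redirects: Dict[str, str]) -> str:
    """Follow redirects until reaching an item that is not redirected."""
    seen = set()
    # The 'seen' set guards against a (malformed) redirect cycle.
    while qid in redirects and qid not in seen:
        seen.add(qid)
        qid = redirects[qid]
    return qid


def find_collapsed_pairs(
    pairs: Iterable[Tuple[str, str]], redirects: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Return the (gene, protein) pairs that now resolve to a single item."""
    return [
        (gene, protein)
        for gene, protein in pairs
        if resolve(gene, redirects) == resolve(protein, redirects)
    ]


# Hypothetical identifiers, for illustration only.
protected_pairs = [("Q_GENE_A", "Q_PROTEIN_A"), ("Q_GENE_B", "Q_PROTEIN_B")]
redirects_after_merge = {"Q_PROTEIN_A": "Q_GENE_A"}

collapsed = find_collapsed_pairs(protected_pairs, redirects_after_merge)
print(collapsed)
```

A real bot would still need to decide whether to revert automatically or only report the collapsed pairs for human review.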
Do you think it is a general problem? The few merges that I checked were all done by Fullerene, and s/he has now responded after Andrawaag left a note on the talk page: https://www.wikidata.org/wiki/User_talk:Fullerene
/Finn
On 10/28/2015 06:07 PM, Benjamin Good wrote:
[snip]
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Yes, I think the problem of maintaining a multi-class data model within Wikidata is a general problem. You could imagine similar scenarios in any domain.
Our particular gene/protein merge problem is specific to our work. It is not just one user (Fullerene), though; this has been happening for a while and many users have participated. See e.g. the posts here: https://www.wikidata.org/wiki/User_talk:Andrawaag#ProteinBoxBot_Mistake.3F and here: https://www.wikidata.org/wiki/User_talk:DGtal#Merging_items
On Wed, Oct 28, 2015 at 10:47 AM, Finn Årup Nielsen fn@imm.dtu.dk wrote:
[snip]
The SPARQL query below finds 14 items.
Among them is https://www.wikidata.org/wiki/Q238509, which is "5-HT1A receptor human gene" in English and "5-HT₁A-Rezeptor Protein" in German. The last editor is ProteinBoxBot. It is coded by itself. That item has a split personality, so it seems that we need to do some cleaning.
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX v: <http://www.wikidata.org/prop/statement/>
PREFIX q: <http://www.wikidata.org/prop/qualifier/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?item WHERE {
  ?item wdt:P352 ?uniprot ;
        wdt:P353 ?genesymbol .
}
I now see that Teugnhausen has also merged https://www.wikidata.org/wiki/Special:Contributions/Teugnhausen
/Finn
Yup, that is characteristic of our problem. The last ProteinBoxBot edit was about the protein item. This query also works (it finds things that are both a subclass of gene and a subclass of protein):
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
SELECT * WHERE {
  ?gene wdt:P279 wd:Q7187 .
  ?gene wdt:P279 wd:Q8054 .
}
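For anyone who wants to run this kind of check from a script, here is a minimal Python sketch. It only builds the query text and pulls item IDs out of a response in the standard SPARQL JSON results format; sending the query to an endpoint is left out. The helper names are hypothetical, and the sample response is hand-written for illustration, not real query output.

```python
from typing import Any, Dict, List

ENTITY_PREFIX = "http://www.wikidata.org/entity/"


def dual_class_query(class_a: str, class_b: str) -> str:
    """Build a SPARQL query for items that are a subclass of both classes."""
    return (
        "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
        "PREFIX wd: <http://www.wikidata.org/entity/>\n"
        "SELECT ?item WHERE {\n"
        f"  ?item wdt:P279 wd:{class_a} .\n"
        f"  ?item wdt:P279 wd:{class_b} .\n"
        "}"
    )


def item_ids(results: Dict[str, Any], variable: str = "item") -> List[str]:
    """Extract bare item IDs (e.g. 'Q42') from SPARQL JSON results."""
    ids = []
    for row in results["results"]["bindings"]:
        if variable not in row:
            continue
        value = row[variable]["value"]
        if value.startswith(ENTITY_PREFIX):
            value = value[len(ENTITY_PREFIX):]
        ids.append(value)
    return ids


# Q7187 (gene) and Q8054 (protein) are the classes used in the query above.
query = dual_class_query("Q7187", "Q8054")

# Hand-written sample response, for illustration only.
sample_response = {
    "head": {"vars": ["item"]},
    "results": {
        "bindings": [
            {"item": {"type": "uri", "value": ENTITY_PREFIX + "Q238509"}}
        ]
    },
}

print(query)
print(item_ids(sample_response))
```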
On Wed, Oct 28, 2015 at 11:04 AM, Finn Årup Nielsen fn@imm.dtu.dk wrote:
[snip]
One approach might be to see whether the merge script could check for property P1889 "different from", and refuse to merge any pairs marked in that way.
Then you could specify that all of the gene items were intentionally different from the protein items.
-- James.
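A rough sketch of the check suggested here, in Python. It assumes the merge script can look up each item's P1889 ("different from") values; in this sketch they are passed in as a plain dictionary. The function names and item identifiers are hypothetical, and this is not how the existing merge script is written.

```python
from typing import Dict, Set


def marked_different(a: str, b: str, different_from: Dict[str, Set[str]]) -> bool:
    """True if either item lists the other under P1889 ("different from")."""
    return b in different_from.get(a, set()) or a in different_from.get(b, set())


def merge_allowed(a: str, b: str, different_from: Dict[str, Set[str]]) -> bool:
    """Refuse a merge of an item with itself or with a 'different from' item."""
    return a != b and not marked_different(a, b, different_from)


# Hypothetical identifiers: the gene item declares itself different from the protein item.
p1889 = {"Q_GENE": {"Q_PROTEIN"}}

print(merge_allowed("Q_GENE", "Q_PROTEIN", p1889))  # statement on the gene blocks the merge
print(merge_allowed("Q_PROTEIN", "Q_GENE", p1889))  # blocked in the other direction too
print(merge_allowed("Q_GENE", "Q_OTHER", p1889))    # unrelated items may merge
```

Checking both directions means the statement only has to be present on one of the two items for the merge to be refused.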
On 28/10/2015 18:10, Benjamin Good wrote:
[snip]
This is a deep-seated semantic confusion going back to at least 2006 [1], when the Protein Infobox had Entrez and OMIM gene IDs. Freebase naively adopted this in its initial protein schema in 2007, when it was importing from those infoboxes. Although it made some progress in improving the schema later, anything not aligned with how Wikipedians want to do things is shoveling against the tide. It's also very difficult to manage equivalences when Wikipedia articles are about multiple things, like the protein/gene articles.
If you look at the recent merge of Reelin [3] you can see that it was done by the same user who contributed substantially to the article back in 2006 [4], so, as the "owner" of that article, they clearly know what's best. :-) It's going to be very difficult to get people to unlearn a decade of habits.
Another issue is that, as soon as you start trying to split things out into semantically clean pieces, you immediately run afoul of the notability restrictions. Because human (and mouse) genes don't have their own Wikipedia pages, they're clearly not notable, so they can't be added to Wikidata.
This problem of chunking by notability (or lack thereof), length of text article, relatedness, and other attributes rather than semantic individuality is much more widespread than just proteins/genes. It also affects things like pairs (or small sets) of people who aren't notable enough to have an article on their own, articles which contain infoboxes about people who aren't notable, so they got tacked onto a related article to give them a home, etc.
The inverse problem exists as well, where a single semantic topic is broken up into multiple articles purely for reasons of length. Other types of semantic mismatches include articles along precoordinated facets like Transportation in New York City (or even History of Transportation in New York City!) and list articles (* Filmography, * Discography, * Videography, List of *). Of course, some lists, like the Fortune 500, make sense to talk about as entities, but most Wikipedia lists are just mechanically generated things for human browsing which don't really need a semantic identifier. Freebase deleted most of this Wikipedia cruft.
Going back to Ben's original problem, one tool that Freebase used to help manage incompatible type merges was a curated collection of sets of incompatible types [5], which the merge tools used to warn users that the merge they were proposing probably wasn't a good idea. People could ignore the warning in the Freebase implementation, but Wikidata could make it either a hard restriction or just a warning.
Tom
[1] https://en.wikipedia.org/w/index.php?title=Reelin&diff=56108806&oldi...
[2] http://www.freebase.com/biology/protein/entrez_gene_id
[3] https://www.wikidata.org/w/index.php?title=Q414043&type=revision&dif...
[4] https://en.wikipedia.org/w/index.php?title=Reelin&dir=prev&action=hi...
[5] http://www.freebase.com/dataworld/incompatible_types?instances=
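To make the incompatible-types idea concrete, here is a small Python sketch of a merge check driven by curated sets of incompatible types, with a switch between "warn" and "hard restriction". This is not the Freebase implementation; the function names, the exception class, and the `hard` flag are assumptions made for illustration. Q7187 (gene) and Q8054 (protein) are the classes used in Ben's query earlier in the thread.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple


class IncompatibleMergeError(Exception):
    """Raised when a merge is blocked in hard-restriction mode."""


def find_clashes(
    types_a: Set[str], types_b: Set[str], incompatible_sets: Iterable[FrozenSet[str]]
) -> List[Tuple[str, str]]:
    """Return (type of A, type of B) pairs that fall in the same incompatible set."""
    clashes = set()
    for group in incompatible_sets:
        for x in types_a & group:
            for y in types_b & group:
                if x != y:
                    clashes.add((x, y))
    return sorted(clashes)


def check_merge(
    types_a: Set[str],
    types_b: Set[str],
    incompatible_sets: Iterable[FrozenSet[str]],
    hard: bool = False,
) -> List[str]:
    """Return warnings for a proposed merge, or raise if hard mode is enabled."""
    warnings = [
        f"{x} is incompatible with {y}"
        for x, y in find_clashes(types_a, types_b, incompatible_sets)
    ]
    if hard and warnings:
        raise IncompatibleMergeError("; ".join(warnings))
    return warnings


# One curated set: an item should not be both a gene (Q7187) and a protein (Q8054).
curated = [frozenset({"Q7187", "Q8054"})]

print(check_merge({"Q7187"}, {"Q8054"}, curated))  # warning mode: returns messages
print(check_merge({"Q7187"}, {"Q7187"}, curated))  # same type on both sides: no warning
```

With `hard=True` the same clash raises `IncompatibleMergeError` instead of returning a list, which corresponds to the hard-restriction option.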
On Wed, Oct 28, 2015 at 1:07 PM, Benjamin Good ben.mcgee.good@gmail.com wrote:
[snip]
BTW, merges aren't the only problem. For all languages except English, it's the protein Wikidata item [1] that points to the corresponding Wikipedia page, while for English it's the gene item [2] that points to the corresponding English article [3].
[1] https://www.wikidata.org/wiki/Q13561329
[2] https://www.wikidata.org/wiki/Q414043
[3] https://en.wikipedia.org/wiki/Reelin
On Wed, Oct 28, 2015 at 3:08 PM, Tom Morris tfmorris@gmail.com wrote:
[snip]
Tom is right. And we loved the Rules engine that we had in place to deal with incompatible types. It caught lots of nasty things that would go unnoticed and lower our data quality in Freebase.
As I said before in another thread, Wikidata needs some sort of Rules engine to deal with the cross domain issues or incompatible statements during merges, etc, as Ben is experiencing.
Thad +ThadGuidry https://www.google.com/+ThadGuidry
It sounds like Tom and James have basically the same idea for our particular problem, which I would support: enable a warning in the merge script when incompatible types are detected. These would have to be encoded somehow though, presumably in the property constraints.
Tom, a couple of points of clarification:
Tom>>> Another issue is that, as soon as you start trying to split things out into semantically clean pieces, you immediately run afoul of the notability restrictions. Because human (and mouse) genes don't have their own Wikipedia pages, they're clearly not notable, so they can't be added to Wikidata.
That hasn't been a problem for us so far. We have already added all human and all mouse genes to Wikidata and people seem okay with that.
Tom>>> For all languages except English, it's the protein Wikidata item [1] that points to the corresponding Wikipedia page, while for English it's the gene item [2] that points to the corresponding English article [3].
I don't think that this is ubiquitously true, though it is true in many cases. This happened because the original imports from Wikipedia tagged the Wikidata items about genes/proteins as proteins. We converted all the EN Wikilinks that we knew about programmatically but shied away from doing that for all the other languages.
On Wed, Oct 28, 2015 at 12:22 PM, Tom Morris tfmorris@gmail.com wrote:
BTW, merges aren't the only problem. For all languages except English, it's the protein Wikidata item [1] that points to the corresponding Wikipedia page, while for Engish it's the gene item [2] that points to the corresponding English article [3].
[1] https://www.wikidata.org/wiki/Q13561329 [2] https://www.wikidata.org/wiki/Q414043 [3] https://en.wikipedia.org/wiki/Reelin
On Wed, Oct 28, 2015 at 3:08 PM, Tom Morris tfmorris@gmail.com wrote:
This is a deep-seated semantic confusion going back to at least 2006 [1] when the Protein Infobox had Entrez and OMIM gene IDs. Freebase naively adopted in its initial protein schema in 2007 when it was importing from those infoboxes. Although it made some progress in improving the schema later, anything not aligned with how Wikipedians want to do things is shoveling against the tide. It's also very difficult to manage equivalences when Wikipedia articles are about multiple things like the protein/gene articles.
If you look at the recent merge of Reelin [3] you can see that it was done by the same user who contributed substantially to the article back in 2006 [4], so, as the "owner" of that article, they clearly know what's best. :-) It's going to be very difficult to get people to unlearn a decade of habits.
Another issue is that, as soon as you start trying to split things out into semantically clean pieces, you immediately run afoul of the notability restrictions. Because human (and mouse) genes don't have their own Wikipedia pages, they're clearly not notable, so they can't be added to Wikidata.
This problem of chunking by notability (or lack thereof), length of text article, relatedness, and other attributes rather than semantic individuality is much more widespread than just proteins/genes. It also affects things like pairs (or small sets) of people who aren't notable enough to have an article on their own, and articles which contain infoboxes about people who aren't notable, so they got tacked onto a related article to give them a home, etc.
The inverse problem exists as well where a single semantic topic is broken up into multiple articles purely for reasons of length. Other types of semantic mismatches include articles along precoordinated facets like Transportation in New York City (or even History of Transportation in New York City!), list articles (* Filmography, * Discography, * Videography, List of *). Of course, some lists, like the Fortune 500, make sense to talk about as entities, but most Wikipedia lists are just mechanically generated things for human browsing which don't really need a semantic identifier. Freebase deleted most of this Wikipedia cruft.
Going back to Ben's original problem, one tool that Freebase used to help manage the problem of incompatible type merges was a set of curated sets of incompatible types [5] which was used by the merge tools to warn users that the merge they were proposing probably wasn't a good idea. People could ignore the warning in the Freebase implementation, but Wikidata could make it a hard restriction or just a warning.
Tom
[1] https://en.wikipedia.org/w/index.php?title=Reelin&diff=56108806&oldi... [2] http://www.freebase.com/biology/protein/entrez_gene_id [3] https://www.wikidata.org/w/index.php?title=Q414043&type=revision&dif... [4] https://en.wikipedia.org/w/index.php?title=Reelin&dir=prev&action=hi... [5] http://www.freebase.com/dataworld/incompatible_types?instances=
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
On Wed, Oct 28, 2015 at 3:55 PM, Benjamin Good ben.mcgee.good@gmail.com wrote:
It sounds like Tom and James have basically the same idea for our particular problem, which I would support: enable a warning in the merge script when incompatible types are detected. These would have to be encoded somehow though - presumably in the property constraints.
I think they differ semantically in that one operates at the class/type level, while the other operates on pairs of instances (if I understand the property's semantics). Another more general check might be to see if the proposed merge will result in any property values, such as P688 encodes, which point to themselves. That's usually a sign of a structural problem.
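The self-reference check described above could be sketched roughly as follows. This is only an illustration, not anything the merge tools actually implement: the function name, the dictionary layout for claims, and the assumption that the Reelin gene item carries a P688 claim pointing at the protein item are all made up for the example.

```python
from typing import Dict, Set

# Claims for one item: property ID -> set of target item IDs (assumed layout).
Claims = Dict[str, Set[str]]


def self_references_after_merge(
    keep_id: str, drop_id: str, keep_claims: Claims, drop_claims: Claims
) -> Dict[str, Set[str]]:
    """Return {property: targets} for claims that would point at the merged item."""
    merged_ids = {keep_id, drop_id}
    problems: Dict[str, Set[str]] = {}
    for claims in (keep_claims, drop_claims):
        for prop, targets in claims.items():
            # Any target that is one of the two merged items becomes a self-link.
            hits = targets & merged_ids
            if hits:
                problems.setdefault(prop, set()).update(hits)
    return problems


# Illustrative data: gene item Q414043 "encodes" (P688) protein item Q13561329.
gene_claims: Claims = {"P688": {"Q13561329"}}
protein_claims: Claims = {}
print(self_references_after_merge("Q414043", "Q13561329", gene_claims, protein_claims))
```

A non-empty result would be the cue for a warning before the merge goes through.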
Tom>>> For all languages except English, it's the protein Wikidata item [1] that points to the corresponding Wikipedia page, while for English it's the gene item [2] that points to the corresponding English article [3].
I don't think that this is ubiquitously true, though it is true in many cases. This happened because the original imports from Wikipedia tagged the wikidata items about gene/proteins as proteins. We converted all the EN Wikilinks that we knew about programmatically but shied away from doing that for all the other languages.
Sorry, I didn't mean to imply that it was generally true, but rather true for the example I was looking at (Reelin [3]). Since the opening sentence begins "Reelin is a large secreted extracellular matrix glycoprotein ...," I'd say that the article is about a protein [1], yet it's linked to a gene [2]. For articles which are about multiple Wikidata items, I guess another possible answer is that they shouldn't be linked to any item (or perhaps to all related items if that's technically possible).
[1] https://www.wikidata.org/wiki/Q13561329 [2] https://www.wikidata.org/wiki/Q414043 [3] https://en.wikipedia.org/wiki/Reelin
Tom
Of course, Wikidata (by design) doesn't have formal typing of items; and it can be pretty domain-specific (and fluid) as to what aspects from different classes can or can't be combined on the same item.
So I think general checks that operate at the class/type level would be hard to specify. On the other hand, identifying particular pairs which should not be merged should, I think, be comparatively easy to record, and comparatively easy to act on.
-- James.
On 10/28/2015 12:08 PM, Tom Morris wrote: [...]
Going back to Ben's original problem, one tool that Freebase used to help manage the problem of incompatible type merges was a set of curated sets of incompatible types [5] which was used by the merge tools to warn users that the merge they were proposing probably wasn't a good idea. People could ignore the warning in the Freebase implementation, but Wikidata could make it a hard restriction or just a warning.
Tom
I think that this idea is a good one. The incompatibility information could be added to classes in the form of "this class is disjoint from that other class". Tools would then be able to look for this information and produce warnings or even have stronger reactions to proposed merging.
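As a rough illustration of the class-level idea, a merge tool could consult a curated set of disjoint class pairs before merging. Everything here is hypothetical: the class labels stand in for real class items, and the table and function names are invented for the sketch.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

# Hypothetical curated table: each entry is a pair of classes that must
# never share an instance. Labels are placeholders for class item IDs.
DISJOINT_CLASSES: Set[FrozenSet[str]] = {
    frozenset({"gene", "protein"}),
    frozenset({"event", "airplane"}),
}


def disjointness_conflicts(
    classes_a: Iterable[str],
    classes_b: Iterable[str],
    disjoint: Set[FrozenSet[str]] = DISJOINT_CLASSES,
) -> List[Tuple[str, str]]:
    """List (class of A, class of B) pairs that are declared disjoint."""
    conflicts = []
    classes_b = list(classes_b)  # allow re-iteration for every class of A
    for a in classes_a:
        for b in classes_b:
            if frozenset({a, b}) in disjoint:
                conflicts.append((a, b))
    return conflicts


# A tool could warn, or refuse outright, when this list is non-empty.
print(disjointness_conflicts(["gene"], ["protein"]))
```

Because the check works on classes rather than on individual item pairs, it would also catch a merge between a gene and an unrelated protein.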
I'm not sure that using P1889 "different from" is going to be adequate. What links would be needed? Just between a gene and its protein? That wouldn't catch merging a gene and a related protein. Between all genes and all proteins? It seems to me that this is better handled at the class level.
peter
I fear my games may contribute to both problems (merging two items, and adding a sitelink to the wrong item). Both are facilitated by identical names/aliases, and sometimes it's hard to tell that a pair is meant to be different, especially if you don't know about the intricate structures of the respective knowledge domain.
An item-specific, but somewhat heavy-handed approach would be to prevent merging of any two items where at least one has P1889, no matter what it specifically points to. At least, give a warning that an item is "merge-protected", and require an additional override for the merge.
If that is acceptable, it would be easy for me to filter all items with P1889 from the merge game at least.
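A minimal sketch of that filter, assuming the game holds candidate pairs as tuples and has each item's claims available as a dictionary (both are assumptions for illustration; the game's real data structures are not described in this thread):

```python
from typing import Dict, Iterable, List, Set, Tuple

Claims = Dict[str, Set[str]]  # property ID -> target item IDs (assumed layout)
DIFFERENT_FROM = "P1889"


def is_merge_protected(item_id: str, claims_by_item: Dict[str, Claims]) -> bool:
    """An item counts as protected if it has any P1889 value at all."""
    return bool(claims_by_item.get(item_id, {}).get(DIFFERENT_FROM))


def filter_merge_candidates(
    pairs: Iterable[Tuple[str, str]], claims_by_item: Dict[str, Claims]
) -> List[Tuple[str, str]]:
    """Drop every candidate pair in which at least one item is protected."""
    return [
        (a, b)
        for a, b in pairs
        if not (is_merge_protected(a, claims_by_item) or is_merge_protected(b, claims_by_item))
    ]


# "item-a" etc. are placeholder IDs, not real items.
claims = {"item-a": {DIFFERENT_FROM: {"item-b"}}}
print(filter_merge_candidates([("item-a", "item-c"), ("item-c", "item-d")], claims))
```

Note that this deliberately ignores what P1889 points to, which is the heavy-handed part: a pair is dropped even when the "different from" target is some third item.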
I think that using P1889 in this way is abusing its meaning.
Q16657504 P1889 Q6525093 doesn't mean that Q16657504 should not be merged with some other human item in Wikidata.
peter
For what it's worth, I tend to agree with Peter here. It makes sense to me to add constraints akin to 'disjoint with' at the class level. The problem I see is that we don't exactly have classes here as the term is used elsewhere. I guess in Wikidata, a 'class' is any entity that happens to be used in a subclassOf claim?
Another way forward could be to do this using properties rather than classes. I think this could allow us to use the constraint-checking infrastructure that is already in place? You could add a constraint on a property that it is 'incompatible with' another property. In the protein/gene case we could pragmatically use Property:P351 (Entrez Gene ID), incompatible with Property:P352 (UniProt protein ID). More semantically, we could use 'encoded by' incompatible-with 'encodes' or 'genomic start'.
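To make the property-based variant concrete, here is a small sketch. The 'incompatible with' table and the function name are hypothetical; no such constraint type is claimed to exist in the current constraint system.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

# Hypothetical 'incompatible with' table between properties.
INCOMPATIBLE_PROPERTIES: Set[FrozenSet[str]] = {
    frozenset({"P351", "P352"}),  # Entrez Gene ID vs. UniProt protein ID
}


def incompatible_property_pairs(
    props_a: Iterable[str],
    props_b: Iterable[str],
    incompatible: Set[FrozenSet[str]] = INCOMPATIBLE_PROPERTIES,
) -> List[Tuple[str, ...]]:
    """Return incompatible property pairs that the merged item would carry."""
    merged = set(props_a) | set(props_b)
    # A pair is violated when both of its properties end up on the merged item.
    return sorted(tuple(sorted(pair)) for pair in incompatible if pair <= merged)


# A gene item with P351 merged into a protein item with P352 trips the check.
print(incompatible_property_pairs({"P351"}, {"P352"}))
```

The same function also flags a single existing item that already carries both properties, which is how it could double as an ordinary constraint report.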
On 29.10.2015 05:41, Benjamin Good wrote:
For what it's worth, I tend to agree with Peter here. It makes sense to me to add constraints akin to 'disjoint with' at the class level.
+1 for having this. This does not preclude having an additional mechanism on the instance level if needed to augment the main thing, but the classes are an easier way to start.
This can also help with detecting other issues that are unrelated to merging. For instance, nothing should be an event and an airplane at the same time.
We need a common approach on how to deal with ambiguous Wikipedia articles. One option would be to create an "auxiliary" item that is not linked to Wikipedia in such a case, but that is used to represent some aspects of the "main" item that would otherwise be incompatible.
Benjamin is right that these issues are not specific to the bio domain. It's rather the opposite: the bio domain is one of the domains that is advanced enough to notice these problems ...
The problem I see is that we don't exactly have classes here as the term is used elsewhere. I guess in Wikidata, a 'class' is any entity that happens to be used in a subclassOf claim?
In this case, one can leave this to the user: two items that are specified to be disjoint classes are classes.
In the Wikidata Taxonomy Browser, we consider items as classes if one of the following is true: (1) they have a "subclass of" statement (2) they are the target of a "subclass of" statement (3) they are the target of an "instance of" statement
We then (mostly) ignore the classes that do not have their own instances or subclasses (the "leaves" of the taxonomy), since there are very many of these:
* The above criterion leads to over 200,000 class items.
* Only about 20,000 of them have instances or subclasses.
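The three criteria and the leaf filter can be sketched as below. This is not the Taxonomy Browser's code; the triple representation and function names are assumptions, and only P279 ("subclass of") and P31 ("instance of") are real identifiers.

```python
from typing import Iterable, Set, Tuple

Statement = Tuple[str, str, str]  # (subject item, property, target item)
SUBCLASS_OF = "P279"
INSTANCE_OF = "P31"


def find_classes(statements: Iterable[Statement]) -> Set[str]:
    """Items that satisfy any of the three class criteria."""
    classes: Set[str] = set()
    for subject, prop, target in statements:
        if prop == SUBCLASS_OF:
            classes.add(subject)  # (1) has a "subclass of" statement
            classes.add(target)   # (2) is the target of "subclass of"
        elif prop == INSTANCE_OF:
            classes.add(target)   # (3) is the target of "instance of"
    return classes


def classes_with_members(statements: Iterable[Statement]) -> Set[str]:
    """Classes that have at least one instance or subclass (non-leaf classes)."""
    return {t for _, prop, t in statements if prop in (SUBCLASS_OF, INSTANCE_OF)}


# Placeholder IDs: B and C are subclasses of A, and x is an instance of B.
data = [("B", SUBCLASS_OF, "A"), ("C", SUBCLASS_OF, "A"), ("x", INSTANCE_OF, "B")]
print(sorted(find_classes(data)), sorted(classes_with_members(data)))
```

In the toy data, C is a class by criterion (1) but has no instances or subclasses, so it is one of the leaf classes that would be mostly ignored.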
Another way forward could be to do this using properties rather than classes. I think this could allow us to use the constraint-checking infrastructure that is already in place? You could add a constraint on a property that it is 'incompatible with' another property. In the protein/gene case we could pragmatically use Property:P351 (Entrez Gene ID), incompatible with Property:P352 (UniProt protein ID). More semantically, we could use 'encoded by' incompatible-with 'encodes' or 'genomic start'.
I think the constraint checking infrastructure should be able to handle both approaches equally well. If "disjoint with" is a statement, one could even check this constraint in SPARQL (possibly further restricting to query only for constraint violations in a particular domain).
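For illustration, such a SPARQL check might be generated like this. No "disjoint with" property exists yet, so its ID is passed in as a parameter and the one used in the example is a placeholder; the `wdt:` prefix and the P31/P279 path follow the usual query-service conventions. The query has not been run against any endpoint.

```python
import re


def build_disjointness_query(disjoint_prop: str, limit: int = 100) -> str:
    """Build a SPARQL query listing items that are instances of two disjoint classes."""
    if not re.fullmatch(r"P\d+", disjoint_prop):
        raise ValueError(f"not a property ID: {disjoint_prop!r}")
    if limit < 1:
        raise ValueError("limit must be positive")
    return (
        "SELECT ?item ?classA ?classB WHERE {\n"
        f"  ?classA wdt:{disjoint_prop} ?classB .\n"
        "  ?item wdt:P31/wdt:P279* ?classA .\n"
        "  ?item wdt:P31/wdt:P279* ?classB .\n"
        "}\n"
        f"LIMIT {limit}"
    )


# Placeholder ID only; it does not name a real "disjoint with" property.
HYPOTHETICAL_DISJOINT_PROP = "P9999"
print(build_disjointness_query(HYPOTHETICAL_DISJOINT_PROP, limit=50))
```

Restricting the check to one domain would mean adding a further pattern on ?classA or ?classB, for example requiring one of them to be a subclass of a chosen root class.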
Cheers,
Markus
There is a proposal for some kind of class disjointness: https://www.wikidata.org/wiki/Wikidata:Property_proposal/Generic#subclass It has been there for a while now; maybe a few more supporters would speed up the process :)
I think a proposal for "DisjointWith" was rejected a long time ago. But another one could pass.
On 11.11.2015 11:33, Thomas Douillard wrote:
There is a proposal for some kind of class disjointness: https://www.wikidata.org/wiki/Wikidata:Property_proposal/Generic#subclass It has been there for a while now; maybe a few more supporters would speed up the process :)
Interesting. This looks like a more complex modelling that combines "union of" with "disjoint classes". I would prefer to have the simpler modelling primitives before introducing such a shortcut.
There is also a slight mismatch between the Wikidata statement format (with main value and qualifiers) and the use for assigning a set of values (two classes, that are equally important). It's clear that we have to do something like this if we want to make such statements, but I would prefer an encoding where the classes are both in qualifiers, e.g.:
<nucleon (Q102165)> Disjoint union of SOME VALUE
    of <proton (Q2294)>
    of <neutron (Q2348)>
This is also similar to what is done in OWL, and we already have the "of" qualifier.
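One way a tool might consume such a statement is sketched below, under the assumption that the main value is "some value" and every member class sits in an "of" qualifier. The class and function names are invented for the example.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import List, Tuple


@dataclass
class DisjointUnionStatement:
    """A 'disjoint union of' statement whose members are all in 'of' qualifiers."""
    subject: str                                       # the union class
    members: List[str] = field(default_factory=list)   # values of the "of" qualifiers


def pairwise_disjoint(stmt: DisjointUnionStatement) -> List[Tuple[str, str]]:
    """Expand the statement into the class pairs it declares disjoint."""
    return list(combinations(stmt.members, 2))


# nucleon (Q102165) as the disjoint union of proton (Q2294) and neutron (Q2348).
nucleon = DisjointUnionStatement("Q102165", ["Q2294", "Q2348"])
print(pairwise_disjoint(nucleon))
```

Keeping both classes in qualifiers means neither is privileged as the main value, so the expansion treats them symmetrically.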
Will add this comment.
I think a proposal for "DisjointWith" was rejected a long time ago. But another one could pass.
Yes, I think we should revisit this decision in the light of the new requirements and our grown experience in working with Wikidata.
Markus
Magnus,
We are seeing more and more of these problematic merges. See: http://tinyurl.com/ovutz5x for the current list of (today 61) problems. Are these coming from the wikidata game?
All of the editors performing the merges seem to be new and the edit patterns seem to match the game. I thought the edits were tagged with a statement about them coming from the game, but I don't see that. If they are, could you just take genes and proteins out of the 'potential merge' queue? I'm guessing that their frequently very similar names are putting many of them into the list.
We are starting to work on a bot to combat this, but would like to stop the main source of the damage if it's possible to detect it. This is making Wikipedia integration more challenging than it already is...
thanks -Ben
Hi Ben,
looked at the first result from your query. Merge was done by a new user, who seems to have an "interest" in biology: https://www.wikidata.org/wiki/Special:Contributions/H%C3%AA_de_tekhn%C3%AA_m...
Second result, another user with no user page, same pattern: https://www.wikidata.org/wiki/Special:Contributions/Nguyenld
Note that both do other things to items in the "realm", so it doesn't appear to be my game. Both have done multiple merges.
Haven't looked at more results. Will look into game mods anyway.
Cheers, Magnus
Isn't Magnus Manske's game tagging the edit with "Widar"? I do not see that tag on the edits of, for instance, the user Hê de tekhnê makrê.
I must say, being a wannabe bioinformatician, that the gene/protein data in Wikidata can be confusing. Take https://www.wikidata.org/wiki/Q14907009 which had a merging problem (that I have tried to resolve).
Even before merging https://www.wikidata.org/w/index.php?title=Q14907009&oldid=261061025 this human gene had three gene products "cyclin-dependent kinase inhibitor 2A", "P14ARF" (which to me looked like a gene symbol, I changed it to p14ARF), and "Tumor suppressor ARF". One of them is a mouse protein. One of the others link to http://www.uniprot.org/uniprot/Q8N726 Here the recommended name is "Tumor suppressor ARF" while alternative names are "Cyclin-dependent kinase inhibitor 2A" and "p14ARF". To me it seems that one gene codes two proteins that can be referred to by the same name.
I hope my edits haven't made more damage than good. Several P1889s would be nice.
I think, as someone suggested, that adding P1889 and having Wikibase merging looking at P1889 would be a solution.
/Finn
On 11/10/2015 12:34 AM, Benjamin Good wrote:
Magnus,
We are seeing more and more of these problematic merges. See http://tinyurl.com/ovutz5x for the current list of problems (61 as of today). Are these coming from the Wikidata game?
All of the editors performing the merges seem to be new, and the edit patterns seem to match the game. I thought the edits were tagged with a statement about them coming from the game, but I don't see that. If they are coming from the game, could you just take genes and proteins out of the 'potential merge' queue? I'm guessing that their frequently very similar names are putting many of them into the list.
We are starting to work on a bot to combat this, but would like to stop the main source of the damage if it's possible to detect it. This is making Wikipedia integration more challenging than it already is...
thanks -Ben
FWIW, checked again. Neither game can merge two items that link to each other. So, if the protein is "expressed by" the gene, that pair will not even be suggested.
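The rule described here can be sketched as follows. This is an illustration and not the games' actual code: items are assumed to be dicts with an "id" and a "claims" mapping from property ID to target IDs, and "P_encoded_by" is a placeholder standing in for whatever property links a protein to its gene.

```python
def links_to(source, target_id):
    """True if any statement on `source` has `target_id` as its value."""
    return any(
        target_id in targets
        for targets in source.get("claims", {}).values()
    )


def is_linked_pair(item_a, item_b):
    """A pair is excluded from merge suggestions if either item links to the other."""
    return links_to(item_a, item_b["id"]) or links_to(item_b, item_a["id"])


# Hypothetical items; the property ID is a placeholder, not a real Wikidata property.
gene = {"id": "Q1", "claims": {}}
protein = {"id": "Q2", "claims": {"P_encoded_by": ["Q1"]}}
related_protein = {"id": "Q3", "claims": {}}
```

Here the pair (gene, protein) would never be suggested, but (gene, related_protein) still could be, since nothing connects those two items directly.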
Finn,
Thanks, I know the gene-protein thing is confusing. The example you raise there shows nicely why things are set up the way they are. One of the challenges is that there are so many related but fundamentally different things to deal with that any reliance on human-readable names leads almost immediately to data disaster. This is why we have been working hard on bringing in all the various unique identifier properties for these items.
(The link to the mouse protein was a mistake. The bot seems to have had some mouse-related problems lately, and Andra is working to fix them.)
-Ben
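One way a watchdog bot could lean on identifiers rather than names is sketched below. It is not the Gene Wiki bot: the item layout (a dict with "id" and a "claims" mapping) is assumed, and so is the choice of P351 (Entrez Gene ID) and P352 (UniProt protein ID) as the marker properties. The identifier values are placeholders.

```python
GENE_ID_PROPS = {"P351"}     # assumed: Entrez Gene ID
PROTEIN_ID_PROPS = {"P352"}  # assumed: UniProt protein ID


def looks_merged(item):
    """Flag an item that carries both a gene identifier and a protein identifier."""
    props = set(item.get("claims", {}))
    return bool(props & GENE_ID_PROPS) and bool(props & PROTEIN_ID_PROPS)


def find_suspect_items(items):
    """Return the IDs of items that look like a gene and a protein merged together."""
    return [item["id"] for item in items if looks_merged(item)]


# Hypothetical items with placeholder identifier values.
items = [
    {"id": "Q1", "claims": {"P351": ["gene-id-1"]}},
    {"id": "Q2", "claims": {"P352": ["protein-id-1"]}},
    {"id": "Q3", "claims": {"P351": ["gene-id-2"], "P352": ["protein-id-2"]}},
]
```

An item such as Q3 would be a candidate for review or revert, since under the model described in this thread gene and protein identifiers belong on separate items.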
@Ben, be careful about
"any reliance on human readable names leads almost immediately to data disaster"
It has been practically an axiom of the generics database field since at least the 1980s, in the works of people like Lenat and Guha, that this is the case.
When I tell this to people in person, though, they are immediately disappointed. You can see it in their faces: they have a visceral reaction to the thought that they have to use strange numbers for everything. The original :BaseKB used mids in any place where it was obvious to use a mid, and by doing so I knew that I had preserved the structures that were there and had the entity resolution 100% right.
Terms like P1889 become insider secrets, like all the coded genes and proteins, or the use of terms like A15 by the Situationist International, or R6 by Scientology. Communities can form around them, but they become a barrier to people who are less a part of the community.
For systems like this to get really mainstream we need some way to bridge this chasm. That could be more intelligent "context-sensitive" languages that sit somewhere between natural and computer languages, and also interfaces that use any means necessary to reduce the cognitive load of needing to learn and remember not just P1889, which will be notorious, but all of the other predicates that are the reason we come to Wikidata.