One approach might be to see whether the merge script could check for property P1889 "different from", and refuse to merge any pairs marked in that way.
Then you could specify that all of the gene items were intentionally different from the protein items.
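As a rough illustration of that check, here is a minimal sketch, assuming the merge script has both items available as dicts in the standard Wikibase entity JSON layout. The names `item_targets`, `merge_allowed` and `make_item`, and the Q-ids in the example, are hypothetical and not taken from any existing merge tool.

```python
DIFFERENT_FROM = "P1889"  # "different from", the property suggested above


def item_targets(entity, prop):
    """Return the set of item IDs that `entity` points to via `prop`."""
    targets = set()
    for statement in entity.get("claims", {}).get(prop, []):
        snak = statement.get("mainsnak", {})
        if snak.get("snaktype") != "value":
            continue  # skip "unknown value" / "no value" snaks
        value = snak["datavalue"]["value"]
        targets.add(value.get("id") or "Q{}".format(value["numeric-id"]))
    return targets


def merge_allowed(source, target):
    """Refuse the merge if either item is marked as different from the other."""
    if target["id"] in item_targets(source, DIFFERENT_FROM):
        return False
    if source["id"] in item_targets(target, DIFFERENT_FROM):
        return False
    return True


def make_item(qid, different_from=()):
    """Build a tiny entity dict for the example (hypothetical helper)."""
    claims = {}
    if different_from:
        claims[DIFFERENT_FROM] = [
            {"mainsnak": {"snaktype": "value",
                          "datavalue": {"type": "wikibase-entityid",
                                        "value": {"entity-type": "item",
                                                  "id": other}}}}
            for other in different_from
        ]
    return {"id": qid, "claims": claims}


# Made-up IDs: a gene item marked as different from its protein item.
gene = make_item("Q1000001", different_from=["Q1000002"])
protein = make_item("Q1000002")
print(merge_allowed(gene, protein))  # False
print(merge_allowed(protein, gene))  # False (checked in both directions)
```

Checking both directions means the statement only has to be added on one side, for example on the gene items.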
-- James.
On 28/10/2015 18:10, Benjamin Good wrote:
Yup, that is characteristic of our problem. The last proteinboxbot edit was about the protein item. This query also works (it finds things that are both a subclass of gene and a subclass of protein):
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
SELECT * WHERE {
  ?gene wdt:P279 wd:Q7187 .
  ?gene wdt:P279 wd:Q8054 .
}
On Wed, Oct 28, 2015 at 11:04 AM, Finn Årup Nielsen fn@imm.dtu.dk wrote:
The SPARQL below returns 14 items.
Among them is https://www.wikidata.org/wiki/Q238509, which is "5-HT1A receptor human gene" in English and "5-HT₁A-Rezeptor Protein" in German. The last editor is ProteinBoxBot. It is coded by itself. That item has a split personality, so it seems that we need to do some cleaning.
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX v: <http://www.wikidata.org/prop/statement/>
PREFIX q: <http://www.wikidata.org/prop/qualifier/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?item WHERE {
  ?item wdt:P352 ?uniprot ;
        wdt:P353 ?genesymbol .
}
I now see that Teugnhausen has also merged https://www.wikidata.org/wiki/Special:Contributions/Teugnhausen
/Finn
On 10/28/2015 06:07 PM, Benjamin Good wrote:
The Gene Wiki team is experiencing a problem that may suggest some areas for improvement in the general wikidata experience.
When our project was getting started, we had some fairly long public debates about how we should structure the data we wanted to load [1]. These resulted in a data model that, we think, remains pretty much true to the semantics of the data, at the cost of distributing information about closely related things (genes, proteins, orthologs) across multiple, interlinked items. Now, as long as these semantic links between the different item classes are maintained, this is working out great. However, we are consistently seeing people merging items that our model needs to be distinct. Most commonly, we see people merging items about genes with items about the protein product of the gene (e.g. [2]). This happens nearly every day - especially on items related to the more popular Wikipedia articles. (More examples [3])
Merges like this, as well as other semantics-breaking edits, make it very challenging to build downstream apps (like the wikipedia infobox) that depend on having certain structures in place. My question to the list is how to best protect the semantic models that span multiple entity types in wikidata? Related to this, is there an opportunity for some consistent way of explaining these structures to the community when they exist?
I guess the immediate solutions are to (1) write another bot that watches for model-breaking edits and reverts them and (2) to create an article on wikidata somewhere that succinctly explains the model and links back to the discussions that went into its creation.
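For option (1), the detection half of such a bot could start from a check like the one below. This is only a sketch under assumptions: it takes an item as a dict in the standard Wikibase entity JSON layout, and it applies the two symptoms that come up elsewhere in this thread (an item that is a subclass of both gene, Q7187, and protein, Q8054, or an item that carries both P352 and P353). The function names and the example item ID are hypothetical, and fetching recent changes and reverting edits are left out entirely.

```python
SUBCLASS_OF = "P279"
GENE = "Q7187"
PROTEIN = "Q8054"
UNIPROT_ID = "P352"   # bound to ?uniprot in the query in this thread
GENE_SYMBOL = "P353"  # bound to ?genesymbol in the query in this thread


def item_values(entity, prop):
    """Return the set of item IDs used as values of `prop` on `entity`."""
    values = set()
    for statement in entity.get("claims", {}).get(prop, []):
        snak = statement.get("mainsnak", {})
        if snak.get("snaktype") != "value":
            continue  # skip "unknown value" / "no value" snaks
        value = snak["datavalue"]["value"]
        values.add(value.get("id") or "Q{}".format(value["numeric-id"]))
    return values


def model_violations(entity):
    """List the ways an item mixes gene and protein statements."""
    problems = []
    claims = entity.get("claims", {})
    parents = item_values(entity, SUBCLASS_OF)
    if GENE in parents and PROTEIN in parents:
        problems.append("subclass of both gene and protein")
    if claims.get(UNIPROT_ID) and claims.get(GENE_SYMBOL):
        problems.append("has both a UniProt ID and a gene symbol")
    return problems


def item_statement(qid):
    """Build a minimal item-valued statement for the example (hypothetical)."""
    return {"mainsnak": {"snaktype": "value",
                         "datavalue": {"type": "wikibase-entityid",
                                       "value": {"entity-type": "item",
                                                 "id": qid}}}}


# Made-up item that looks like the result of a gene/protein merge.
merged = {"id": "Q1000001",
          "claims": {SUBCLASS_OF: [item_statement(GENE),
                                   item_statement(PROTEIN)]}}
print(model_violations(merged))  # ['subclass of both gene and protein']
```

A bot built on this would still need a policy for what to do with a flagged item; reporting it for human review is a gentler first step than reverting automatically.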
It seems that anyone that works beyond a single entity type is going to face the same kind of problems, so I'm posting this here in hopes that generalizable patterns (and perhaps even supporting code) can be realized by this community.
[1] https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Molecular_biology#Di...
[2] https://www.wikidata.org/w/index.php?title=Q417782&oldid=262745370
[3] https://s3.amazonaws.com/uploads.hipchat.com/25885/699742/rTrv5VgLm5yQg6z/me...
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
-- Finn Årup Nielsen http://people.compute.dtu.dk/faan/