It sounds like Tom and James have basically the same idea for our particular problem, which I would support: enable a warning in the merge script when incompatible types are detected.  These would have to be encoded somehow though - presumably in the property constraints.  

Tom, a couple points of clarification:
>Tom>>>Another issue is that, as soon as you start trying to split things out into semantically clean pieces, you immediately run afoul of the notability restrictions. Because human (and mouse) genes don't have their own Wikipedia pages, they're clearly not notable, so they can't be added to Wikidata.

That hasn't been a problem for us so far.  We have already added all human and all mouse genes to wikidata and people seem okay with that.  

>Tom>>>  For all languages except English, it's the protein Wikidata item [1] that points to the corresponding Wikipedia page, while for English it's the gene item [2] that points to the corresponding English article [3].

I don't think that this is universally true, though it is true in many cases.  It happened because the original imports from Wikipedia tagged the Wikidata items about genes/proteins as proteins.  We converted all the English sitelinks that we knew about programmatically, but shied away from doing that for all the other languages.  



On Wed, Oct 28, 2015 at 12:22 PM, Tom Morris <tfmorris@gmail.com> wrote:
BTW, merges aren't the only problem.  For all languages except English, it's the protein Wikidata item [1] that points to the corresponding Wikipedia page, while for English it's the gene item [2] that points to the corresponding English article [3].

[1] https://www.wikidata.org/wiki/Q13561329
[2] https://www.wikidata.org/wiki/Q414043
[3] https://en.wikipedia.org/wiki/Reelin


On Wed, Oct 28, 2015 at 3:08 PM, Tom Morris <tfmorris@gmail.com> wrote:
This is a deep-seated semantic confusion going back to at least 2006 [1], when the Protein Infobox had Entrez and OMIM gene IDs.  Freebase naively adopted this confusion in its initial protein schema in 2007 when it was importing from those infoboxes.  Although it made some progress in improving the schema later, anything not aligned with how Wikipedians want to do things is shoveling against the tide.  It's also very difficult to manage equivalences when Wikipedia articles are about multiple things, like the protein/gene articles.

If you look at the recent merge of Reelin [3], you can see that it was done by the same user who contributed substantially to the article back in 2006 [4], so, as the "owner" of that article, they clearly know what's best.  :-) It's going to be very difficult to get people to unlearn a decade of habits.

Another issue is that, as soon as you start trying to split things out into semantically clean pieces, you immediately run afoul of the notability restrictions. Because human (and mouse) genes don't have their own Wikipedia pages, they're clearly not notable, so they can't be added to Wikidata.

This problem of chunking by notability (or lack thereof), article length, relatedness, and other attributes rather than semantic individuality is much more widespread than just proteins/genes.  It also affects things like pairs (or small sets) of people who aren't notable enough to have an article on their own, articles which contain infoboxes about people who aren't notable, so they got tacked onto a related article to give them a home, etc.

The inverse problem exists as well where a single semantic topic is broken up into multiple articles purely for reasons of length.  Other types of semantic mismatches include articles along precoordinated facets like Transportation in New York City (or even History of Transportation in New York City!), list articles (* Filmography, * Discography, * Videography, List of *).  Of course, some lists, like the Fortune 500, make sense to talk about as entities, but most Wikipedia lists are just mechanically generated things for human browsing which don't really need a semantic identifier.  Freebase deleted most of this Wikipedia cruft.

Going back to Ben's original problem, one tool that Freebase used to help manage the problem of incompatible type merges was a set of curated sets of incompatible types [5] which was used by the merge tools to warn users that the merge they were proposing probably wasn't a good idea.  People could ignore the warning in the Freebase implementation, but Wikidata could make it a hard restriction or just a warning.
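To make the mechanism Tom describes concrete, here is a minimal sketch of what such a curated incompatible-types guard could look like.  The QIDs are real Wikidata classes (Q7187 = gene, Q8054 = protein, Q5 = human, Q16521 = taxon), but the table and the function are purely illustrative, not an existing Freebase or Wikidata tool:

```python
# Hypothetical sketch of a Freebase-style "incompatible types" merge guard:
# a curated table of class pairs that should never end up on one item,
# consulted before a merge is applied.

INCOMPATIBLE_TYPE_PAIRS = {
    frozenset({"Q7187", "Q8054"}),   # gene vs. protein
    frozenset({"Q5", "Q16521"}),     # human vs. taxon
}

def merge_warning(types_a, types_b):
    """Return the incompatible (type_a, type_b) pairs found between the
    instance-of (P31) values of two items proposed for merging."""
    conflicts = []
    for ta in types_a:
        for tb in types_b:
            if frozenset({ta, tb}) in INCOMPATIBLE_TYPE_PAIRS:
                conflicts.append((ta, tb))
    return conflicts

# Merging a gene item into a protein item should trigger a warning:
print(merge_warning({"Q7187"}, {"Q8054"}))  # [('Q7187', 'Q8054')]
```

Whether the merge tool treats a non-empty result as a hard block or just a warning is then a policy decision, as Tom notes.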

Tom



On Wed, Oct 28, 2015 at 1:07 PM, Benjamin Good <ben.mcgee.good@gmail.com> wrote:
The Gene Wiki team is experiencing a problem that may suggest some areas for improvement in the general wikidata experience.  

When our project was getting started, we had some fairly long public debates about how we should structure the data we wanted to load [1].  These resulted in a data model that, we think, remains pretty much true to the semantics of the data, at the cost of distributing information about closely related things (genes, proteins, orthologs) across multiple, interlinked items.  Now, as long as these semantic links between the different item classes are maintained, this is working out great.  However, we are consistently seeing people merging items that our model needs to be distinct.  Most commonly, we see people merging items about genes with items about the protein product of the gene (e.g. [2]).  This happens nearly every day - especially on items related to the more popular Wikipedia articles. (More examples [3])

Merges like this, as well as other semantics-breaking edits, make it very challenging to build downstream apps (like the Wikipedia infobox) that depend on having certain structures in place.  My question to the list is how best to protect semantic models that span multiple entity types in Wikidata.  Related to this, is there an opportunity for some consistent way of explaining these structures to the community when they exist?  

I guess the immediate solutions are to (1) write another bot that watches for model-breaking edits and reverts them and (2) to create an article on wikidata somewhere that succinctly explains the model and links back to the discussions that went into its creation.  
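A bot along the lines of option (1) could flag suspect items cheaply by looking at which identifier properties an item carries after an edit.  The property IDs below are real (P351 = Entrez Gene ID, P594 = Ensembl gene ID, P352 = UniProt protein ID, P705 = Ensembl protein ID), but the function and the simplified claims structure are a sketch of the idea, not working bot code:

```python
# Sketch of a model-guarding check a bot could run after each edit.
# An item mixing gene identifiers with protein identifiers is a strong
# signal that a gene item and a protein item were merged.

GENE_PROPS = {"P351", "P594"}      # Entrez Gene ID, Ensembl gene ID
PROTEIN_PROPS = {"P352", "P705"}   # UniProt protein ID, Ensembl protein ID

def looks_like_bad_merge(claims):
    """claims: dict mapping property IDs to statement lists, shaped like
    the 'claims' field returned by the wbgetentities API.  Returns True
    if the item mixes gene identifiers with protein identifiers."""
    props = set(claims)
    return bool(props & GENE_PROPS) and bool(props & PROTEIN_PROPS)

# A merged gene+protein item would carry both kinds of identifier:
print(looks_like_bad_merge({"P351": [], "P352": []}))  # True
print(looks_like_bad_merge({"P352": []}))              # False
```

Flagged items could then be queued for automatic un-merging or human review, rather than reverted blindly.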

It seems that anyone who works beyond a single entity type is going to face the same kinds of problems, so I'm posting this here in hopes that generalizable patterns (and perhaps even supporting code) can be realized by this community.

  

_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata



