This is a deep-seated semantic confusion going back to at least 2006 [1], when the Protein Infobox had Entrez and OMIM gene IDs. Freebase naively adopted this in its initial protein schema [2] in 2007, when it was importing from those infoboxes. Although it made some progress in improving the schema later, anything not aligned with how Wikipedians want to do things is shoveling against the tide. It's also very difficult to manage equivalences when Wikipedia articles are about multiple things, like the protein/gene articles.

If you look at the recent merge of Reelin [3], you can see that it was done by the same user who contributed substantially to the article back in 2006 [4], so, as the "owner" of that article, they clearly know what's best. :-) It's going to be very difficult to get people to unlearn a decade of habits.

Another issue is that, as soon as you start trying to split things out into semantically clean pieces, you immediately run afoul of the notability restrictions. Because human (and mouse) genes don't have their own Wikipedia pages, they're clearly not notable, so they can't be added to Wikidata.

This problem of chunking by notability (or lack thereof), length of text article, relatedness, and other attributes rather than semantic individuality is much more widespread than just proteins/genes. It also affects things like pairs (or small sets) of people who aren't notable enough to have an article on their own, articles which contain infoboxes about people who aren't notable and so got tacked onto a related article to give them a home, etc.

The inverse problem exists as well, where a single semantic topic is broken up into multiple articles purely for reasons of length. Other types of semantic mismatches include articles along precoordinated facets, like Transportation in New York City (or even History of Transportation in New York City!), and list articles (* Filmography, * Discography, * Videography, List of *). Of course, some lists, like the Fortune 500, make sense to talk about as entities, but most Wikipedia lists are just mechanically generated things for human browsing which don't really need a semantic identifier. Freebase deleted most of this Wikipedia cruft.

Going back to Ben's original problem, one tool that Freebase used to help manage incompatible type merges was a curated collection of sets of incompatible types [5], which the merge tools used to warn users that the merge they were proposing probably wasn't a good idea. People could ignore the warning in the Freebase implementation, but Wikidata could make it either a hard restriction or just a warning.
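
To make the idea concrete, here is a minimal sketch of how such a check might look. It is not the Freebase implementation; the function names, the hard_block flag, and the example sets (other than gene/protein) are made up for illustration.

```python
# Hypothetical curated sets: the types in each set should never end up
# together on one item. Only gene/protein comes from this thread; the
# second set is invented for illustration.
INCOMPATIBLE_TYPE_SETS = [
    frozenset({"gene", "protein"}),
    frozenset({"person", "location"}),
]


def merge_conflicts(types_a, types_b, incompatible_sets=INCOMPATIBLE_TYPE_SETS):
    """Return the curated sets that merging the two items would violate."""
    conflicts = []
    for group in incompatible_sets:
        from_a = set(types_a) & group
        from_b = set(types_b) & group
        # A conflict needs both items to contribute a member of the group,
        # and the merged item to carry at least two different members.
        if from_a and from_b and len(from_a | from_b) > 1:
            conflicts.append(group)
    return conflicts


def check_merge(types_a, types_b, hard_block=False):
    """Classify a proposed merge as 'ok', 'warning', or 'blocked'."""
    if not merge_conflicts(types_a, types_b):
        return "ok"
    # hard_block=False mimics a warning the user may ignore;
    # hard_block=True mimics a hard restriction.
    return "blocked" if hard_block else "warning"


if __name__ == "__main__":
    print(check_merge({"gene"}, {"protein"}))
    print(check_merge({"gene"}, {"protein"}, hard_block=True))
    print(check_merge({"gene"}, {"gene"}))
```

The hard_block switch is just the choice mentioned above: the same check can drive either an ignorable warning or a refusal to merge.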

Tom

[1] https://en.wikipedia.org/w/index.php?title=Reelin&diff=56108806&oldid=56101233

[2] http://www.freebase.com/biology/protein/entrez_gene_id

[3] https://www.wikidata.org/w/index.php?title=Q414043&type=revision&diff=262778265&oldid=262243280

[4] https://en.wikipedia.org/w/index.php?title=Reelin&dir=prev&action=history

[5] http://www.freebase.com/dataworld/incompatible_types?instances=

On Wed, Oct 28, 2015 at 1:07 PM, Benjamin Good <ben.mcgee.good@gmail.com> wrote:

The Gene Wiki team is experiencing a problem that may suggest some areas for improvement in the general Wikidata experience.

When our project was getting started, we had some fairly long public debates about how we should structure the data we wanted to load [1]. These resulted in a data model that, we think, remains pretty much true to the semantics of the data, at the cost of distributing information about closely related things (genes, proteins, orthologs) across multiple, interlinked items. Now, as long as these semantic links between the different item classes are maintained, this is working out great. However, we are consistently seeing people merging items that our model needs to be distinct. Most commonly, we see people merging items about genes with items about the protein product of the gene (e.g. [2]). This happens nearly every day, especially on items related to the more popular Wikipedia articles. (More examples: [3])

Merges like this, as well as other semantics-breaking edits, make it very challenging to build downstream apps (like the Wikipedia infobox) that depend on having certain structures in place. My question to the list is how best to protect the semantic models that span multiple entity types in Wikidata. Related to this, is there an opportunity for some consistent way of explaining these structures to the community when they exist?

I guess the immediate solutions are (1) to write another bot that watches for model-breaking edits and reverts them and (2) to create an article on Wikidata somewhere that succinctly explains the model and links back to the discussions that went into its creation.

It seems that anyone that works beyond a single entity type is going to face the same kind of problems, so I'm posting this here in hopes that generalizable patterns (and perhaps even supporting code) can be realized by this community.

[1] https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Molecular_biology#Distinguishing_between_genes_and_proteins
[2] https://www.wikidata.org/w/index.php?title=Q417782&oldid=262745370
[3] https://s3.amazonaws.com/uploads.hipchat.com/25885/699742/rTrv5VgLm5yQg6z/mergelist.txt

_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata