[Wikidata] Genes, proteins, and bad merges in general

14 Jun 2016

Bad merges have been mentioned a couple of times recently, and I think one
of the contexts was Ben's gene/protein work.

I think there are two general issues here which could be improved:

1. Merging is too easy. Because splitting/unmerging is much harder than
merging, particularly after additional edits, the process should be biased
to make merging more difficult.

2. The impedance mismatch between Wikidata and Wikipedias tempts
Wikipedians who are new to Wikidata to do the wrong thing.

The second is a community education issue which will hopefully improve over
time, but the first could be improved, in my opinion, by requiring more
than one person to approve a merge. The Freebase scheme was that duplicate
topics could be flagged for merge by anyone, but instead of merging, they'd
be placed in a queue for voting. Unanimous votes would cause merges to be
automatically processed. Conflicting votes would get bumped to a second
level queue for manual handling. This wasn't foolproof, but caught a lot of
the naive "these two things have the same name, so they must be the same
thing" merge proposals by newbies. There are lots of variations that could
be implemented, but the general idea is to get more than one pair of eyes
involved.
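
As a rough illustration of the voting queue described above, here is a
minimal Python sketch. It is not Freebase's actual implementation: the
class and field names, the two-vote threshold, the rule that the proposer
cannot vote, and the handling of unanimous "no" votes are all assumptions
made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PENDING = "pending"      # still waiting for enough votes
    MERGED = "merged"        # unanimous yes: merge processed automatically
    REJECTED = "rejected"    # unanimous no (assumed; not described in the post)
    ESCALATED = "escalated"  # conflicting votes: second-level manual queue


@dataclass
class MergeProposal:
    """A flagged duplicate pair waiting in the voting queue (hypothetical)."""
    source_id: str
    target_id: str
    proposer: str
    votes: dict = field(default_factory=dict)  # voter name -> True/False

    def vote(self, voter: str, approve: bool) -> None:
        # Assumption: flagging is not a vote, so the proposer cannot approve
        # their own proposal. This forces a second pair of eyes.
        if voter == self.proposer:
            raise ValueError("proposer cannot vote on their own proposal")
        self.votes[voter] = approve

    def status(self, required_votes: int = 2) -> Status:
        decisions = set(self.votes.values())
        if len(decisions) > 1:
            # Any disagreement goes to manual handling.
            return Status.ESCALATED
        if len(self.votes) < required_votes:
            return Status.PENDING
        return Status.MERGED if decisions == {True} else Status.REJECTED


# A newbie flags the SPATA5 gene and protein items as duplicates.
proposal = MergeProposal("Q18052679", "Q21207860", proposer="newbie")
proposal.vote("alice", True)   # fooled by the shared name
proposal.vote("bob", False)    # notices gene vs. protein
print(proposal.status())       # Status.ESCALATED
```

The point of the sketch is only that no single account can complete a
merge: a proposal either collects agreeing votes from other people or ends
up in front of a human reviewer.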

A specific instance of the structural impedance mismatch is enwiki's
handling of genes & proteins. Sometimes they have a page for each, but
often they have a single page that deals with both or, worse, a page whose
text says it's about the protein, but where the page includes a gene infobox.

This unanswered RFC from Oct 2015 asks whether protein & gene should be
merged:
https://www.wikidata.org/wiki/Wikidata:Requests_for_comment/Oxytocin_and_OX…

I recently ran across a similar situation where the Wikidata gene item for
SPATA5 https://www.wikidata.org/wiki/Q18052679 is linked to an enwiki page
about the associated protein https://en.wikipedia.org/wiki/SPATA5, while
the Wikidata protein item https://www.wikidata.org/wiki/Q21207860 is not
linked to any wikis.

These differences in handling make the reconciliation process very
difficult, and the resulting errors encourage erroneous merges. The
gene/protein case probably needs multiple fixes, but making merges harder
would help.

Tom
