Genes, proteins, and bad merges in general

Tom Morris

14 Jun 2016

8:53 p.m.

Bad merges have been mentioned a couple of times recently, and I think one of the contexts was Ben's gene/protein work.

I think there are two general issues here which could be improved:

1. Merging is too easy. Because splitting/unmerging is much harder than merging, particularly after additional edits, the process should be biased to make merging more difficult.

2. The impedance mismatch between Wikidata and the Wikipedias tempts Wikipedians who are new to Wikidata to do the wrong thing.

The second is a community education issue which will hopefully improve over time, but the first could be improved, in my opinion, by requiring more than one person to approve a merge. The Freebase scheme was that duplicate topics could be flagged for merge by anyone, but instead of merging, they'd be placed in a queue for voting. Unanimous votes would cause merges to be automatically processed. Conflicting votes would get bumped to a second level queue for manual handling. This wasn't foolproof, but caught a lot of the naive "these two things have the same name, so they must be the same thing" merge proposals by newbies. There are lots of variations that could be implemented, but the general idea is to get more than one pair of eyes involved.
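The voting scheme described above can be sketched roughly as follows. This is only an illustration, not the actual Freebase implementation: the class and field names, the `required_votes` threshold, and the handling of a unanimous "no" are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Outcome(Enum):
    PENDING = "pending"        # not enough votes yet
    AUTO_MERGE = "auto_merge"  # unanimous yes: merge is processed automatically
    REJECTED = "rejected"      # unanimous no (assumed behaviour, not from the thread)
    ESCALATED = "escalated"    # conflicting votes: second-level queue


@dataclass
class MergeProposal:
    """A flagged duplicate waiting in the voting queue (hypothetical model)."""
    source_id: str
    target_id: str
    proposer: str
    votes: dict = field(default_factory=dict)  # voter name -> True (merge) / False

    def vote(self, voter: str, approve: bool) -> None:
        self.votes[voter] = approve

    def outcome(self, required_votes: int = 2) -> Outcome:
        ballots = set(self.votes.values())
        if len(ballots) > 1:
            # Any disagreement sends the proposal to manual handling.
            return Outcome.ESCALATED
        if len(self.votes) < required_votes:
            return Outcome.PENDING
        return Outcome.AUTO_MERGE if ballots == {True} else Outcome.REJECTED


proposal = MergeProposal("Q1", "Q2", proposer="alice")
proposal.vote("bob", True)
proposal.vote("carol", True)
print(proposal.outcome())
```

The point of the design is that a single person flagging a duplicate never triggers the merge directly; at least `required_votes` other opinions are needed, and any disagreement routes the case to a human.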

A specific instance of the structural impedance mismatch is enwiki's handling of genes & proteins. Sometimes they have a page for each, but often they have a single page that deals with both or, worse, a page whose text says it's about the protein, but where the page includes a gene infobox.

This unanswered RFC from Oct 2015 asks whether protein & gene should be merged: https://www.wikidata.org/wiki/Wikidata:Requests_for_comment/Oxytocin_and_OXT...

I recently ran across a similar situation where this Wikidata gene SPATA5 https://www.wikidata.org/wiki/Q18052679 is linked to an enwiki page about the associated protein https://en.wikipedia.org/wiki/SPATA5, while the Wikidata protein is not linked to any wikis https://www.wikidata.org/wiki/Q21207860

These differences in handling make the reconciliation process very difficult, and the resulting errors encourage erroneous merges. The gene/protein case probably needs multiple fixes, but making merges harder would help.

Tom

Benjamin Good

14 Jun 2016

10:20 p.m.

Hi Tom,

I think the example you have there is actually linked up properly at the moment? https://en.wikipedia.org/wiki/SPATA5 is about both the gene and the protein, as are most Wikipedia articles of this nature. It is linked to the gene, the way we encourage modeling (https://www.wikidata.org/wiki/Q18052679), and indeed the protein item is not linked to a Wikipedia article, again following our preferred pattern.

For the moment... _our_ merge problem seems to be mostly resolved. Correcting the sitelinks on the non-English Wikipedias in a big batch seemed to help slow the flow dramatically. We have also introduced some flexibility into the Lua code that produces infobox_gene on Wikipedia. It can handle most of the possible situations (e.g. Wikipedia linked to protein, Wikipedia linked to gene) automatically, so that helps prevent visible disasters.

On the main issue you raise about merges, I'm a little on the fence. Generally I'm opposed to putting constraints in place that slow people down, e.g. we have a lot of manual merge work that needs to be done in the medical arena, and I do appreciate that the current process is pretty fast. I guess I would advocate a focus on making the interface more vehemently educational as a first step, e.g. lots of 'are you sure' forms to click through, but ultimately still letting people get their work done without enforcing an approval process.

-Ben

Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata

Gerard Meijssen

11:08 p.m.

Hoi, I add "many" entries. As a consequence I make the occasional mistake. Typically I find them myself and rectify. When you interfere with that, I can no longer sort out the mess I make. That is fine. It is then for someone else to fix. Thanks, GerardM

Tom Morris

15 Jun 2016

1:39 a.m.

Hi Gerard. There's often a tension between supporting "power users" and the regular users, but in this case, I left out a little nuance: if you flagged an item created by yourself for either deletion or merger and no one else had edited it in the meantime, the operation was processed automatically without having to go through the voting process. This allowed everyone to fix their own mistakes quickly. Finding the right balance for these processes typically takes a little tuning.

I forgot to mention another aspect of the current merge process that I think is dangerous and that I've seen cause problems: merge "games." High-impact operations like merges seem like a particularly poor fit for gamification, particularly when there's no safety net such as a second set of eyes.
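As a rough sketch of that self-service exception (the `Item` model and the function name are hypothetical, chosen only for illustration), the check amounts to "the requester created the item and nobody else has touched it":

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """Minimal stand-in for an item and its edit history (hypothetical model)."""
    item_id: str
    creator: str
    editors: list = field(default_factory=list)  # users who edited the item, in order


def can_skip_review(item: Item, requester: str) -> bool:
    """True when the requester created the item and nobody else has edited it."""
    if requester != item.creator:
        return False
    return all(editor == requester for editor in item.editors)


own_item = Item("Q100", creator="gerard", editors=["gerard", "gerard"])
touched_item = Item("Q101", creator="gerard", editors=["gerard", "tom"])
print(can_skip_review(own_item, "gerard"))      # no one else edited: process directly
print(can_skip_review(touched_item, "gerard"))  # someone else edited: go to the queue
```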

Tom

Tom Morris

1:31 a.m.

Hi Ben. On reflection, I think the SPATA5 page is more about the gene than the protein, despite the lead (and only) sentence, which has the protein as its subject. The other example from the RFC (Neurophysin I https://en.wikipedia.org/wiki/Neurophysin_I) seems less clear-cut to me, since the text is entirely about the protein, while the infobox is the only thing talking about the gene. I generally discount infoboxes since they can have little to do with the main subject of the page (e.g. a civil war battle page with infoboxes about the opposing generals or the associated NRHP place).

Despite the textual confusion, I realized that there isn't really much merge risk here since the "encodes" property linking the gene and the protein should prevent any attempted merge from happening.

Tom

Tony Bowden

8:49 a.m.

On 14 June 2016 at 18:53, Tom Morris tfmorris@gmail.com wrote:

...

A specific instance of the structural impedance mismatch is enwiki's handling of genes & proteins. Sometimes they have a page for each, but often they have a single page that deals with both or, worse, a page whose text says it's about the protein, but where the page includes a gene infobox.

This is also a problem with pages on elections. It's very common for national-level elections to be for more than one thing at the same time — e.g. a Presidential election, and a Parliamentary one. In most Wikipedias there will only be a single page for, say, "Brazilian general election, 2014", though occasionally you'll get separate pages in _some_ languages for (for example) "Brazilian legislative election, 2010" and "Brazilian presidential election, 2010" (also split in pt:), whereas those will be combined in other languages (de:Wahlen in Brasilien 2010 / pl:Wybory powszechne w Brazylii w 2010 roku).

Mostly this material hasn't had a lot of attention yet on Wikidata, so it's not _too_ hard to split out separate pages for each conceptually different thing, each of which is 'part of' a wider 'general election'. For an added twist, the legislative elections are often themselves for multiple houses (e.g. the Assembly and Senate) simultaneously, and those almost never have distinct Wikipedia pages.

I have seen at least one case though where someone then merged two of these, presumably (although I didn't dig into it deeply enough to be sure) because each of the Wikidata pages mapped to a single page in "their" Wikipedia. Thankfully this doesn't appear to have been too common an occurrence yet, but that's potentially just because very few of them have even been split up in the first place. (Currently I'm largely just picking off the lower-hanging fruit of making sure that each of the national elections in the world over the last hundred years or so even has a basic Wikidata entry *at all*.) I'm hoping that such merges would be less likely in cases where each of the individual Wikidata pages had quite rich information on candidates, turnout, winners, etc. But it's comparatively difficult to even semi-automate the import of statements like that, and many of the existing pages already have confusing combined data, presumably from Wikipedia infobox imports from such mismatched pages. So it's likely that it'll take quite a while for that to happen, and I fear that there will thus be quite a long period where it will be tempting for people to mis-merge pages.

Tony

Magnus Manske

11:51 a.m.

On Tue, Jun 14, 2016 at 6:54 PM Tom Morris tfmorris@gmail.com wrote:

...

Bad merges have been mentioned a couple of times recently, and I think one of the contexts was Ben's gene/protein work.

I think there are two general issues here which could be improved:

Merging is too easy. Because splitting/unmerging is much harder than merging, particularly after additional edits, the process should be biased to make merging more difficult.

A technical solution could be to prevent merging of two items (or at least, show a warning), if one of the items links to the other.
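A minimal sketch of such a check, assuming the items are available as dictionaries shaped like Wikidata's entity JSON (a `claims` map of property ID to statements, each with a `mainsnak`). The function names are made up for the example, the sample data is hand-written rather than fetched, and P688 is used on the assumption that it is the "encodes" property; this is not how any existing merge tool is known to work.

```python
def linked_item_ids(entity: dict) -> set:
    """Collect the IDs of all items used as main values in an entity's statements."""
    ids = set()
    for statements in entity.get("claims", {}).values():
        for statement in statements:
            datavalue = statement.get("mainsnak", {}).get("datavalue", {})
            if datavalue.get("type") == "wikibase-entityid":
                item_id = datavalue.get("value", {}).get("id")
                if item_id:
                    ids.add(item_id)
    return ids


def merge_blocked(a: dict, b: dict) -> bool:
    """True if either item has a statement pointing at the other."""
    return b["id"] in linked_item_ids(a) or a["id"] in linked_item_ids(b)


# Hand-written toy data: a gene item with one statement pointing at a protein item.
gene = {
    "id": "Q18052679",
    "claims": {
        "P688": [  # assumed to be "encodes"
            {
                "mainsnak": {
                    "snaktype": "value",
                    "property": "P688",
                    "datavalue": {
                        "type": "wikibase-entityid",
                        "value": {"entity-type": "item", "id": "Q21207860"},
                    },
                }
            }
        ]
    },
}
protein = {"id": "Q21207860", "claims": {}}

if merge_blocked(gene, protein):
    print("Refusing to merge (or warning): one item links to the other.")
```

The check is symmetric, so it does not matter which item is the merge source and which is the target. A softer variant could return the linking property IDs so the interface can explain the warning.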