@Ben,  be careful about

"any reliance on human readable names leads almost immediately to data disaster"

It has been practically an axiom of the generic database field, since at least the 1980s in the work of people like Lenat and Guha, that this is the case.

When I tell this to people in person, though, they are immediately disappointed.  You can see it in their faces: they have a visceral reaction to the thought that they have to use strange numbers for everything.  The original :BaseKB used mids (Freebase machine identifiers) anywhere it was obvious to use a mid, and by doing so I knew that I had preserved the structures that were there and had the entity resolution 100% right.

Terms like P1889 become insider secrets, like all the coded genes and proteins, or the use of terms like A15 by the Situationist International, or R6 by Scientology.  Communities can form around them, but they become a barrier to people outside the community.

For systems like this to get really mainstream we need some way to bridge this chasm.  That could be more intelligent "context-sensitive" languages that sit somewhere between natural and computer languages, and also interfaces that use any means necessary to reduce the cognitive load of needing to learn and remember not just P1889, which will be notorious, but all of the other predicates which are the reason we come to Wikidata.
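
As a rough illustration of the interface side of this, here is a minimal sketch that keeps the identifiers as the stored form but shows a label next to each one when text is displayed.  The LABELS table and the annotate() function are hypothetical names made up for this example, and the table is filled in by hand; a real interface would fetch labels from Wikidata.

```python
import re

# Hand-filled label table (an assumption for this sketch; a real tool
# would look labels up rather than hard-code them).
LABELS = {
    "P1889": "different from",
    "P31": "instance of",
    "P279": "subclass of",
}


def annotate(text, labels=LABELS):
    """Show 'label (Pnnn)' for every known property ID found in text.

    The ID stays visible and stays the thing that is stored; the label
    is only a reading aid.  Unknown IDs are left untouched.
    """
    def replace(match):
        pid = match.group(0)
        label = labels.get(pid)
        return f"{label} ({pid})" if label else pid

    return re.sub(r"\bP\d+\b", replace, text)


print(annotate("adding P1889 to the gene item"))
```

The point of the design is that nothing relies on the name: the label can change or be translated and the data is unaffected.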

On Tue, Nov 10, 2015 at 11:38 AM, Benjamin Good <ben.mcgee.good@gmail.com> wrote:
Finn,

Thanks, I know the gene-protein thing is confusing.  The example you raise there shows nicely why things are set up the way they are.  One of the challenges is that there are so many related, but fundamentally different things to deal with that any reliance on human readable names leads almost immediately to data disaster.  This is why we have been working hard on bringing in all the various unique identifier properties for these items.

(The link to the mouse protein was a mistake.  The bot seems to have had some mouse-related problems lately; Andra is working to fix them.)

-Ben

On Tue, Nov 10, 2015 at 2:18 AM, Finn Årup Nielsen <fn@imm.dtu.dk> wrote:
Isn't Magnus Manske's game tagging the edit with "Widar"? I do not see that for, for instance, the user Hê de tekhnê makrê.

I must say, being a wannabe bioinformatician, that the gene/protein data in Wikidata can be confusing. Take https://www.wikidata.org/wiki/Q14907009 which had a merging problem (that I have tried to resolve).

Even before merging https://www.wikidata.org/w/index.php?title=Q14907009&oldid=261061025 this human gene had three gene products: "cyclin-dependent kinase inhibitor 2A", "P14ARF" (which to me looked like a gene symbol, so I changed it to p14ARF), and "Tumor suppressor ARF". One of them is a mouse protein. One of the others links to http://www.uniprot.org/uniprot/Q8N726 where the recommended name is "Tumor suppressor ARF" while the alternative names are "Cyclin-dependent kinase inhibitor 2A" and "p14ARF". To me it seems that one gene codes for two proteins that can be referred to by the same name.

I hope my edits haven't done more harm than good. Several P1889s would be nice.

I think, as someone suggested, that adding P1889 and having Wikibase merging look at P1889 would be a solution.


/Finn


On 11/10/2015 12:34 AM, Benjamin Good wrote:
Magnus,

We are seeing more and more of these problematic merges.  See:
http://tinyurl.com/ovutz5x for the current list of (today 61) problems.
Are these coming from the wikidata game?

All of the editors performing the merges seem to be new and the edit
patterns seem to match the game.  I thought the edits were tagged with a
statement about them coming from the game, but I don't see that?  If
they are, could you just take genes and proteins out of the 'potential
merge' queue?  I'm guessing that their frequently very similar names
are putting many of them into the list.

We are starting to work on a bot to combat this, but would like to stop
the main source of the damage if it's possible to detect it.  This is
making Wikipedia integration more challenging than it already is...

thanks
-Ben


On Wed, Oct 28, 2015 at 3:41 PM, Magnus Manske
<magnusmanske@googlemail.com> wrote:

    I fear my games may contribute to both problems (merging two items,
    and adding a sitelink to the wrong item). Both are facilitated by
    identical names/aliases, and sometimes it's hard to tell that a pair
    is meant to be different, especially if you don't know about the
    intricate structures of the respective knowledge domain.

    An item-specific, but somewhat heavy-handed approach would be to
    prevent merging of any two items where at least one has P1889, no
    matter what it specifically points to. At least, give a warning that
    an item is "merge-protected", and require an additional override for
    the merge.

    If that is acceptable, it would be easy for me to filter all items
    with P1889, from the merge game at least.
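
A minimal sketch of the merge protection described above, assuming items are plain dictionaries with an "id" and a "claims" mapping from property IDs to lists of values.  That structure, the function names and the placeholder IDs (Q_GENE, Q_PROTEIN, Q_OTHER) are all hypothetical; this is not the Wikibase data model or API.

```python
DIFFERENT_FROM = "P1889"  # the "different from" property


class MergeProtectedError(Exception):
    """Raised when a merge touches a protected item without an override."""


def is_merge_protected(item):
    """Protected if the item has at least one P1889 statement,
    no matter what that statement points to."""
    return bool(item.get("claims", {}).get(DIFFERENT_FROM))


def check_merge(item_a, item_b, override=False):
    """Return True if the merge may proceed.

    Raises MergeProtectedError when either item is protected and the
    caller has not passed the additional override.
    """
    protected = [i["id"] for i in (item_a, item_b) if is_merge_protected(i)]
    if protected and not override:
        raise MergeProtectedError(
            "merge-protected item(s): " + ", ".join(protected)
        )
    return True


def filter_merge_candidates(pairs):
    """Game-side filter: drop every pair where either item has P1889."""
    return [
        (a, b)
        for a, b in pairs
        if not (is_merge_protected(a) or is_merge_protected(b))
    ]


# Placeholder items for illustration only.
gene = {"id": "Q_GENE", "claims": {"P1889": ["Q_PROTEIN"]}}
protein = {"id": "Q_PROTEIN", "claims": {}}
other = {"id": "Q_OTHER", "claims": {}}

candidates = filter_merge_candidates([(gene, protein), (protein, other)])
print([(a["id"], b["id"]) for a, b in candidates])
```

Note that the check is deliberately heavy-handed in the same way as the proposal: the gene item blocks a merge with any item, not only with the one its P1889 statement names.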

    On Wed, Oct 28, 2015 at 8:50 PM Peter F. Patel-Schneider
    <pfpschneider@gmail.com> wrote:

        On 10/28/2015 12:08 PM, Tom Morris wrote:
        [...]
         > Going back to Ben's original problem, one tool that Freebase used to help
         > manage the problem of incompatible type merges was a set of curated sets of
         > incompatible types [5] which was used by the merge tools to warn users that
         > the merge they were proposing probably wasn't a good idea.  People could
         > ignore the warning in the Freebase implementation, but Wikidata could make it
         > a hard restriction or just a warning.
         >
         > Tom

        I think that this idea is a good one.  The incompatibility information could
        be added to classes in the form of "this class is disjoint from that other
        class".  Tools would then be able to look for this information and produce
        warnings or even have stronger reactions to proposed merging.

        I'm not sure that using P1889 "different from" is going to be adequate.  What
        links would be needed?  Just between a gene and its protein?  That wouldn't
        catch merging a gene and a related protein.  Between all genes and all
        proteins?  It seems to me that this is better handled at the class level.

        peter
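
A minimal sketch of the class-level check discussed above, assuming each item comes with a set of class names and that a curated set of disjoint class pairs exists.  The DISJOINT_CLASSES table, the class names and the function names are hypothetical and only stand in for whatever curated data a real tool would use.

```python
# Curated "this class is disjoint from that other class" pairs.
# Hypothetical content for this sketch.
DISJOINT_CLASSES = {
    frozenset({"gene", "protein"}),
}


def disjoint_conflicts(classes_a, classes_b, disjoint=DISJOINT_CLASSES):
    """Return the (class_a, class_b) pairs that are declared disjoint."""
    return sorted(
        (a, b)
        for a in classes_a
        for b in classes_b
        if frozenset({a, b}) in disjoint
    )


def merge_warning(classes_a, classes_b):
    """Return a warning string for a proposed merge, or None if no
    declared disjointness is violated."""
    conflicts = disjoint_conflicts(classes_a, classes_b)
    if not conflicts:
        return None
    described = "; ".join(f"{a} is disjoint from {b}" for a, b in conflicts)
    return "Proposed merge crosses disjoint classes: " + described


print(merge_warning({"gene"}, {"protein"}))
```

One declaration on the classes covers every gene/protein pair, including a gene and a merely related protein, without any item-to-item links.  Whether the result is shown as a warning or enforced as a hard restriction is left to the caller.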


        _______________________________________________
        Wikidata mailing list
        Wikidata@lists.wikimedia.org
        https://lists.wikimedia.org/mailman/listinfo/wikidata





--
Finn Årup Nielsen
http://people.compute.dtu.dk/faan/





--
Paul Houle

Applying Schemas for Natural Language Processing, Distributed Systems, Classification and Text Mining and Data Lakes

(607) 539 6254    paul.houle on Skype   ontology2@gmail.com

:BaseKB -- Query Freebase Data With SPARQL
