Hi Marielle!

I replied to your post at https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Molecular_biology#Let.27s_Do_SNPs.21 but wanted to respond quickly here as well.

I also want this data in Wikidata, but having just attended the American Society of Human Genetics annual meeting, my suggestion is to be a little patient. A lot of people are working hard to standardize the nomenclature for variant identification (and facing all the problems you describe), and I don't think it will take long to stabilize. (Famous last words... but a lot of people are building tools that depend on this happening.) Once that is accomplished, we ought to be able to use the standard IDs to anchor all the Wikidata items for variants.

In my opinion, this is a battle best fought over at the Human Genome Variation Society forum (http://www.hgvs.org/mutnomen/) and then applied within Wikidata, rather than the other way around.

In the meantime, I'd encourage you to keep modeling all the claims you would want to see that use variant entities, as you have already started doing.

My two cents.
-Ben
 

On Sun, Oct 26, 2014 at 1:59 PM, Marielle Volz <marielle.volz@gmail.com> wrote:
This is awesome!

I'd love to have all SNPs on Wikidata as well, and I started a discussion
about this on WikiProject Molecular biology:
https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Molecular_biology#Let.27s_Do_SNPs.21

I think this would be amazing, because single nucleotide polymorphisms
relate genes to human diseases and traits, both of which are already
on Wikidata.

So for instance, we now have the gene
https://www.wikidata.org/wiki/Q18028243 which encodes the protein
product https://www.wikidata.org/wiki/Q1738190, and we have the SNP
https://www.wikidata.org/wiki/Q18341737 in that gene, which is
implicated in the disease https://www.wikidata.org/wiki/Q5712506.

This way we can get a fuller picture from Wikidata of how changes in
genes and gene products are related to the traits and diseases that
are already there.
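
As a rough sketch of what that fuller picture could look like in practice, the chain above can be written as subject/property/value triples and walked from gene to disease. The item IDs are the ones linked above; the property labels ("encodes", "located in gene", "implicated in") and the function name are made up for illustration and are not real Wikidata properties.

```python
# Triples built from the example items above.
# The property labels are hypothetical placeholders, not actual Wikidata properties.
TRIPLES = [
    ("Q18028243", "encodes", "Q1738190"),          # gene -> protein product
    ("Q18341737", "located in gene", "Q18028243"), # SNP -> gene
    ("Q18341737", "implicated in", "Q5712506"),    # SNP -> disease
]

def diseases_for_gene(gene, triples):
    """Return the diseases reachable from a gene through any SNP located in it."""
    snps = {s for s, p, o in triples if p == "located in gene" and o == gene}
    return sorted(o for s, p, o in triples if p == "implicated in" and s in snps)

print(diseases_for_gene("Q18028243", TRIPLES))
```

The point is only that once SNPs are items, a gene-to-disease question becomes a two-step lookup through the SNP.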

There are some things I'm really not sure how to handle, however. Each
SNP is a *location*, and in a diploid organism each location has two
values, each one of 4 possible nucleotides (A, G, T, C), and each
combination of values may result in the same protein or a different
one. In the case of the Kell antigen system, the rs8176058 location can
be either A or G. An A at this location codes for the 'K' antigen
(protein), and a G encodes the 'k' antigen. This makes it difficult to
represent the information in a single "table", because the common
variations at the location carry information that needs to be grouped
together.
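
To make the grouping problem concrete, here is a minimal sketch of the Kell example as a per-allele lookup. The allele-to-antigen mapping comes from the description above; the names `ANTIGEN_BY_ALLELE` and `antigens_for_genotype` are invented for this example, and it assumes each allele contributes its antigen independently of the other.

```python
# Allele -> antigen at rs8176058, as described above.
ANTIGEN_BY_ALLELE = {"A": "K", "G": "k"}

def antigens_for_genotype(genotype):
    """Return the antigens encoded by a two-letter genotype such as 'AG'.

    Assumes each allele contributes its own antigen independently.
    """
    return sorted({ANTIGEN_BY_ALLELE[allele] for allele in genotype})

for genotype in ("AA", "AG", "GG"):
    print(genotype, antigens_for_genotype(genotype))
```

All of this information hangs off one location, which is why a flat one-row-per-SNP table does not fit it well.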

In this case, it's simply the presence of an A or G that determines
the gene product, but of course it can get more complicated: we
might not know the "value" of A or G individually, and may only have
"values" for each genotype (AG, AA, or GG) that need to be
represented. And these genotypes might not always point to a specific
gene product; they may instead point to a qualitative trait
("increased risk of glaucoma") or a quantitative trait ("vision was 0.2
diopters greater on average").
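
For a sense of how many genotype "values" one location can carry, the following sketch enumerates the unordered genotypes for a set of observed alleles. The function name is hypothetical, and it assumes AG and GA are treated as the same genotype.

```python
from itertools import combinations_with_replacement

def unordered_genotypes(alleles):
    """Return every unordered diploid genotype for the alleles seen at one location."""
    return ["".join(pair) for pair in combinations_with_replacement(sorted(alleles), 2)]

# rs8176058 is described above as being either A or G.
print(unordered_genotypes({"A", "G"}))   # ['AA', 'AG', 'GG']

# If all four nucleotides were observed at a location, there would be 10 genotypes.
print(len(unordered_genotypes("ACGT")))  # 10
```

Each of those genotypes could need its own trait annotations, in addition to anything known about the individual alleles.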

The two options are:

Create a separate WD item for each "option", e.g. "rs8176058-A" to
hold information about variant A at location rs8176058 (or, when
information is known about the genotype, "AG genotype at
rs8176058")

OR

Allow each option ("A" or "AG") to be annotated with various fields.
The complication is that each annotation may itself need to be
annotated (and I don't think that's possible on WD) if we have multiple
pieces of quantitative information associated with one genotype. Hard
to say.
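
The two options can be sketched side by side with a toy data model. Everything here is an assumption for illustration: the `Item` and `Statement` classes are a simplification, not the Wikidata data model, and the property labels ("variant of", "encodes", "has allele") are invented placeholders.

```python
from dataclasses import dataclass, field

@dataclass
class Statement:
    prop: str
    value: str
    # Qualifiers are a flat mapping here: a qualifier cannot carry its own qualifiers.
    qualifiers: dict = field(default_factory=dict)

@dataclass
class Item:
    label: str
    statements: list = field(default_factory=list)

# Option 1: a separate item for each allele.
option_1 = [
    Item("rs8176058-A", [Statement("variant of", "rs8176058"),
                         Statement("encodes", "K antigen")]),
    Item("rs8176058-G", [Statement("variant of", "rs8176058"),
                         Statement("encodes", "k antigen")]),
]

# Option 2: one item for the location, with each allele annotated in place.
option_2 = Item("rs8176058", [
    Statement("has allele", "A", {"encodes": "K antigen"}),
    Statement("has allele", "G", {"encodes": "k antigen"}),
])

def products_option_1(items):
    """Map allele -> product by reading the 'encodes' statement on each allele item."""
    return {item.label.split("-")[1]: s.value
            for item in items for s in item.statements if s.prop == "encodes"}

def products_option_2(item):
    """Map allele -> product by reading the 'encodes' qualifier on each allele statement."""
    return {s.value: s.qualifiers["encodes"]
            for s in item.statements if s.prop == "has allele"}

print(products_option_1(option_1) == products_option_2(option_2))
```

Both layouts hold the same facts for this simple case. The difference shows up when an annotation needs detail of its own (an effect size with its own p-value and reference, say): under option 1 that is just another statement on the allele item, while under option 2 it would have to be squeezed into the flat qualifier mapping.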

To see how this data is represented in table form elsewhere, you can
browse the GWAS catalog:

http://www.genome.gov/page.cfm?pageid=26525384&clearquery=1#result_table

Importing that might be a good start. There it looks something like this:

Risk allele: rs1230666-A
Effect: 0.0269 [0.014-0.039] unit increase
Implicated in: Serum thyroid peroxidase antibody levels
p-value: 2 x 10^-8
reference: Medici M
February 27, 2014
PLoS Genet
Identification of novel genetic Loci associated with thyroid
peroxidase antibodies and clinical thyroid disease.
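
If that record were imported, a first step would be splitting the risk allele into its SNP ID and nucleotide. The sketch below does that for the example above; the field names, the `RiskAllele` type, and the pattern (an rs number, a hyphen, one of A/C/G/T) are assumptions based only on this one record, not on the catalog's actual export format.

```python
import re
from typing import NamedTuple

class RiskAllele(NamedTuple):
    rsid: str
    allele: str

def parse_risk_allele(text):
    """Split a string such as 'rs1230666-A' into its SNP ID and nucleotide."""
    match = re.fullmatch(r"(rs\d+)-([ACGT])", text.strip())
    if match is None:
        raise ValueError(f"unrecognized risk allele: {text!r}")
    return RiskAllele(match.group(1), match.group(2))

# The example record above, with hypothetical field names.
record = {
    "risk_allele": parse_risk_allele("rs1230666-A"),
    "effect": 0.0269,
    "effect_interval": (0.014, 0.039),
    "trait": "Serum thyroid peroxidase antibody levels",
    "p_value": 2e-8,
}

print(record["risk_allele"].rsid, record["risk_allele"].allele)
```

Anything that does not match the assumed pattern raises an error rather than being guessed at, so odd entries would surface during an import instead of slipping through.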

On Fri, Oct 24, 2014 at 1:24 AM, Lydia Pintscher
<lydia.pintscher@wikimedia.de> wrote:
> Hey folks :)
>
> Blog post is now available at
> http://blog.wikimedia.de/2014/10/22/establishing-wikidata-as-the-central-hub-for-linked-open-life-science-data/
> Thanks Benjamin and Andra!
>
>
> Cheers
> Lydia
>
> --
> Lydia Pintscher - http://about.me/lydia.pintscher
> Product Manager for Wikidata
>
> Wikimedia Deutschland e.V.
> Tempelhofer Ufer 23-24
> 10963 Berlin
> www.wikimedia.de
>
> Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
>
> Registered in the register of associations of the Amtsgericht
> Berlin-Charlottenburg under number 23855 Nz. Recognized as a non-profit
> by the Finanzamt für Körperschaften I Berlin, tax number 27/681/51985.
>
> _______________________________________________
> Wikidata-l mailing list
> Wikidata-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata-l
