Re: [Wikidata-l] How can I increase the throughput of ProteinBoxBot?

18 Oct 2014

The suggestion to use wbeditentity was great. It took me some time to get
used to the call, but once I managed, the speed-up was substantial: we also
finished adding the mouse genome yesterday. It took only 2 days to complete,
in contrast to the weeks needed for the human genome. The suggestion to use
wbeditentity really made my day.
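For anyone making the same switch, here is a minimal sketch of the kind of single-call request I mean. It is not the bot's actual code: the helper name build_wbeditentity_params, the label and the token value are placeholders, and it only builds the POST parameters without sending anything.

```python
import json


def build_wbeditentity_params(label, entrez_id, csrf_token):
    """Build POST parameters that create one item, with its English label
    and an Entrez Gene ID (P351) claim, in a single wbeditentity call."""
    data = {
        "labels": {"en": {"language": "en", "value": label}},
        "claims": [
            {
                "mainsnak": {
                    "snaktype": "value",
                    "property": "P351",
                    # P351 holds a string value, so the ID is sent as text.
                    "datavalue": {"value": str(entrez_id), "type": "string"},
                },
                "type": "statement",
                "rank": "normal",
            }
        ],
    }
    return {
        "action": "wbeditentity",
        "new": "item",  # create a new item instead of editing an existing one
        "data": json.dumps(data),
        "token": csrf_token,
        "format": "json",
    }


# Placeholder values; a real run needs a valid edit token from the API.
params = build_wbeditentity_params("example gene label", 1017, "<csrf token>")
```

The gain comes from the label and all claims travelling in one request, rather than one request per label or claim.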
Adding the mouse genome to Wikidata did, however, result in ~1000
duplicates [1].
The issue is that items already existed with an identical identifier, which
resulted in unique value violations [2].
In our current approach we can't prevent this, since the gene description
is currently the key. We are looking into ways to use the identifier as the
key instead of the label, as we do now. The simplest option would be to add
the identifier as an alias, but it would be ideal if we could apply the same
check as the one generating the constraint violations before adding a
new item. Is this possible? Can a bot query for a claim P351 with a given
value (e.g. 1017)?
Any input would be appreciated.
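To make the question concrete, this is roughly the check we would like to run before creating an item. It is only a sketch: existing_items is a hypothetical, hand-made stand-in for whatever lookup turns out to be available, the item IDs are placeholders, and the function names are made up for illustration.

```python
ENTREZ_GENE_ID = "P351"


def index_by_identifier(existing_items):
    """Map each Entrez Gene ID to the item IDs that already carry it."""
    index = {}
    for item_id, claims in existing_items.items():
        for value in claims.get(ENTREZ_GENE_ID, []):
            index.setdefault(value, []).append(item_id)
    return index


def plan_edit(index, entrez_id):
    """Decide what to do for one gene, keyed on the identifier, not the label."""
    matches = index.get(str(entrez_id), [])
    if not matches:
        return ("create", None)       # no item has this identifier yet
    if len(matches) == 1:
        return ("update", matches[0])  # reuse the existing item
    return ("review", matches)         # already duplicated; needs a human


# Hypothetical data: placeholder item IDs, not real Wikidata items.
existing_items = {
    "Q_EXAMPLE_1": {"P351": ["1017"]},
    "Q_EXAMPLE_2": {"P351": ["12566"]},
}
index = index_by_identifier(existing_items)
```

With such an index, plan_edit(index, 1017) would point at the existing item instead of creating a second one. What we are missing is a way to build or query that index from Wikidata itself.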
Regards,
Andra
[1] https://www.wikidata.org/wiki/User_talk:Andrawaag#.7E_1000_duplicates
[2] https://www.wikidata.org/wiki/Wikidata:Database_reports/Constraint_violation...
On Tue, Sep 30, 2014 at 9:05 PM, Daniel Kinzler <daniel.kinzler@wikimedia.de> wrote:
What makes it so slow?
Note that you can use wbeditentity to perform complex edits with a single API
call. It's not as straightforward to use as, say, wbaddclaim, but much more
powerful and efficient.
-- daniel
On 30.09.2014 19:00, Andra Waagmeester wrote:
Hi All,
I have joined the development team of the ProteinBoxBot
(https://www.wikidata.org/wiki/User:ProteinBoxBot). Our goal is to make
Wikidata the canonical resource for referencing and translating identifiers
for genes and proteins from different species.
Currently, adding all genes from the human genome and their related
identifiers to Wikidata takes more than a month to complete. With the
objective of adding other species, as well as having frequent updates for
each of the genomes, it would be convenient if we could increase this
throughput.
Would it be acceptable to increase the throughput by running multiple
instances of ProteinBoxBot in parallel? If so, what would be an acceptable
number of parallel instances of a bot to run? We can run multiple instances
from different geographical locations if necessary.
Kind regards,
Andra

Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l
--
Daniel Kinzler
Senior Software Developer
Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

