Hi All,
I have joined the development team of the ProteinBoxBot (https://www.wikidata.org/wiki/User:ProteinBoxBot). Our goal is to make Wikidata the canonical resource for referencing and translating identifiers for genes and proteins from different species.
Currently, adding all genes from the human genome and their related identifiers to Wikidata takes more than a month to complete. With the objective of adding other species, as well as updating each of the genomes frequently, it would be convenient if we could increase this throughput.
Would it be acceptable if we increased the throughput by running multiple instances of ProteinBoxBot in parallel? If so, what would be an accepted number of parallel instances of a bot to run? We can run multiple instances from different geographical locations if necessary.
Kind regards,
Andra
What makes it so slow?
Note that you can use wbeditentity to perform complex edits with a single API call. It's not as straightforward to use as, say, wbcreateclaim, but much more powerful and efficient.
-- daniel
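To illustrate Daniel's point, here is a minimal sketch of what a single wbeditentity call could look like from a Python bot. It assumes an already authenticated requests.Session and a valid CSRF edit token; the property and value shown are illustrative only.

```python
# Sketch: one wbeditentity call creates the item and adds a claim in one edit,
# instead of one API call per label, alias or claim.
# `session` is assumed to be an authenticated requests.Session and `csrf_token`
# a valid edit token obtained from action=query&meta=tokens.
import json

API = "https://www.wikidata.org/w/api.php"

def create_gene_item(session, csrf_token, label, entrez_id):
    data = {
        "labels": {"en": {"language": "en", "value": label}},
        "claims": [
            {   # Entrez Gene ID (P351), a string-valued property
                "mainsnak": {
                    "snaktype": "value",
                    "property": "P351",
                    "datavalue": {"value": entrez_id, "type": "string"},
                },
                "type": "statement",
                "rank": "normal",
            }
        ],
    }
    response = session.post(API, data={
        "action": "wbeditentity",
        "new": "item",             # create a new item in the same call
        "data": json.dumps(data),
        "token": csrf_token,
        "bot": 1,
        "format": "json",
    })
    return response.json()
```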
That's very cool! To get an idea, how big is your dataset?
There are about 47,000 genes. In the first step the bot checks whether an entry already exists; if not, a new entry is made and three claims are subsequently added (Entrez Gene ID (P351) https://www.wikidata.org/wiki/Property:P351#top, found in taxon (P703) https://www.wikidata.org/wiki/Property:P703#top, subclass of (P279)), as well as synonyms. Currently this process takes a week to complete. In a second phase, identifiers for each gene are obtained and added as respective claims; as a rough estimate, this ranges from 1 up to 20 claims per property. The bot has been running for 2 weeks and currently covers 74% of all genes.
Currently each entity creation and each subsequent claim is a separate API call, so using wbeditentity will probably result in an improvement. Thanks for the suggestion.
Cheers, Andra
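As a sketch of what the first-phase data for one such wbeditentity call might look like, the payload below bundles the label, the synonyms (as aliases) and the three claims into a single edit. The item ids used for found in taxon and subclass of (Q15978631 for Homo sapiens, Q7187 for gene) are assumptions to double-check before use.

```python
# Sketch of a wbeditentity "data" payload covering the whole first phase of the
# bot in one edit. Property and item ids are assumptions, not a tested recipe.

def item_snak(prop, numeric_id):
    """Mainsnak pointing at another item, e.g. found in taxon (P703)."""
    return {
        "snaktype": "value",
        "property": prop,
        "datavalue": {
            "value": {"entity-type": "item", "numeric-id": numeric_id},
            "type": "wikibase-entityid",
        },
    }

def string_snak(prop, value):
    """Mainsnak with a plain string value, e.g. Entrez Gene ID (P351)."""
    return {
        "snaktype": "value",
        "property": prop,
        "datavalue": {"value": value, "type": "string"},
    }

def gene_payload(label, synonyms, entrez_id):
    return {
        "labels": {"en": {"language": "en", "value": label}},
        "aliases": {"en": [{"language": "en", "value": s} for s in synonyms]},
        "claims": [
            {"mainsnak": string_snak("P351", entrez_id),  # Entrez Gene ID
             "type": "statement", "rank": "normal"},
            {"mainsnak": item_snak("P703", 15978631),     # found in taxon: Homo sapiens (assumed Q15978631)
             "type": "statement", "rank": "normal"},
            {"mainsnak": item_snak("P279", 7187),         # subclass of: gene (assumed Q7187)
             "type": "statement", "rank": "normal"},
        ],
    }
```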
When running the bot, please keep in mind the change dispatch lag, and ensure it doesn't get too high:
https://www.wikidata.org/wiki/Special:DispatchStats
which can also be accessed via the API:
https://www.wikidata.org/w/api.php?action=query&meta=siteinfo&siprop...
Cheers, Katie
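A sketch of how a bot could poll those dispatch statistics and pause while the lag is high. The siprop value (truncated in the URL above) and the JSON path to the lag figure are assumptions to be filled in from the real response.

```python
# Sketch: pause the bot while the change dispatch lag is above a threshold.
# The siprop value and the key holding the lag are placeholders; adjust them
# to match what the siteinfo call linked above actually returns.
import time
import requests

API = "https://www.wikidata.org/w/api.php"
LAG_THRESHOLD = 300   # seconds; arbitrary example threshold
session = requests.Session()

def dispatch_lag_seconds():
    response = session.get(API, params={
        "action": "query",
        "meta": "siteinfo",
        "siprop": "statistics",   # placeholder: use the siprop from the URL above
        "format": "json",
    })
    info = response.json().get("query", {})
    # placeholder lookup: point this at wherever the dispatch lag appears
    return info.get("statistics", {}).get("dispatch-lag", 0)

def throttle():
    while dispatch_lag_seconds() > LAG_THRESHOLD:
        time.sleep(60)   # wait for the dispatchers to catch up before editing again
```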
Hey,
Currently each entity creation and each subsequent claim is a separate API call, so using wbeditentity will probably result in an improvement. Thanks for the suggestion.
I second that suggestion. It should definitely not take two weeks or more to add a mere 50k items. In case your bot is written in PHP, or could easily use some PHP, then [0] and [1] are probably of use to you.
[0] https://github.com/wmde/WikibaseDataModel [1] https://github.com/wmde/WikibaseDataModelSerialization
Cheers
-- Jeroen De Dauw - http://www.bn2vs.com Software craftsmanship advocate Evil software architect at Wikimedia Germany ~=[,,_,,]:3
The suggestion to use wbeditentity was great. It took me some time to get used to the call, but I finally managed, and the optimisation was substantial. So substantial that we also finished including the mouse genome yesterday: it took only 2 days to complete, in contrast to the weeks needed for the human genome. The suggestion to use wbeditentity really made my day.
Adding the mouse genome to Wikidata did, however, result in ~1000 duplicates. [1]
The issue is that items already existed with an identical identifier, which resulted in unique value violations. [2]
In our current approach we can't prevent this, since the gene description is currently the key. We are looking into ways to use the identifier as the key, instead of the label as we do now. The simplest option would be to add the identifier as an alias, but it would be ideal if we could use the same algorithm as the one generating the constraint violations before adding a new item. Is this possible? Can a bot query for a claim P351 with a given value (e.g. 1017)?
Any input would be appreciated.
Regards,
Andra
[1] https://www.wikidata.org/wiki/User_talk:Andrawaag#.7E_1000_duplicates [2] https://www.wikidata.org/wiki/Wikidata:Database_reports/Constraint_violation...
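One way to answer that last question is to look the identifier up before creating an item. The sketch below uses a SPARQL query against the Wikidata Query Service and assumes that endpoint is available to the bot; an alternative would be a local P351-to-item map built from a dump before the run.

```python
# Sketch: check whether an item with a given Entrez Gene ID (P351) value
# already exists before creating a new one.
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

def find_items_by_entrez_id(entrez_id):
    """Return the URIs of items that already carry this P351 value (empty list if none)."""
    query = 'SELECT ?item WHERE { ?item wdt:P351 "%s" . }' % entrez_id
    response = requests.get(
        SPARQL_ENDPOINT,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "ProteinBoxBot duplicate check (example sketch)"},
    )
    bindings = response.json()["results"]["bindings"]
    return [b["item"]["value"] for b in bindings]

# e.g. find_items_by_entrez_id("1017") returns the URIs of existing items with
# that value; creating a new item is only safe when the list is empty.
```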
Great work Andra! Would it be possible to add, for the Ensembl-related properties (Gene ID, Transcript ID, etc.), the Ensembl version from which these IDs were extracted (perhaps by adding a qualifier to the ID value), as Entrez seems to provide this information? Between Ensembl versions these IDs can change, in some cases very drastically: a gene can get a completely different ID, or an ID can point to a different item in two Ensembl versions. Without knowing the Ensembl version number, these IDs are not that useful, especially when trying to correlate them with older published IDs.
Thanks, Günther
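A sketch of what such a claim-plus-qualifier could look like inside the wbeditentity payload. Which property should carry the Ensembl release is exactly the open question here, so "P000" below is a placeholder rather than a real property, and P594 (Ensembl Gene ID) is the assumed main property.

```python
# Sketch of an Ensembl Gene ID claim carrying the Ensembl release as a
# qualifier. "P000" is a placeholder for whichever qualifier property the
# community settles on; the IDs in the usage comment are illustrative.

def ensembl_gene_claim(ensembl_id, release):
    return {
        "mainsnak": {
            "snaktype": "value",
            "property": "P594",              # Ensembl Gene ID (assumed)
            "datavalue": {"value": ensembl_id, "type": "string"},
        },
        "type": "statement",
        "rank": "normal",
        "qualifiers": {
            "P000": [{                       # placeholder qualifier property for the release
                "snaktype": "value",
                "property": "P000",
                "datavalue": {"value": str(release), "type": "string"},
            }],
        },
    }

# e.g. ensembl_gene_claim("ENSG00000123456", 77) records that this (illustrative)
# ID comes from Ensembl release 77.
```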
On 9/30/14 10:00 AM, Andra Waagmeester wrote:
Would it be acceptable if we increased the throughput by running multiple instances of ProteinBoxBot in parallel? If so, what would be an accepted number of parallel instances of a bot to run? We can run multiple instances from different geographical locations if necessary.
https://www.mediawiki.org/wiki/API:Etiquette recommends that you don't make parallel requests. But if you're just going to run another instance of your bot I doubt it would cause any problems.
Could you provide a link to your source code? (I didn't see one on the bot's userpage.) IIRC you might be using pywikibot, but I don't remember exactly. Other pywikibot developers and I can help you optimize your code :)
-- Legoktm
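For reference, a sketch of the serialized, maxlag-aware request pattern that API:Etiquette recommends instead of parallel requests. It assumes an authenticated requests.Session; the retry limit and sleep defaults are arbitrary.

```python
# Sketch: send edits one at a time and back off when the API reports maxlag,
# rather than increasing throughput with parallel requests.
import time
import requests

API = "https://www.wikidata.org/w/api.php"

def api_post(session, params, max_retries=5):
    params = dict(params, maxlag=5, format="json")   # ask the API to refuse edits while lagged
    for _ in range(max_retries):
        response = session.post(API, data=params)
        result = response.json()
        if result.get("error", {}).get("code") != "maxlag":
            return result
        # the servers are lagged: wait as advised and try again
        time.sleep(int(response.headers.get("Retry-After", 5)))
    raise RuntimeError("gave up after repeated maxlag responses")
```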
On 03/10/2014 22:31, Legoktm wrote:
I suppose the latest version is https://bitbucket.org/chinmay26/proteinboxbot/. Pywikibot recently introduced a very simple way of manipulating items and sending back the changes via wbeditentity. If the whole Gene Wiki Project codebase were hosted on gerrit.wikimedia.org, more people could help ;-)
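A sketch of that pywikibot route: pushing the label, aliases and a claim back in a single editEntity (wbeditentity) call instead of one call per statement. The exact data shapes editEntity accepts can differ between pywikibot versions, so treat this as an outline rather than a recipe.

```python
# Sketch: update an existing gene item with one editEntity call via pywikibot.
# The data dict mirrors the wbeditentity payloads sketched earlier in the thread.
import pywikibot

site = pywikibot.Site("wikidata", "wikidata")
repo = site.data_repository()

def update_gene_item(qid, label, synonyms, entrez_id):
    item = pywikibot.ItemPage(repo, qid)
    data = {
        "labels": {"en": {"language": "en", "value": label}},
        "aliases": {"en": [{"language": "en", "value": s} for s in synonyms]},
        "claims": [{
            "mainsnak": {
                "snaktype": "value",
                "property": "P351",          # Entrez Gene ID
                "datavalue": {"value": entrez_id, "type": "string"},
            },
            "type": "statement",
            "rank": "normal",
        }],
    }
    item.editEntity(data, summary="Updating gene item with a single wbeditentity call")
```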