I still can't find a way to query only some item properties (e.g. claims) from the Wikibase API using pywikibot. I've tried all of the following:
.get('claims')
.get(True, 'claims')
.get(force=True, 'claims')
.get('claims', force=True)
.get('claims', True)
but without success. Bots often don't need all of an item's data, so such a feature should be implemented (or, if it already exists, documented better!) PS: is it worth implementing the 'wbgetclaims' action in our framework?
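The closest workaround I can think of is to skip ItemPage.get() entirely and hit the API by hand, along these lines (just a sketch; the api.Request usage is my guess at how to pass the 'wbgetclaims' parameters through):

    import pywikibot
    from pywikibot.data import api

    # Sketch: fetch only the claims with a raw 'wbgetclaims' request.
    # 'wbgetclaims' and 'entity' are Wikibase API parameters; passing them
    # straight through api.Request like this is my assumption.
    site = pywikibot.Site('wikidata', 'wikidata')
    request = api.Request(site=site, action='wbgetclaims', entity='Q60')
    data = request.submit()            # parsed JSON with only a 'claims' key
    claims = data.get('claims', {})    # {'P31': [...], ...}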
On Fri, Jun 13, 2014 at 9:34 AM, Ricordisamoa ricordisamoa@openmailbox.org wrote:
I still can't find a way to query only some item properties (e.g. claims) from the Wikibase API using pywikibot. I've tried all of the following:
.get('claims')
.get(True, 'claims')
.get(force=True, 'claims')
.get('claims', force=True)
.get('claims', True)
but without success. Bots often don't need all of an item's data, so such a feature should be implemented (or, if it already exists, documented better!)
Are the claims a large part of the network traffic for items you are processing? Some client time might be saved by lazy loading the claim objects from _content. The claims data is even smaller when using raw revisions instead of the API JSON.
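Roughly what I mean by lazy loading, as a sketch only (whether _content and Claim.fromJSON are the right hooks for this is an assumption on my part):

    import pywikibot

    # Sketch of lazily turning the cached entity dict into Claim objects.
    # ItemPage._content and Claim.fromJSON are assumptions about the
    # framework internals here, not a tested implementation.
    class LazyClaims(object):
        def __init__(self, item):
            self._item = item
            self._claims = None

        @property
        def claims(self):
            if self._claims is None:
                raw = self._item._content.get('claims', {})
                self._claims = dict(
                    (pid, [pywikibot.Claim.fromJSON(self._item.site, c)
                           for c in statements])
                    for pid, statements in raw.items())
            return self._claims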
Often a large and unnecessary part of the item download is the labels and sitelinks, which are often full of duplicated information.
Most of the time the bot doesn't need every label and sitelink. What it does need is _one_ printable label to use in the user interface, and it wants 'the label closest to the UI language'.
For one of my tasks, I wrote a function to extract the 'most Latin label', but that depends on having all of the labels and sitelinks. It would be great if the API could provide something like this, so we could request that and not fetch every label and sitelink.
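The kind of helper I have in mind looks roughly like this (a rough sketch, not my actual function; the fallback order and the Latin-script test are just placeholders):

    # Rough sketch of a 'best printable label' helper; the fallback chain
    # and the crude Latin-script test are illustrative placeholders.
    def printable_label(labels, ui_lang='en'):
        """Pick one label, as close to the UI language as possible."""
        for code in (ui_lang, 'en'):
            if code in labels:
                return labels[code]
        for text in labels.values():
            # very rough 'mostly Latin' check
            if all(ord(ch) < 0x250 for ch in text):
                return text
        return next(iter(labels.values()), None)   # any label at all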
The PreloadingItemGenerator doesn't have any arguments to selectively query only some properties. If I don't need labels, sitelinks, aliases, or descriptions, why should I get them, wasting server- and client-side resources?
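For now the only workaround I can see is to batch the requests by hand, roughly like this (a sketch; 'ids' and 'props' belong to wbgetentities, not to anything PreloadingItemGenerator exposes):

    import pywikibot
    from pywikibot.data import api

    # Sketch of a batched, claims-only fetch done by hand; 'ids' and 'props'
    # are wbgetentities parameters, the api.Request usage is my assumption.
    site = pywikibot.Site('wikidata', 'wikidata')
    qids = ['Q60', 'Q64', 'Q90']                    # example batch
    req = api.Request(site=site, action='wbgetentities',
                      ids='|'.join(qids), props='claims')
    entities = req.submit()['entities']             # {'Q60': {...}, ...}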
On 13/06/2014 05:55, John Mark Vandenberg wrote:
Are the claims a large part of the network traffic for items you are processing? Some client time might be saved by lazy loading the claim objects from _content. The claims data is even smaller when using raw revisions instead of the API JSON.
Often a large and unnecessary part of the item download is the labels and sitelinks, which are often full of duplicated information.
Most of the time the bot doesn't need every label and sitelink. What it does need is _one_ printable label to use in the user interface, and it wants 'the label closest to the UI language'.
For one of my tasks, I wrote a function to extract the 'most Latin label', but that depends on having all of the labels and sitelinks. It would be great if the API could provide something like this, so we could request that and not fetch every label and sitelink.
Hey,
Are the claims a large part of the network traffic for items you are processing? Some client time might be saved by lazy loading the claim objects from _content. The claims data is even smaller when using raw revisions instead of the API JSON.
Is the size of the serialization something that is causing problems?
Cheers
--
Jeroen De Dauw - http://www.bn2vs.com
Software craftsmanship advocate
Evil software architect at Wikimedia Germany
~=[,,_,,]:3
On Fri, Jun 13, 2014 at 2:51 PM, Jeroen De Dauw jeroendedauw@gmail.com wrote:
Hey,
Are the claims a large part of the network traffic for items you are processing? Some client time might be saved by lazy loading the claim objects from _content. The claims data is even smaller when using raw revisions instead of the API JSON.
Is the size of the serialization something that is causing problems?
Not serious problems IMO. e.g. Q60 is 54K via the API, but that is <10K gzipped:
https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q60&lang...
Fetching only the claims 'almost' halves the network traffic, but that makes the pywikibot API cache less efficient if labels or sitelinks are also fetched:
https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q60&lang...
If a bot is only working with Wikidata and a single language wiki, this is the 'optimal' query, which is 5.7K gzipped:
https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q60&lang...
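The interesting parameters get cut off in that URL, but the query has roughly this shape (illustrative values only; props, languages and sitefilter are standard wbgetentities parameters):

    # Illustrative reconstruction of such a trimmed-down query; the exact
    # values are guesses, the parameter names are standard wbgetentities ones.
    params = {
        'action': 'wbgetentities',
        'ids': 'Q60',
        'props': 'claims|labels|sitelinks',   # only the parts the bot uses
        'languages': 'en',                    # the single UI language
        'sitefilter': 'enwiki',               # the single client wiki
        'format': 'json',
    }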
Prefetching many items also reduces the network activity as it lets gzip work harder.