extracting protein target infobox information via page export

Rajarshi Guha

19 Jan 2011

4:29 p.m.

Hi, I was trying to extract some information from the protein target infobox on protein target pages (e.g. http://en.wikipedia.org/wiki/Calreticulin or http://en.wikipedia.org/wiki/Hsp90).

However, when I export the page via http://en.wikipedia.org/w/api.php?action=query&pageids=7120&export=&... the XML page does not seem to contain the information that I can see when viewing the page in the browser. For example, the XML export for Calreticulin does not contain the links to the rendering of the structure, the PDB identifiers, and so on.

Is my export URL wrong? Or is there a reason that the infobox information is not exported and if so, is there a way to access it via export?

Thanks,

-- Rajarshi Guha NIH Chemical Genomics Center

Andrew Gray

19 Jan

5:10 p.m.

On 19 January 2011 14:29, Rajarshi Guha rajarshi.guha@gmail.com wrote:

Hi, I was trying to extract some information from the protein target infobox on protein target pages (e.g. http://en.wikipedia.org/wiki/Calreticulin or http://en.wikipedia.org/wiki/Hsp90).

However, when I export the page via http://en.wikipedia.org/w/api.php?action=query&pageids=7120&export=&... the XML page does not seem to contain the information that I can see when viewing the page in the browser. For example, the XML export for Calreticulin does not contain the links to the rendering of the structure, the PDB identifiers, and so on.

Is my export URL wrong? Or is there a reason that the infobox information is not exported and if so, is there a way to access it via export?

The XML output is mainly the "plain" wikitext code of the page, rather than the rendered text version. As a result, you don't get the rendered version of the infobox, you just get the snippet of code calling it:

{{PBB|geneid=811}}

This template is surprisingly simple - it takes the "geneid" number and directs to a pre-generated specific subpage, in this case

http://en.wikipedia.org/wiki/Template:PBB/811
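
As a rough sketch of that lookup: the snippet below pulls the geneid out of exported wikitext with a regular expression and builds the title of the matching subpage. The names `find_geneid` and `pbb_subpage_title` are illustrative only, and the pattern assumes the template is always called in the `{{PBB|geneid=811}}` form quoted in this thread; other parameter layouts would need a looser pattern.

```python
import re

# Matches a call such as {{PBB|geneid=811}}, tolerating optional whitespace.
# Assumption: geneid is the only parameter, as in the example in this thread.
PBB_CALL = re.compile(r"\{\{\s*PBB\s*\|\s*geneid\s*=\s*(\d+)\s*\}\}")


def find_geneid(wikitext):
    """Return the geneid of the first PBB call in the wikitext, or None."""
    match = PBB_CALL.search(wikitext)
    return int(match.group(1)) if match else None


def pbb_subpage_title(geneid):
    """Build the title of the pre-generated infobox subpage for a geneid."""
    return f"Template:PBB/{geneid}"


# Hypothetical fragment of exported wikitext, for illustration only.
sample = "{{PBB|geneid=811}}\n'''Calreticulin''' is a protein ..."
geneid = find_geneid(sample)
print(geneid, pbb_subpage_title(geneid))
```

Returning `None` when no call is found lets the caller skip pages that do not use this template instead of failing on them.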

The gallery box at the bottom works in the same way; its template call directs you to

http://en.wikipedia.org/wiki/Template:PDB_Gallery/811

I am not immediately sure why these are separate rather than integrally part of the article, which is normal for infoboxes - perhaps because it dissuades well-meaning but erroneous passing alterations to the data, or because it simplifies maintenance. As you've noticed, while it's transparent to the user, it's a little confusing to work with!

It should be possible for you to pick the geneid number out of your export and then run an additional export on Template:PBB/$number and Template:PDB_Gallery/$number. Would that be sufficient?
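
A minimal sketch of that second export request, for illustration: it builds one API URL covering both subpages, using the PDB_Gallery name from the URL given earlier in this message. The function name `subpage_export_url` is made up here, and `format=xml` is an assumption, since the export URL in the original question is truncated. Joining several titles with `|` in one query is standard MediaWiki API behaviour.

```python
from urllib.parse import urlencode

API = "http://en.wikipedia.org/w/api.php"


def subpage_export_url(geneid):
    """Build an export URL for the infobox and gallery subpages of a geneid."""
    titles = "|".join([
        f"Template:PBB/{geneid}",
        f"Template:PDB_Gallery/{geneid}",
    ])
    params = {
        "action": "query",
        "titles": titles,
        "export": "",     # valueless flag, sent as "export="
        "format": "xml",  # assumption: not visible in the truncated URL
    }
    return API + "?" + urlencode(params)


print(subpage_export_url(811))
```

`urlencode` percent-encodes the colon, slash, and pipe characters in the titles, so the result can be passed straight to an HTTP client.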

-- - Andrew Gray andrew.gray@dunelm.org.uk

Rajarshi Guha

5:18 p.m.

On Wed, Jan 19, 2011 at 10:10 AM, Andrew Gray andrew.gray@dunelm.org.uk wrote:

On 19 January 2011 14:29, Rajarshi Guha rajarshi.guha@gmail.com wrote:

The XML output is mainly the "plain" wikitext code of the page, rather than the rendered text version. As a result, you don't get the rendered version of the infobox, you just get the snippet of code calling it:

{{PBB|geneid=811}}

This template is surprisingly simple - it takes the "geneid" number and directs to a pre-generated specific subpage, in this case

http://en.wikipedia.org/wiki/Template:PBB/811

Thanks a lot for the pointer. This pretty much solves my problem - however, if I want to use the export method via the API, I need to provide a pageid - how would I programmatically obtain the pageid for the above page?

-- Rajarshi Guha NIH Chemical Genomics Center

Andrew Gray

5:26 p.m.

On 19 January 2011 15:18, Rajarshi Guha rajarshi.guha@gmail.com wrote:

Thanks a lot for the pointer. This pretty much solves my problem - however, if I want to use the export method via the API, I need to provide a pageid - how would I programmatically obtain the pageid for the above page?

The API is not my forte, but looking at

http://en.wikipedia.org/w/api.php

suggests that you can use titles rather than pageids, so that

http://en.wikipedia.org/w/api.php?action=query&pageids=7120&export=&...
http://en.wikipedia.org/w/api.php?action=query&titles=Calreticulin&e...

are equivalent.
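
For illustration, a small sketch of both forms of the query, plus one way to read a pageid back if it is still wanted. The helper names `query_url` and `pageids_from_response` are hypothetical, and the sample response is hand-written in the shape the API returns for `format=json` in its default (legacy) layout, not captured from a live request.

```python
import json
from urllib.parse import urlencode

API = "http://en.wikipedia.org/w/api.php"


def query_url(**params):
    """Build an action=query URL from keyword parameters."""
    return API + "?" + urlencode({"action": "query", **params})


def pageids_from_response(body):
    """Map each page title to its pageid in a JSON query response."""
    pages = json.loads(body)["query"]["pages"]
    # Missing pages carry no "pageid" key, so they are skipped.
    return {p["title"]: p["pageid"] for p in pages.values() if "pageid" in p}


by_id = query_url(pageids=7120, export="")
by_title = query_url(titles="Calreticulin", export="")
print(by_id)
print(by_title)

# Hand-written sample response, assumed shape for format=json.
sample = '{"query": {"pages": {"7120": {"pageid": 7120, "ns": 0, "title": "Calreticulin"}}}}'
print(pageids_from_response(sample))
```

Since the export accepts titles directly, the pageid lookup is optional; it is shown only because the question asked for it.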

-- - Andrew Gray andrew.gray@dunelm.org.uk

Carcharoth

5:19 p.m.

On Wed, Jan 19, 2011 at 3:10 PM, Andrew Gray andrew.gray@dunelm.org.uk wrote:

I am not immediately sure why these are separate rather than integrally part of the article, which is normal for infoboxes - perhaps because it dissuades well-meaning but erroneous passing alterations to the data, or because it simplifies maintenance. As you've noticed, while it's transparent to the user, it's a little confusing to work with!

I'm curious as well. I'm also curious as to why the user wants to extract this information, given that they should (going by their signature) have access to databases that already have this sort of information (the sort of databases that should be supplying the information in the Wikipedia infoboxes). There probably is a reason, but I can't immediately think of one. Some Wikipedia infoboxes will provide information in a form not found elsewhere, but I don't think the protein infoboxes do, unless they are aggregating from different sources and we are the most convenient marriage of these sources?

Carcharoth

Rajarshi Guha

21 Jan

5:58 a.m.

On Jan 19, 2011, at 10:19 AM, Carcharoth wrote:

On Wed, Jan 19, 2011 at 3:10 PM, Andrew Gray <andrew.gray@dunelm.org.uk> wrote:

I'm curious as well. I'm also curious as to why the user wants to extract this information, given that they should (going by their signature) have access to databases that already have this sort of information (the sort of databases that should be supplying the information in the Wikipedia infoboxes). There probably is a reason, but I can't immediately think of one.

Partly because Wikipedia has already aggregated the multiple data sources.

----------------------------------------------------
Rajarshi Guha | NIH Chemical Genomics Center
http://www.rguha.net | http://ncgc.nih.gov
----------------------------------------------------
Say it with flowers - give her a triffid

Carcharoth

12:30 p.m.

On Fri, Jan 21, 2011 at 3:58 AM, Rajarshi Guha rajarshi.guha@gmail.com wrote:

On Jan 19, 2011, at 10:19 AM, Carcharoth wrote:

On Wed, Jan 19, 2011 at 3:10 PM, Andrew Gray <andrew.gray@dunelm.org.uk> wrote:

I'm curious as well. I'm also curious as to why the user wants to extract this information, given that they should (going by their signature) have access to databases that already have this sort of information (the sort of databases that should be supplying the information in the Wikipedia infoboxes). There probably is a reason, but I can't immediately think of one.

Partly because Wikipedia has already aggregated the multiple data sources.

It might be better to extract links to the sources, rather than the actual data itself, which could be in a vandalised state at the time of extraction. What I guess I'm saying is that the data is better obtained from the sources, rather than Wikipedia. Or at the least cross-checking with the sources needs to be done, depending on what the data will be used for.
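
One way to act on this, sketched under assumptions: collect the external URLs that appear in a page's wikitext, so the sources themselves can be queried or cross-checked. The function name `external_links` and the sample text are hypothetical, and the pattern is a deliberately simple one that stops at whitespace and common wikitext delimiters. It will not resolve identifiers that templates turn into links at render time.

```python
import re

# A URL runs until whitespace or a wikitext/HTML delimiter.
URL = re.compile(r'https?://[^\s\]|}<>"]+')


def external_links(wikitext):
    """Return the distinct external URLs in the wikitext, in order of appearance."""
    seen = []
    for url in URL.findall(wikitext):
        if url not in seen:
            seen.append(url)
    return seen


# Hypothetical wikitext, for illustration only.
sample = (
    "[http://www.rcsb.org/ PDB] and <ref>http://www.uniprot.org/</ref> "
    "[http://www.rcsb.org/ again]"
)
print(external_links(sample))
```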

Carcharoth

wikien-l@lists.wikimedia.org

participants (3)

Andrew Gray
Carcharoth
Rajarshi Guha