Hi, I was trying to extract some information from the protein target infobox on protein target pages (eg http://en.wikipedia.org/wiki/Calreticulin or http://en.wikipedia.org/wiki/Hsp90).
However when I export the page via http://en.wikipedia.org/w/api.php?action=query&pageids=7120&export=&... the XML page does not seem to contain the information that I can see when viewing the page in the browser. For example, the XML export for Calreticulin does not contain the links to the rendering of the structure or the PDB identifiers and so on.
Is my export URL wrong? Or is there a reason that the infobox information is not exported and if so, is there a way to access it via export?
Thanks,
On 19 January 2011 14:29, Rajarshi Guha rajarshi.guha@gmail.com wrote:
> Hi, I was trying to extract some information from the protein target infobox on protein target pages (eg http://en.wikipedia.org/wiki/Calreticulin or http://en.wikipedia.org/wiki/Hsp90).
> However when I export the page via http://en.wikipedia.org/w/api.php?action=query&pageids=7120&export=&... the XML page does not seem to contain the information that I can see when viewing the page in the browser. For example, the XML export for Calreticulin does not contain the links to the rendering of the structure or the PDB identifiers and so on.
> Is my export URL wrong? Or is there a reason that the infobox information is not exported and if so, is there a way to access it via export?
The XML output is mainly the "plain" wikitext code of the page, rather than the rendered text version. As a result, you don't get the rendered version of the infobox, you just get the snippet of code calling it:
{{PBB|geneid=811}}
This template is surprisingly simple - it takes the "geneid" number and directs to a pre-generated specific subpage, in this case
http://en.wikipedia.org/wiki/Template:PBB/811
The gallery box at the bottom works in the same way:
{{PDB Gallery|geneid=811}}
directs you to
http://en.wikipedia.org/wiki/Template:PDB_Gallery/811
I am not immediately sure why these are separate rather than an integral part of the article, as is normal for infoboxes - perhaps because it dissuades well-meaning but erroneous passing alterations to the data, or because it simplifies maintenance. As you've noticed, while it's transparent to the user, it's a little confusing to work with!
It should be possible for you to pick the geneid number out of your export and then run an additional export on Template:PBB/$number and Template:PDB_Gallery/$number. Would that be sufficient?
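For illustration, the two-step lookup described above could be sketched in Python roughly as follows. This is only a sketch under assumptions: it presumes the article wikitext is already in a string, that the call is written as {{PBB|geneid=N}} with a plain integer, and the names `find_geneid` and `subpage_titles` are made up for this example rather than taken from any existing tool.

```python
import re

# Matches a template call such as {{PBB|geneid=811}} and captures the number.
# Assumption: the geneid parameter is written as a plain integer.
GENEID_RE = re.compile(r"\{\{\s*PBB\s*\|\s*geneid\s*=\s*(\d+)\s*\}\}")


def find_geneid(wikitext):
    """Return the geneid from the first PBB template call, or None."""
    match = GENEID_RE.search(wikitext)
    return match.group(1) if match else None


def subpage_titles(geneid):
    """Build the titles of the two pre-generated template subpages."""
    return [f"Template:PBB/{geneid}", f"Template:PDB_Gallery/{geneid}"]


if __name__ == "__main__":
    # Hypothetical snippet of exported wikitext, for demonstration only.
    sample = "Intro text\n{{PBB|geneid=811}}\n{{PDB Gallery|geneid=811}}"
    geneid = find_geneid(sample)
    print(geneid, subpage_titles(geneid))
```

Each title returned by `subpage_titles` would then be exported in the same way as the article itself.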
On Wed, Jan 19, 2011 at 10:10 AM, Andrew Gray andrew.gray@dunelm.org.uk wrote:
> On 19 January 2011 14:29, Rajarshi Guha rajarshi.guha@gmail.com wrote:
> The XML output is mainly the "plain" wikitext code of the page, rather than the rendered text version. As a result, you don't get the rendered version of the infobox, you just get the snippet of code calling it:
> {{PBB|geneid=811}}
> This template is surprisingly simple - it takes the "geneid" number and directs to a pre-generated specific subpage, in this case
Thanks a lot for the pointer. This pretty much solves my problem - however, if I want to use the export method via the API, I need to provide a pageid - how would I programmatically obtain the pageid for the above page?
On 19 January 2011 15:18, Rajarshi Guha rajarshi.guha@gmail.com wrote:
> Thanks a lot for the pointer. This pretty much solves my problem - however, if I want to use the export method via the API, I need to provide a pageid - how would I programmatically obtain the pageid for the above page?
The API is not my forte, but looking at
http://en.wikipedia.org/w/api.php
suggests that you can use titles rather than pageids, so that
http://en.wikipedia.org/w/api.php?action=query&pageids=7120&export=&...
http://en.wikipedia.org/w/api.php?action=query&titles=Calreticulin&e...
are equivalent.
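As a rough sketch of the title-based approach, a URL of this kind could be built in Python as below. The helper name `export_url` is hypothetical, and only the parameters visible in the URLs above (action, titles, export) are used; whatever followed the "..." in the original URLs is not reproduced here, so the result may need further parameters in practice.

```python
from urllib.parse import urlencode

API_ENDPOINT = "http://en.wikipedia.org/w/api.php"


def export_url(title):
    """Build a query URL that asks the API to export one page by title."""
    params = {
        "action": "query",
        "titles": title,  # page title used in place of a numeric pageid
        "export": "",     # present-but-empty flag, as in the URLs above
    }
    # urlencode percent-escapes characters such as ':' and '/' in the title.
    return API_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(export_url("Calreticulin"))
    print(export_url("Template:PBB/811"))
```

Working from titles means the template subpages can be requested without looking up their pageids first.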
On Wed, Jan 19, 2011 at 3:10 PM, Andrew Gray andrew.gray@dunelm.org.uk wrote:
> I am not immediately sure why these are separate rather than an integral part of the article, as is normal for infoboxes - perhaps because it dissuades well-meaning but erroneous passing alterations to the data, or because it simplifies maintenance. As you've noticed, while it's transparent to the user, it's a little confusing to work with!
I'm curious as well. I'm also curious as to why the user wants to extract this information, given that they should (going by their signature) have access to databases that already have this sort of information (the sort of databases that should be supplying the information in the Wikipedia infoboxes). There probably is a reason, but I can't immediately think of one. Some Wikipedia infoboxes will provide information in a form not found elsewhere, but I don't think the protein infoboxes do, unless they are aggregating from different sources and we are the most convenient marriage of these sources?
Carcharoth
On Jan 19, 2011, at 10:19 AM, Carcharoth wrote:
> On Wed, Jan 19, 2011 at 3:10 PM, Andrew Gray <andrew.gray@dunelm.org.uk> wrote:
> I'm curious as well. I'm also curious as to why the user wants to extract this information, given that they should (going by their signature) have access to databases that already have this sort of information (the sort of databases that should be supplying the information in the Wikipedia infoboxes). There probably is a reason, but I can't immediately think of one.
Partly because Wikipedia has done an aggregation on the multiple data sources
----------------------------------------------------
Rajarshi Guha | NIH Chemical Genomics Center
http://www.rguha.net | http://ncgc.nih.gov
----------------------------------------------------
Say it with flowers - give her a triffid
On Fri, Jan 21, 2011 at 3:58 AM, Rajarshi Guha rajarshi.guha@gmail.com wrote:
> On Jan 19, 2011, at 10:19 AM, Carcharoth wrote:
>> On Wed, Jan 19, 2011 at 3:10 PM, Andrew Gray <andrew.gray@dunelm.org.uk> wrote:
>> I'm curious as well. I'm also curious as to why the user wants to extract this information, given that they should (going by their signature) have access to databases that already have this sort of information (the sort of databases that should be supplying the information in the Wikipedia infoboxes). There probably is a reason, but I can't immediately think of one.
> Partly because Wikipedia has done an aggregation on the multiple data sources
It might be better to extract links to the sources, rather than the actual data itself, which could be in a vandalised state at the time of extraction. What I guess I'm saying is that the data is better obtained from the sources, rather than Wikipedia. Or at the least cross-checking with the sources needs to be done, depending on what the data will be used for.
Carcharoth