[WikiEN-l] extracting protein target infobox information via page export

Andrew Gray andrew.gray at dunelm.org.uk
Wed Jan 19 15:10:23 UTC 2011


On 19 January 2011 14:29, Rajarshi Guha <rajarshi.guha at gmail.com> wrote:
> Hi, I was trying to extract some information from the protein target
> infobox on protein target pages (eg
> http://en.wikipedia.org/wiki/Calreticulin or
> http://en.wikipedia.org/wiki/Hsp90).
>
> However when I export the page via
> http://en.wikipedia.org/w/api.php?action=query&pageids=7120&export=&exportnowrap=
> the XML page does not seem to contain the information that I can see
> when viewing the page in the browser. For example, the XML export for
> Calreticulin does not contain the links to the rendering of the
> structure or the PDB identifiers and so on.
>
> Is my export URL wrong? Or is there a reason that the infobox
> information is not exported and if so, is there a way to access it via
> export?

The XML output is mainly the "plain" wikitext code of the page, rather
than the rendered text version. As a result, you don't get the
rendered version of the infobox, you just get the snippet of code
calling it:

{{PBB|geneid=811}}

This template is surprisingly simple - it takes the "geneid" number
and points to a pre-generated subpage specific to that ID, in this case

http://en.wikipedia.org/wiki/Template:PBB/811

The gallery box at the bottom works in the same way:

{{PDB Gallery|geneid=811}}

directs you to

http://en.wikipedia.org/wiki/Template:PDB_Gallery/811
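
As a rough sketch of that mapping, assuming the pattern seen in the two
examples above holds for other gene IDs (the function name
subpage_title is made up for illustration, it is not part of
MediaWiki):

```python
def subpage_title(template: str, geneid: int) -> str:
    """Build the title of the pre-generated subpage for a template call.

    Assumes the subpage is always "Template:<name>/<geneid>", with spaces
    in the template name written as underscores, as in the two examples.
    """
    return f"Template:{template.replace(' ', '_')}/{geneid}"


print(subpage_title("PBB", 811))          # Template:PBB/811
print(subpage_title("PDB Gallery", 811))  # Template:PDB_Gallery/811
```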

I am not immediately sure why these are separate subpages rather than
an integral part of the article, which is normal for infoboxes -
perhaps because it dissuades well-meaning but erroneous passing
alterations to the data, or because it simplifies maintenance. As
you've noticed, while it's transparent to the reader, it's a little
confusing to work with!

It should be possible for you to pick the geneid number out of your
export and then run an additional export on Template:PBB/$number and
Template:PDB_Gallery/$number. Would that be sufficient?
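
Something along these lines might do it. This is only a sketch, not
tested against the live site: it assumes the template call always looks
like {{PBB|geneid=N}}, and it asks for pages with the API's "titles"
parameter instead of the "pageids" used in your URL. The names
find_geneid and export_url are my own, for illustration only.

```python
import re
from urllib.parse import urlencode

API = "http://en.wikipedia.org/w/api.php"

# Matches {{PBB|geneid=811}}, allowing optional whitespace around the parts.
GENEID_RE = re.compile(r"\{\{\s*PBB\s*\|\s*geneid\s*=\s*(\d+)\s*\}\}")


def find_geneid(wikitext):
    """Return the geneid from the first PBB template call, or None."""
    match = GENEID_RE.search(wikitext)
    return int(match.group(1)) if match else None


def export_url(title):
    """Build an export URL for a page title, mirroring the original query."""
    params = {"action": "query", "titles": title,
              "export": "", "exportnowrap": ""}
    return API + "?" + urlencode(params)


wikitext = "{{PBB|geneid=811}}\n'''Calreticulin''' is a protein ..."
geneid = find_geneid(wikitext)
if geneid is not None:
    for title in (f"Template:PBB/{geneid}", f"Template:PDB_Gallery/{geneid}"):
        print(export_url(title))
```

Fetching each printed URL should then give you the wikitext of the
infobox and gallery subpages, which you would still need to parse for
the PDB identifiers yourself.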

-- 
- Andrew Gray
  andrew.gray at dunelm.org.uk


