cc'ing xmldatadumps-l on this.
Phil Adams wrote:
Hi Tomasz,
Phil (philadams) here from #wikimedia-tech earlier today.
I'm interested in looking at user behaviour on Wikipedia, so I figured
that the en wiki stub-meta-history would be a good place to start. I
grabbed and uncompressed the 2009-07-02 version and started just
exploring it a little. I had a few questions:
* Is this dump supposed to contain ALL revisions to each en wiki page
(articles and user pages in particular)? I ask because when I look at the
revision history for (say) AmericanSamoa, the meta dump shows only 5
or 6 revisions for that page, spread across time from 2001 to 2007.
The en wiki history page online
(http://en.wikipedia.org/w/index.php?title=American_Samoa&action=history)
shows far more edits. What am I missing?
The XML files available for download are snapshots in time of our data
set. When each snapshot runs, the stub step gets a consistent view of
our database at that exact moment. Any revisions made after that will only
be available in the next run.
AmericanSamoa is showing up just like it should in the snapshot because
it's a redirect. If you take a look at
http://en.wikipedia.org/w/index.php?title=AmericanSamoa&action=history
you will notice that it's only had a handful of edits compared to
http://en.wikipedia.org/w/index.php?title=American%20Samoa&action=histo…
Note the space between the two words in the second title.
* Is there any sort of ordering to the history dump? It appears
nominally alphabetic, although it isn't strictly alphabetic.
The ordering is by page id.
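For example, a minimal streaming check, using a tiny hand-written fragment in the shape of a stub dump (the real file is far too large to load whole, and its elements carry a MediaWiki export namespace that is omitted here for brevity):

```python
import xml.etree.ElementTree as ET
from io import StringIO

# Hypothetical two-page fragment: note the titles are NOT alphabetical,
# but the page ids are ascending, matching the dump's actual ordering.
SAMPLE = """<mediawiki>
  <page><title>Zebra</title><id>10</id></page>
  <page><title>Apple</title><id>42</id></page>
</mediawiki>"""

def page_ids(source):
    """Yield page ids in document order, streaming so huge dumps fit in memory."""
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "page":
            # the first <id> under <page> is the page id
            # (each <revision> would carry its own <id>)
            yield int(elem.find("id").text)
            elem.clear()  # free memory as we go

ids = list(page_ids(StringIO(SAMPLE)))
print(ids)                 # [10, 42]
print(ids == sorted(ids))  # True: ordered by page id, not by title
```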
* If I have misunderstood the purpose of the meta dumps, but still
wanted the same information, is my best recourse simply to download the
entire en wiki dump? Does that contain complete revision histories for
all pages?
The only difference between a stub dump and the full-history dump is that
the latter includes the full page content. If you don't need the content,
they are effectively the same.
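To illustrate, a sketch with two hypothetical single-revision fragments (element names follow the MediaWiki export schema, but the namespace and most fields are left out): the revision metadata you would use for behaviour analysis is identical in both; only the text payload differs.

```python
import xml.etree.ElementTree as ET

# In a stub dump, <text> is an empty placeholder carrying only attributes.
STUB = """<revision>
  <id>100</id>
  <timestamp>2009-07-02T00:00:00Z</timestamp>
  <contributor><username>Example</username></contributor>
  <text id="100" bytes="1234" />
</revision>"""

# In a full-history dump, <text> holds the actual wikitext.
FULL = """<revision>
  <id>100</id>
  <timestamp>2009-07-02T00:00:00Z</timestamp>
  <contributor><username>Example</username></contributor>
  <text>Actual wikitext of the revision...</text>
</revision>"""

def metadata(xml):
    """Extract the revision metadata shared by stub and full dumps."""
    rev = ET.fromstring(xml)
    return (rev.findtext("id"),
            rev.findtext("timestamp"),
            rev.findtext("contributor/username"))

print(metadata(STUB) == metadata(FULL))  # True: only <text> differs
```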
--tomasz