cc'ing xmldatadumps-l on this.
Phil Adams wrote:
> hi tomasz,
>
> phil (philadams) here from #wikimedia-tech earlier today.
>
> i'm interested in looking at user behaviour on wikipedia, so i figured that the en wiki stub-meta-history would be a good place to start. i grabbed and uncompressed the 2009 07/02 version, and started just exploring it a little. i had a few questions:
>
> - is this dump supposed to contain ALL revisions to each en wiki page (articles and user pages in particular)? i ask b/c when i look at the revision history for (say) AmericanSamoa, the meta dump shows only 5 or 6 revisions for that page, spread across time from 2001 to 2007. the en wiki history page online (http://en.wikipedia.org/w/index.php?title=American_Samoa&action=history) shows far more edits. what am i missing?
The XML files available for download are snapshots of our data set at a point in time. When each snapshot runs, the stub step gets a consistent view of our database at that exact moment; any revisions made after that will only appear in the next run.
AmericanSamoa is showing up just as it should in the snapshot, because it's a redirect. If you take a look at
http://en.wikipedia.org/w/index.php?title=AmericanSamoa&action=history
then you will notice that it's only had a handful of edits when compared to
http://en.wikipedia.org/w/index.php?title=American%20Samoa&action=histor...
Note the space between the two words in the second title.
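One way to sanity-check this from the dump itself is to stream the stub file and count `<revision>` elements per page title. A minimal sketch, assuming the MediaWiki export element names (the sample data below is made up, and real dumps add an XML namespace that the code simply strips):

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a stub-meta-history dump; real files use the
# MediaWiki export schema with a namespace and many more fields.
SAMPLE = b"""<mediawiki>
  <page>
    <title>AmericanSamoa</title>
    <id>216</id>
    <revision><id>1</id></revision>
    <revision><id>2</id></revision>
  </page>
  <page>
    <title>American Samoa</title>
    <id>1116</id>
    <revision><id>3</id></revision>
    <revision><id>4</id></revision>
    <revision><id>5</id></revision>
  </page>
</mediawiki>"""

def revision_counts(stream):
    """Stream the dump and count <revision> elements per page title."""
    counts = {}
    title = None
    for event, elem in ET.iterparse(stream, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]   # drop any XML namespace prefix
        if tag == "title":
            title = elem.text
        elif tag == "revision":
            counts[title] = counts.get(title, 0) + 1
        elif tag == "page":
            elem.clear()                    # keep memory flat on large dumps
    return counts

print(revision_counts(io.BytesIO(SAMPLE)))
# → {'AmericanSamoa': 2, 'American Samoa': 3}
```

Streaming with `iterparse` rather than loading the whole tree matters here: the uncompressed en wiki stub file is far too large to hold in memory.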
> - is there any sort of ordering to the history dump? it appears nominally alphabetic, although isn't strictly alphabetic.
The ordering is by page id.
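That explains the "nominally alphabetic" impression: page ids are assigned when a page is created, so the order only loosely tracks titles. You can confirm the ordering on a stub file by streaming the pages and checking that the ids never decrease; a sketch under the same assumptions as before (made-up sample, namespace handling simplified):

```python
import io
import xml.etree.ElementTree as ET

# Made-up <page> records: the titles are not alphabetic, but the ids ascend.
IN_ORDER = b"""<mediawiki>
  <page><title>Apple</title><id>10</id></page>
  <page><title>Zebra</title><id>12</id></page>
  <page><title>Banana</title><id>25</id></page>
</mediawiki>"""

def page_ids_ascending(stream):
    """Stream <page> elements and check that page ids never decrease."""
    last = -1
    for event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag.rsplit("}", 1)[-1] == "page":
            pid = int(elem.find("id").text)  # first <id> child is the page id
            if pid < last:
                return False
            last = pid
            elem.clear()
    return True

print(page_ids_ascending(io.BytesIO(IN_ORDER)))
# → True
```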
> - if i have misunderstood the purpose of the meta dumps, but still wanted the same information, is my best recourse simply to d/l the entire en wiki dump? does that contain complete revision histories for all pages?
The only difference between a stub and the full-history dump is that the stub omits the page content; the revision metadata is identical. If you don't need the content, then they are effectively the same.
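So for user-behaviour analysis the stub should suffice: each `<revision>` still carries its id, timestamp, and contributor, and in a stub the `<text>` element typically holds only attributes rather than the wikitext itself. A minimal sketch of pulling that metadata out (the sample record is made up):

```python
import io
import xml.etree.ElementTree as ET

# Stand-in for one page of a stub dump; note the empty <text/> element,
# which is the one difference from the full-history dump.
SAMPLE = b"""<mediawiki>
  <page>
    <title>Example</title>
    <id>42</id>
    <revision>
      <id>100</id>
      <timestamp>2001-05-01T12:00:00Z</timestamp>
      <contributor><username>Alice</username></contributor>
      <text id="100" bytes="1234" />
    </revision>
  </page>
</mediawiki>"""

def revision_metadata(stream):
    """Yield (page_title, rev_id, timestamp, username) tuples from a stub dump."""
    title = None
    for event, elem in ET.iterparse(stream, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]   # drop any XML namespace prefix
        if tag == "title":
            title = elem.text
        elif tag == "revision":
            rev_id = elem.findtext("id")
            ts = elem.findtext("timestamp")
            user = elem.findtext("contributor/username")
            yield (title, rev_id, ts, user)
            elem.clear()

for row in revision_metadata(io.BytesIO(SAMPLE)):
    print(row)
# → ('Example', '100', '2001-05-01T12:00:00Z', 'Alice')
```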
--tomasz
xmldatadumps-l@lists.wikimedia.org