Hello,
I'm trying to get the earliest revision dates for a set of approximately 4,000,000 Wikipedia articles. Looking at the MediaWiki API, it seems that the only way to get information on earliest revisions is to query one article at a time. For 4,000,000 articles, this will take far too long...
So, my question is: is there any way to query for first revisions in batches? Alternatively, is there some other source of this information, such as the data dumps? Since I am only looking for the timestamp of the first revision of each article, I'd like to avoid downloading complete histories.
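For concreteness, this is roughly the one-article-at-a-time query I mean (a sketch in Python; the article title is just an example). As far as I can tell, the API rejects rvlimit when you pass multiple titles, which is why I can't batch it:

import requests

# prop=revisions with rvdir=newer returns the oldest revision first,
# so rvlimit=1 with rvprop=timestamp is exactly one first-edit lookup,
# but it costs one HTTP request per article.
r = requests.get("https://en.wikipedia.org/w/api.php", params={
    "action": "query",
    "prop": "revisions",
    "titles": "Albert Einstein",   # one article per request
    "rvdir": "newer",              # oldest first
    "rvlimit": 1,
    "rvprop": "timestamp",
    "format": "json",
})
page = next(iter(r.json()["query"]["pages"].values()))
print(page["revisions"][0]["timestamp"])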
Thanks, Ryan
Ryan Shaw wrote:
So, my question is: is there any way to query for first revisions in batches? Alternatively, is there some other source of this information, such as the data dumps?
You can download the stub dumps. They contain metadata for all revisions, but no revision text.
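If you go that route, a streaming parser keeps memory flat over millions of pages. A rough sketch, assuming the stub-meta-history dump (the filename and the schema namespace URI are placeholders; both vary by wiki and dump version):

import gzip
import xml.etree.ElementTree as ET

# The namespace depends on the dump's export schema version.
NS = "{http://www.mediawiki.org/xml/export-0.3/}"

def first_revisions(path):
    # Yield (title, earliest timestamp) for each page. ISO 8601
    # timestamps sort lexicographically, so string comparison works.
    title, earliest = None, None
    with gzip.open(path, "rb") as f:
        for event, elem in ET.iterparse(f):
            if elem.tag == NS + "title":
                title = elem.text
            elif elem.tag == NS + "timestamp":
                if earliest is None or elem.text < earliest:
                    earliest = elem.text
            elif elem.tag == NS + "page":
                yield title, earliest
                title, earliest = None, None
                elem.clear()   # free the subtree; crucial at this scale

for title, ts in first_revisions("enwiki-latest-stub-meta-history.xml.gz"):
    print(title, ts)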
On Tue, Mar 3, 2009 at 11:08 AM, Ryan Shaw ryanshaw@ischool.berkeley.edu wrote:
So, my question is: is there any way to query for first revisions in batches? Alternatively, is there some other source of this information, such as the data dumps? Since I am only looking for the timestamp of the first revision of each article, I'd like to avoid downloading complete histories.
You could use the query service on the toolserver:
https://wiki.toolserver.org/view/Query_service
The toolserver doesn't keep the content of revisions, but you only want metadata anyway, so that's not a problem.
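The query itself would be simple against the replicated MediaWiki schema. A hedged sketch (the host name is hypothetical, so check the toolserver docs for the real replica hosts; the _p database naming and the page/revision tables are standard):

import MySQLdb

# Hypothetical host; credentials omitted. Replica databases end in _p.
conn = MySQLdb.connect(host="sql-s1", db="enwiki_p")
cur = conn.cursor()
# rev_timestamp is a sortable YYYYMMDDHHMMSS string, so MIN() gives
# the first edit; namespace 0 restricts this to articles.
cur.execute("""
    SELECT page_id, page_title, MIN(rev_timestamp)
    FROM page
    JOIN revision ON rev_page = page_id
    WHERE page_namespace = 0
    GROUP BY page_id, page_title
""")
for page_id, title, first_edit in cur:
    print(page_id, title, first_edit)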
The split-stub dumps and the toolserver are both good ideas.
But suppose you wanted to use the API. The question then becomes: do you want the exact timestamp of each page's creation, or just the date? Assuming you only need the date(s), you might proceed like this:
For every 1000th pageid, get the earliest rev and note the date and time. (If a given ID is missing, i.e. deleted, hunt around it: +1, -1, +2, -2, etc., till you find one.)
To get the date and time for a particular pageid, interpolate between the next higher and lower 1000th. This should pretty much always get you the correct date, with some chance of it being off by one for pages created near midnight UTC.
A bit of testing will show you if this is accurate enough for your purpose; read the 1000ths for a small range, then pick some random pageids within that range and compare the interpolation with the actual first revision time.
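A sketch of that scheme, assuming the standard revisions module (the endpoint, step size, and all helper names are mine):

import requests
from datetime import datetime, timedelta, timezone

API = "https://en.wikipedia.org/w/api.php"   # assumed: English Wikipedia

def first_rev_time(pageid):
    # Earliest revision time for a pageid, or None if it's been deleted.
    r = requests.get(API, params={
        "action": "query", "prop": "revisions", "pageids": pageid,
        "rvdir": "newer", "rvlimit": 1, "rvprop": "timestamp",
        "format": "json",
    }).json()
    page = next(iter(r["query"]["pages"].values()))
    if "revisions" not in page:
        return None
    return datetime.strptime(page["revisions"][0]["timestamp"],
                             "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

def anchor(pageid):
    # Hunt +1, -1, +2, -2, ... past deleted ids, as described above.
    for delta in range(500):
        for candidate in (pageid + delta, pageid - delta):
            if candidate >= 1:
                t = first_rev_time(candidate)
                if t is not None:
                    return t
    raise LookupError("no live pageid near %d" % pageid)

_anchors = {}

def estimated_creation_date(pageid, step=1000):
    # Pageids are assigned in creation order, so linear interpolation
    # between the surrounding anchors approximates the creation time.
    lo = max(1, (pageid // step) * step)
    hi = lo + step
    for k in (lo, hi):
        if k not in _anchors:
            _anchors[k] = anchor(k)
    frac = float(pageid - lo) / (hi - lo)
    span = (_anchors[hi] - _anchors[lo]).total_seconds()
    return (_anchors[lo] + timedelta(seconds=span * frac)).date()

That works out to one anchor request per thousand pageids instead of one request per article, i.e. on the order of thousands of API calls rather than millions.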
best, Robert
On Wed, Mar 4, 2009 at 5:05 AM, Robert Ullmann rlullmann@gmail.com wrote:
For every 1000th pageid, get the earliest rev and note the date and time. (If a given ID is missing, i.e. deleted, hunt around it: +1, -1, +2, -2, etc., till you find one.)
To get the date and time for a particular pageid, interpolate between the next higher and lower 1000th. This should pretty much always get you the correct date, with some chance of it being off by one for pages created near midnight UTC.
That's a clever idea. As it turns out, using the stub dump wasn't bad; I was able to assign earliest revision dates and revision counts to the articles in WEX overnight.