Hi,
How do I get access to older wikipedia dumps? In particular, I am looking for the dump from 9/11/2006. Any help is much appreciated.
Thanks, Delip
2009/11/20 Delip Rao deliprao@gmail.com
Hi,
How do I get access to older wikipedia dumps? In particular, I am looking for the dump from 9/11/2006. Any help is much appreciated.
I had thought all dumps were archived at download.wikimedia.org - but clearly not. Internet Archive have some <http://www.archive.org/search.php?query=enwiki>, but not the specific one you're looking for. Perhaps one of the techies (eg Greg Maxwell, Domas) keep historical records? But this would be quite an oversight if there are no backups on Wikimedia's servers.
Cormac
On Fri, Nov 20, 2009 at 9:04 AM, Cormac Lawler cormaggio@gmail.com wrote:
I had thought all dumps were archived at download.wikimedia.org - but clearly not. Internet Archive have some http://www.archive.org/search.php?query=enwiki, but not the specific one you're looking for. Perhaps one of the techies (eg Greg Maxwell, Domas) keep historical records? But this would be quite an oversight if there are no backups on Wikimedia's servers.
The closest I appear to have is enwiki-20060702-pages-meta-history.xml.7z
On Fri, Nov 20, 2009 at 9:09 AM, Gregory Maxwell gmaxwell@gmail.com wrote:
On Fri, Nov 20, 2009 at 9:04 AM, Cormac Lawler cormaggio@gmail.com wrote:
I had thought all dumps were archived at download.wikimedia.org - but clearly not. Internet Archive have some http://www.archive.org/search.php?query=enwiki, but not the specific one you're looking for. Perhaps one of the techies (eg Greg Maxwell, Domas) keep historical records? But this would be quite an oversight if there are no backups on Wikimedia's servers.
The closest I appear to have is enwiki-20060702-pages-meta-history.xml.7z
Thanks everyone for the replies. I was looking for the 9/11/2006 dump to reproduce a previous research result. But I think 20060702 should be close enough.
Gregory, do you have a URL where I can access it?
Best, Delip
Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Hoi, At the time, it was announced that the WMF would not retain old backups. This was because of the cost of storage at the time and a perceived lack of value in these backups. Thanks, GerardM
The newer dump should include almost all material from the older dumps, so the older dumps are redundant.
You can just get the fresh dumps and query appropriately.
Hth, denny
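One way to read "query appropriately" is to keep, for each page, the latest revision dated at or before a cutoff. The sketch below is only an illustration of that idea, not a tested tool for real dumps: it assumes a full-history export in the usual MediaWiki XML layout (page, title, revision, timestamp, text), and the function names, the sample data, and the cutoff value are all made up for the example. Pages that were deleted after the cutoff will not be in a newer dump, so they cannot be recovered this way.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace such as '{http://...}page' down to 'page'."""
    return tag.rsplit("}", 1)[-1]


def snapshot(stream, cutoff):
    """Yield (title, timestamp, text) for each page's latest revision at or before cutoff.

    cutoff is an ISO 8601 UTC string such as '2006-09-11T23:59:59Z'. Timestamps
    in that fixed format compare correctly as plain strings.
    """
    title, best = None, None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "revision":
            ts, text = None, ""
            for child in elem:
                if local(child.tag) == "timestamp":
                    ts = child.text
                elif local(child.tag) == "text":
                    text = child.text or ""
            if ts is not None and ts <= cutoff and (best is None or ts > best[0]):
                best = (ts, text)
            elem.clear()  # free memory; real history dumps are very large
        elif name == "page":
            if best is not None:
                yield title, best[0], best[1]
            title, best = None, None
            elem.clear()


# Hypothetical two-page sample in the assumed export layout.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/">
  <page>
    <title>Example</title>
    <revision><timestamp>2006-07-01T10:00:00Z</timestamp><text>old</text></revision>
    <revision><timestamp>2006-09-10T08:00:00Z</timestamp><text>mid</text></revision>
    <revision><timestamp>2007-01-01T00:00:00Z</timestamp><text>new</text></revision>
  </page>
  <page>
    <title>Later</title>
    <revision><timestamp>2008-01-01T00:00:00Z</timestamp><text>x</text></revision>
  </page>
</mediawiki>"""

result = list(snapshot(io.BytesIO(SAMPLE.encode("utf-8")), "2006-09-11T23:59:59Z"))
print(result)
```

The page "Later" is skipped because it has no revision before the cutoff. For a real dump, the stream would be the decompressed file opened in binary mode.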
On Fri, Nov 20, 2009 at 9:25 AM, Denny Vrandecic denny.vrandecic@kit.edu wrote:
The newer dump should include almost all material from the older dumps, so the older dumps are redundant.
Almost redundant :).
You can just get the fresh dumps and query appropriately.
Except for the one that you can't get.
On Nov 20, 2009, at 16:38, Anthony wrote:
On Fri, Nov 20, 2009 at 9:25 AM, Denny Vrandecic denny.vrandecic@kit.edu wrote:
The newer dump should include almost all material from the older dumps, so the older dumps are redundant.
Almost redundant :).
Correct -- there is a small amount of data that is *really* deleted, but my gut feeling is that this is less than 0.1% of all revisions. This would need some evaluation, though.
Or do you mean something else?
You can just get the fresh dumps and query appropriately.
Except for the one that you can't get.
On Fri, Nov 20, 2009 at 10:42 AM, Denny Vrandecic denny.vrandecic@kit.edu wrote:
On Nov 20, 2009, at 16:38, Anthony wrote:
On Fri, Nov 20, 2009 at 9:25 AM, Denny Vrandecic denny.vrandecic@kit.edu wrote:
The newer dump should include almost all material from the older dumps, so the older dumps are redundant.
Almost redundant :).
Correct -- there is a small amount of data that is *really* deleted, but my gut feeling is that this is less than 0.1% of all revisions. This would need some evaluation, though.
Or do you mean something else?
No, that's what I mean, though I'm not sure if it's less than 0.1% (I don't have any guess at all on the percentage). When an article is "deleted" (set as deleted by an admin, which isn't even *really* deleted), all revisions are removed from the public portion of the database, which is where the dump comes from. Then, making up a much much smaller portion of the material that isn't there, there are oversighted revisions and individually deleted revisions.
I believe page moves (after a certain date?) are recorded in the logs. They wouldn't be in the history dump itself, but they could potentially be backed into by reading the logs.
The main thing that would be missing, and that can't be reconstructed from the newer dumps, would be deleted articles. 0.1%, weighted by number of revisions? I have absolutely no idea. I think the number of deleted revisions is available to the public (through a toolserver app) though, so we could probably calculate it.
On Fri, Nov 20, 2009 at 10:57 AM, Anthony wikimail@inbox.org wrote:
The main thing that would be missing, and that can't be reconstructed from the newer dumps, would be deleted articles. 0.1%, weighted by number of revisions? I have absolutely no idea.
By the way, depending on what you're using the data for, this may or may not be significant. For instance, if you're measuring vandalism, even a small percentage of missing data might be significant, because there is likely to be a high correlation between articles that are deleted and articles that were vandalized.
On Fri, Nov 20, 2009 at 16:38, Anthony wikimail@inbox.org wrote:
On Fri, Nov 20, 2009 at 9:25 AM, Denny Vrandecic denny.vrandecic@kit.edu wrote:
The newer dump should include almost all material from the older dumps, so the older dumps are redundant.
Almost redundant :).
You can just get the fresh dumps and query appropriately.
Except for the one that you can't get.
I think the main problem is that for enwiki, only the current page text is included in the dump, not the older revisions.
pages-meta-history.xml is supposed to contain the old revisions, but for enwiki, it can't be downloaded anymore. I believe it simply got too big. For example, the current enwiki dump progress page [1] displays "ETA 2010-02-12 17:21:11" for pages-meta-history.xml.bz2, and the pages for completed dumps, e.g. [2], don't include pages-meta-history.xml at all.
For the smaller wikis, e.g. dewiki [3], pages-meta-history.xml is still available.
Christopher
[1] http://download.wikimedia.org/enwiki/20091103/
[2] http://download.wikimedia.org/enwiki/20091026/
[3] http://download.wikimedia.org/dewiki/20091110/
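The links above, together with the file name enwiki-20060702-pages-meta-history.xml.7z mentioned earlier in the thread, suggest a wiki/date naming pattern. A minimal sketch of that pattern follows. The helper names are hypothetical, the pattern is inferred only from the examples in this thread, and nothing here guarantees that a given directory or file actually exists on the server.

```python
from datetime import date

BASE = "http://download.wikimedia.org"  # host as cited in the thread


def dump_dir_url(wiki, day):
    """Return the directory URL for one dump run, e.g. .../enwiki/20091103/."""
    return f"{BASE}/{wiki}/{day:%Y%m%d}/"


def dump_filename(wiki, day, kind="pages-meta-history", ext="xml.7z"):
    """Return a name shaped like 'enwiki-20060702-pages-meta-history.xml.7z'.

    The default kind and ext are assumptions taken from the one file name
    quoted in the thread; other dump files may use other suffixes.
    """
    return f"{wiki}-{day:%Y%m%d}-{kind}.{ext}"


print(dump_dir_url("enwiki", date(2009, 11, 3)))
print(dump_filename("enwiki", date(2006, 7, 2)))
```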
--- On Fri, 20/11/09, Jona Christopher Sahnwaldt jcsahnwaldt@gmail.com wrote:
From: Jona Christopher Sahnwaldt jcsahnwaldt@gmail.com
pages-meta-history.xml is supposed to contain the old revisions, but for enwiki, it can't be downloaded anymore.
As far as I know, this is not a precise statement. It's not that they can't be downloaded *anymore*, but they can't be retrieved *yet*.
WMF tech staff have been working on this issue over the past months, and they have promised us several times that, as soon as they find an appropriate solution to the complexity of this task, complete dumps for enwiki will be available again for all of us researchers eagerly waiting to get our hands on them :-).
Best, Felipe.