Hi,
Here at CU we work with corpora of text to train models that 'understand' language (see, e.g., LSA.colorado.edu). We wanted to use Wikipedia to create a copyright-free corpus of text that anyone in the scientific community could use. To do that, we downloaded the DB dumps a while ago (about 2 billion words), but we lost them due to a computer problem.
I have noticed that the link to the full English database (2280MB): http://download.wikipedia.org/archives/en/20031125_old_table.sql.bz2
doesn't work anymore; it returns a Forbidden error saying that you don't have permission to access /archives/en/20031125_old_table.sql.bz2 on this server.
Could you please grant us access to the file?
Thanks a lot in advance, -Jose
On Nov 27, 2003, at 14:35, Jose Quesada wrote:
I have noticed that the link to the full english database (2280MB): http://download.wikipedia.org/archives/en/20031125_old_table.sql.bz2
Yes, this needs to get marked appropriately on the download page, sorry!
The version of Apache we're running absolutely refuses to deal with files over 2 gigabytes. Until this is fixed, a copy of the file is available split in two pieces as:
http://download.wikipedia.org/archives/en/xaa
http://download.wikipedia.org/archives/en/xab
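For anyone fetching the two pieces: the names xaa and xab look like the default output of the Unix split utility, so presumably concatenating them in that order gives back the original .bz2 file. A minimal sketch in Python, assuming both pieces have already been downloaded into the current directory; the function name join_pieces and the output filename are only illustrative choices.

```python
import shutil
from pathlib import Path


def join_pieces(pieces, dest):
    """Concatenate the given files, in order, into dest.

    Returns the total number of bytes in the input pieces.
    """
    total = 0
    with open(dest, "wb") as out:
        for piece in pieces:
            with open(piece, "rb") as src:
                # Copy in 1 MiB chunks so a >2 GB file never sits in memory.
                shutil.copyfileobj(src, out, 1024 * 1024)
            total += Path(piece).stat().st_size
    return total


if __name__ == "__main__":
    # Assumed local filenames; order matters (xaa first, then xab).
    size = join_pieces(["xaa", "xab"], "20031125_old_table.sql.bz2")
    print(f"wrote {size} bytes")
```

If the joined file does not decompress cleanly, the most likely causes are a truncated download of one piece or the pieces being joined in the wrong order.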
-- brion vibber (brion @ pobox.com)
wikitech-l@lists.wikimedia.org