Jameson Scanlon wrote:
I should state some of the following items of information in response to the email correspondence received:
1) Windows version information (I am not providing the full 'winver'
response obtained, because it's probably not necessary – all that I
imagine you'd need to know is the approximate Windows OS version upon
which I am attempting to download the relevant information).
Microsoft (R) Windows
Version 5.1 (Build 2600.xpsp_sp3_gdr.080814-1236 : Service Pack 3)
Copyright (C) 2007 Microsoft Corporation
The information that might actually be relevant is whether the disk
you're trying to download the dump to is using FAT or NTFS. FAT32 only
supports files up to 4 GiB, while NTFS should be able to handle larger
files.
I should have mentioned in my original message that it is sometimes
possible for me to download more than 4 GB, but that (for some reason
or other) the download cuts out (dunno why).
Well, if so, that does kind of suggest that it's not the file system
that's the problem.
3) As a separate point, it occurs to me that one of the reasons why
the download might cut out is that there is a sequence of servers
(according to tracert) upon which I rely for the download to proceed.
I could be wrong, but all it may take is one server (for whatever
reason) deciding that the download is problematic for the whole file
download to fail.
The servers listed by tracert are only passing IP data packets between
your computer and Wikimedia's server. They don't know or care if you're
downloading one big file or several small ones, so they shouldn't make
any difference.
However, if your browser is configured to use a proxy, and the proxy
can't handle large files properly, that could indeed be a problem.
It also seems like a good idea to split large files up using a file
splitter (whichever one takes your fancy), as large file downloads
would seem to be problematic for most people who only have access to
networks with a limited connection speed.
It occurs to me that, given the randomness of this problem, this
response might also be correspondingly random. Still, how long might
it take to organise something in the way of (perhaps Unix-script-
automated?) file splitting for the larger Wikipedia database download
files?
No, it wouldn't be difficult to do at all; the major issue, I'd assume,
is that we'd have to store all the data twice if we wanted to provide
both single file and split versions of the dumps.
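For reference, splitting and rejoining are both one-liners with the
standard Unix tools, so anyone who wants split dumps today can produce
them locally (the file names and sizes below are only illustrative):

```shell
# Create a 1 MiB stand-in for a large dump file.
head -c 1048576 /dev/urandom > dump.xml.bz2
# Split it into 256 KiB pieces: dump.xml.bz2.aa, .ab, .ac, .ad
split -b 262144 dump.xml.bz2 dump.xml.bz2.
# The pieces concatenate back to the original, byte for byte.
cat dump.xml.bz2.a? > rejoined.bz2
cmp dump.xml.bz2 rejoined.bz2 && echo "pieces rejoin cleanly"
```

Since cat restores the file byte for byte, no special joining tool is
needed on the receiving end.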
(Technically, it should be possible to write a PHP script or something
to deliver individual chunks from a single large file, but that'd have
its own complications.)
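For what it's worth, the chunk-serving idea boils down to a plain
byte-range read; a hypothetical server-side script would do little
more than the following (the chunk numbering and size here are
assumptions for illustration):

```shell
# A small file standing in for the single large dump on the server.
printf 'ABCDEFGHIJKLMNOPQRSTUVWXYZ' > dump.bin
# Serving chunk number 1 (zero-based) of 10-byte chunks amounts to
# reading bytes 10..19 of the file:
chunk=1; chunk_size=10
dd if=dump.bin bs=1 skip=$((chunk * chunk_size)) count=$chunk_size 2>/dev/null
# prints KLMNOPQRST
```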
Anyway, if the problem is that the download gets interrupted halfway
through, what you really want to do is use a download client (such as
wget -c) that knows how to resume interrupted downloads from where they
left off. The latest versions of Firefox apparently have some limited
support for that, but I'm not sure if there's any way to get Firefox to
resume a download once it's decided it's failed.
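The usual invocation is just "wget -c" followed by the dump URL. Under
the hood, resuming simply means asking the server for the bytes you
don't yet have (an HTTP Range request) and appending them to the
partial file; the same mechanism can be sketched locally like this:

```shell
# The complete file as it exists on the server.
printf 'part-one:part-two' > full.bin
# Pretend the download died after 9 bytes.
head -c 9 full.bin > partial.bin
# wget -c would send "Range: bytes=9-" and append the server's reply;
# locally, that is equivalent to:
have=$(stat -c%s partial.bin)
tail -c +"$((have + 1))" full.bin >> partial.bin
cmp full.bin partial.bin && echo "resume completed the file"
```

This only works when the server supports range requests, which the
dump servers would need to for any resuming client to help.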
PS – If it were ever the case that BitTorrent were used for the
dissemination of large files (there has been some mention of this on
the Wikipedia database download talk page), I can still imagine that
there might be problems with trying to propagate the WHOLE of such a
large file (~14 GB) – though this assertion might run contrary to other
people's experiences.
Given that people routinely use BitTorrent to download movie files of
several dozen gigabytes, I don't think it should have any problem with
a mere 14 GiB database dump.
Anyhow, it occurs to me that, in the interests of redundancy, it would
be worthwhile to figure out whether there's a way of changing the
structure of the Wikipedia database download so that, even if only the
first 1 GB of the database were downloaded, it would still be possible
to read the information in it (perhaps this is already the case – but,
from what I gather, once an incomplete database dump is downloaded, it
is pretty useless, unless someone can correct me).
Actually, a truncated database dump should be perfectly usable; it just
won't have all the data in it. Indeed, for some purposes, even a piece
from the middle of the dump file can be used to extract useful data,
although many standard tools won't be able to decompress and parse it.
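Assuming a bzip2-compressed dump (the compression format is my
assumption here), this is easy to demonstrate: bzip2 compresses in
independent blocks, so bzcat emits every complete block before it hits
the truncation point. A small sketch:

```shell
# ~300 KB of test data standing in for dump text.
yes "example dump line" | head -c 300000 > original.txt
# Compress with 100 KB blocks (-1) so the archive spans several
# independent bzip2 blocks.
bzip2 -1 -k -f original.txt                # produces original.txt.bz2
# Simulate a download that cut out halfway through.
full=$(stat -c%s original.txt.bz2)
head -c "$((full / 2))" original.txt.bz2 > truncated.bz2
# bzcat decodes every complete block before failing on the damage,
# leaving a readable prefix of the original data in recovered.txt.
bzcat truncated.bz2 > recovered.txt 2>/dev/null || true
```

Similarly, "bzip2recover truncated.bz2" pulls each intact block out of
a damaged file as its own small .bz2 file, which covers the
piece-from-the-middle case.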
--
Ilmari Karonen