Jameson Scanlon wrote:
I should state some of the following items of information in response to the email correspondence received:
1) Windows version information (I am not providing the full 'winver'
response obtained, because it's probably not necessary – all that I
imagine you'd need to know is the approximate Windows OS version upon
which I am attempting to download the relevant information).
Microsoft (R) Windows
Version 5.1 (Build 2600.xpsp_sp3_gdr.080814-1236 : Service Pack 3)
Copyright (C) 2007 Microsoft Corporation
The information that might actually be relevant is whether the disk
you're trying to download the dump to is using FAT or NTFS. FAT32 only
supports files up to 4 GiB, while NTFS should be able to handle larger
files.
I should have mentioned in my original message that it is sometimes
possible for me to download more than 4 GB, but that (for some reason
or other) the download cuts out (dunno why).
Well, if so, that does kind of suggest that it's not the file system
that's the problem.
3) As a separate point, it occurs to me that one of the reasons why
the download might cut out is that there is a sequence of servers
(according to tracert) upon which I rely for the download to proceed.
I could be wrong, but all it may take is one server (for whatever
reason) deciding that the download is problematic for the whole file
download to fail.
The servers listed by tracert are only passing IP data packets between
your computer and Wikimedia's server. They don't know or care if you're
downloading one big file or several small ones, so they shouldn't make
any difference.
However, if your browser is configured to use a proxy, and the proxy
can't handle large files properly, that could indeed be a problem.
It also seems like a good idea to split large files up using a file
splitter (whichever one takes your fancy), as large file downloads
would seem to be problematic for most people who only have access to
networks with a limited connection speed.
It occurs to me that, given the randomness of this problem, this
response might also be correspondingly random. Still, how long might
it take to organise something in the way of (perhaps Unix-script-
automated?) file splitting for the larger Wikipedia database download
files?
No, it wouldn't be difficult to do at all; the major issue, I'd assume,
is that we'd have to store all the data twice if we wanted to provide
both single file and split versions of the dumps.
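For reference, splitting and rejoining are both one-liners with the
standard Unix tools, so anyone who wants split dumps today can produce
them locally (the file names and sizes below are only illustrative):

```shell
# Create a 1 MiB stand-in for a large dump file.
head -c 1048576 /dev/urandom > dump.xml.bz2
# Split it into 256 KiB pieces: dump.xml.bz2.aa, .ab, .ac, .ad
split -b 262144 dump.xml.bz2 dump.xml.bz2.
# The pieces concatenate back to the original, byte for byte.
cat dump.xml.bz2.a? > rejoined.bz2
cmp dump.xml.bz2 rejoined.bz2 && echo "pieces rejoin cleanly"
```

Since cat restores the file byte for byte, no special joining tool is
needed on the receiving end.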
(Technically, it should be possible to write a PHP script or something
to deliver individual chunks from a single large file, but that'd have
its own complications.)
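For what it's worth, the chunk-serving idea boils down to a plain
byte-range read; a hypothetical server-side script would do little
more than the following (the chunk numbering and size here are
assumptions for illustration):

```shell
# A small file standing in for the single large dump on the server.
printf 'ABCDEFGHIJKLMNOPQRSTUVWXYZ' > dump.bin
# Serving chunk number 1 (zero-based) of 10-byte chunks amounts to
# reading bytes 10..19 of the file:
chunk=1; chunk_size=10
dd if=dump.bin bs=1 skip=$((chunk * chunk_size)) count=$chunk_size 2>/dev/null
# prints KLMNOPQRST
```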
Anyway, if the problem is that the download gets interrupted halfway
through, what you really want to do is use a download client (such as
wget -c) that knows how to resume interrupted downloads from where they
left off. The latest versions of Firefox apparently have some limited
support for that, but I'm not sure if there's any way to get Firefox to
resume a download once it's decided it's failed.
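The usual invocation is just "wget -c" followed by the dump URL. Under
the hood, resuming simply means asking the server for the bytes you
don't yet have (an HTTP Range request) and appending them to the
partial file; the same mechanism can be sketched locally like this:

```shell
# The complete file as it exists on the server.
printf 'part-one:part-two' > full.bin
# Pretend the download died after 9 bytes.
head -c 9 full.bin > partial.bin
# wget -c would send "Range: bytes=9-" and append the server's reply;
# locally, that is equivalent to:
have=$(stat -c%s partial.bin)
tail -c +"$((have + 1))" full.bin >> partial.bin
cmp full.bin partial.bin && echo "resume completed the file"
```

This only works when the server supports range requests, which the
dump servers would need to for any resuming client to help.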
PS – If it were ever the case that BitTorrent were used for the
dissemination of large files (there has been some mention of this on
the Wikipedia database download talk page), I can still imagine that
there might be problems with trying to propagate the WHOLE of such a
large file (~14 GB) – though this assertion might run contrary to other
people's experiences.
Given that people routinely use BitTorrent to download movie files of
several dozen gigabytes, I don't think it should have any problem with
a mere 14 GiB database dump.
Anyhow, it occurs to me that, in the interests of redundancy, it would
be worthwhile to figure out whether there's a way of changing the
structure of the Wikipedia database download so that, even if only the
first 1 GB of the database were downloaded, it would still be possible
to read the information in it (perhaps this is already the case – but,
from what I gather, once an incomplete database dump is downloaded, it
is pretty useless, unless someone can correct me).
Actually, a truncated database dump should be perfectly usable; it just
won't have all the data in it. Indeed, for some purposes, even a piece
from the middle of the dump file can be used to extract useful data,
although many standard tools won't be able to decompress and parse it.
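Assuming a bzip2-compressed dump (the compression format is my
assumption here), this is easy to demonstrate: bzip2 compresses in
independent blocks, so bzcat emits every complete block before it hits
the truncation point. A small sketch:

```shell
# ~300 KB of test data standing in for dump text.
yes "example dump line" | head -c 300000 > original.txt
# Compress with 100 KB blocks (-1) so the archive spans several
# independent bzip2 blocks.
bzip2 -1 -k -f original.txt                # produces original.txt.bz2
# Simulate a download that cut out halfway through.
full=$(stat -c%s original.txt.bz2)
head -c "$((full / 2))" original.txt.bz2 > truncated.bz2
# bzcat decodes every complete block before failing on the damage,
# leaving a readable prefix of the original data in recovered.txt.
bzcat truncated.bz2 > recovered.txt 2>/dev/null || true
```

Similarly, "bzip2recover truncated.bz2" pulls each intact block out of
a damaged file as its own small .bz2 file, which covers the
piece-from-the-middle case.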
--
Ilmari Karonen