Guess this problem might have been asked several (many) times before... but
I tried to download the pages-articles.xml.bz2 file, which is approximately 5.x GB. However, every version I tried failed at about 1.5 GB. I noticed someone posted Python code online as a workaround, but it did not work for me.
My machine is Windows Server 2008, 64-bit. Any idea how to get this huge file?
Appreciated. Rob
On Tue, Dec 15, 2009 at 12:25 PM, Rob Giberson <ajaxdwr@gmail.com> wrote:
> Any idea how to get this huge file?
You wouldn't happen to have proxies between you and your internet connection, would you?
-Peachey
You got that right. I am behind a company proxy. Any idea how to get around this?
On Mon, Dec 14, 2009 at 8:09 PM, K. Peachey <p858snake@yahoo.com.au> wrote:
> You wouldn't happen to have proxies between you and your internet connection, would you?
I failed several times and ended up downloading exactly the same number of bytes, 1,465,454 KB (1.46 GB), even for different versions of that file.
Is it possible that the files themselves are corrupted? Has anyone successfully downloaded one of these big files recently?
Rob
This is likely your company proxy not supporting huge downloads; maybe its cache partition filled up.
Marco
2009/12/15 Rob Giberson <ajaxdwr@gmail.com>:
> I failed several times and ended up downloading exactly the same number of bytes, 1,465,454 KB (1.46 GB), even for different versions of that file.
Hmm... I was just able to download another file from the enwiki list, which is 2.4 GB, without any problem. Thoughts?
On Tue, Dec 15, 2009 at 10:37 AM, Marco Schuster <marco@harddisk.is-a-geek.org> wrote:
> This is likely your company proxy not supporting huge downloads; maybe its cache partition filled up.
Rob Giberson wrote:
> Hmm... I was just able to download another file from the enwiki list, which is 2.4 GB, without any problem. Thoughts?
enwiki-20091128-pages-articles.xml.bz2 is 5,795,591,893 bytes. Store that in a 4-byte long int variable, then retrieve it: you get 1,500,624,597, which is ~1.5 GB (1.39 GiB).
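You can see the truncation in a couple of lines of Python (a minimal sketch; the modulo stands in for what an unsigned 32-bit counter does when it overflows):

    size = 5795591893             # enwiki-20091128-pages-articles.xml.bz2, in bytes
    truncated = size % (1 << 32)  # what survives in an unsigned 4-byte integer
    print(truncated)              # 1500624597
    print(truncated / 2.0 ** 30)  # ~1.397, i.e. about 1.39 GiB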
Your company proxy apparently can't handle files bigger than 4 GB; it only keeps the size modulo 2^32.
Perhaps you could fool it by issuing Range requests for the remainder (you can use wget -c, which does this for you). In any case, your company should get its proxy fixed.
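For what it's worth, here is roughly what wget -c does, as a Python sketch (assumptions: the URL and output filename below are placeholders for whichever dump you want, and both the server and the proxy have to honor the Range header for this to help):

    import os
    import urllib.request

    def resume_download(url, dest, chunk_size=1 << 20):
        # Resume from wherever the partial file left off.
        offset = os.path.getsize(dest) if os.path.exists(dest) else 0
        req = urllib.request.Request(url, headers={"Range": "bytes=%d-" % offset})
        # A server honoring Range answers 206 Partial Content with only
        # the missing tail, which we append to the partial file.
        with urllib.request.urlopen(req) as resp, open(dest, "ab") as out:
            while True:
                chunk = resp.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)

    # Placeholder URL and filename; substitute the dump you actually need.
    resume_download("http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2",
                    "enwiki-latest-pages-articles.xml.bz2")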
On 12/15/2009 10:32 AM, Rob Giberson wrote:
> I failed several times and ended up downloading exactly the same number of bytes, 1,465,454 KB (1.46 GB), even for different versions of that file. Is it possible that the files themselves are corrupted?
Just to confirm that this works fine, I started up an EC2 instance and successfully downloaded this file:
http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml...
I ended up with 5.4 GB of data and an MD5 checksum of 802c79045801bfb5f77fab3170af8efc. To be sure the file was uncorrupted, I ran it through bzcat and got a checksum of 5dd71caf9ed0b3387b351783325b788a.
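If you want to run the same check without shelling out to md5sum and bzcat, here is a small Python sketch (the filename is a placeholder; compare the first checksum against the MD5 published alongside the dump):

    import bz2
    import hashlib

    def md5sum(path, decompress=False, chunk_size=1 << 20):
        # Stream the file through MD5 so the multi-GB dump never sits in memory.
        h = hashlib.md5()
        opener = bz2.open if decompress else open
        with opener(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    print(md5sum("enwiki-latest-pages-articles.xml.bz2"))                   # the .bz2 itself
    print(md5sum("enwiki-latest-pages-articles.xml.bz2", decompress=True))  # like bzcat | md5sum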
William