Hi, I noticed on http://download.wikimedia.org/enwiki/20070402/ that the ETA for the next history dump is May 19 and I have no reason to suspect this is wrong. Once the bz2 dump is finished, the 7zip dump will begin, and this is usually much faster. Last time, it took about ten days to complete after the bz2 dump had finished.
Is the 7zip dump generated from the bz2 dump or from the database? If it's generated from the database, I'd propose generating the 7z dump first, and then the bz2. That way, we'd all have new English data to play with a whole month earlier.
Thanks, - Dan
Dan Vanderkam wrote:
> Hi, I noticed on http://download.wikimedia.org/enwiki/20070402/ that the ETA for the next history dump is May 19 and I have no reason to suspect this is wrong. Once the bz2 dump is finished, the 7zip dump will begin, and this is usually much faster. Last time, it took about ten days to complete after the bz2 dump had finished.
> Is the 7zip dump generated from the bz2 dump or from the database?
From the bz2 dump.
7zip is much slower to compress than bz2 (and the bz2 compression is additionally parallelized). The primary expense at the moment is pulling everything out of the database: as the system realigns itself it currently has to grab a fresh copy of everything, some steps may be slower than they need to be, and older items take longer to process than newer ones, so the estimate may not be accurate.
-- brion vibber (brion @ wikimedia.org)
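A minimal sketch of what that bz2-to-7z recompression can look like, assuming the stock bzip2 and 7-Zip command-line tools; the filename follows the standard dump naming convention, and the options are an assumption, not the ones the dump scripts actually use:

  # Stream the finished bz2 history dump straight into 7-Zip's LZMA encoder;
  # -si reads the XML from stdin, -mx=5 is a guessed compression level.
  bzcat enwiki-20070402-pages-meta-history.xml.bz2 \
    | 7za a -si -mx=5 enwiki-20070402-pages-meta-history.xml.7z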
It's interesting that this process is input-bound; I wouldn't have expected that. I've been monitoring the bz2 dump with
curl -I http://download.wikimedia.org/enwiki/20070402/enwiki-20070402-pages-meta-his...
It's grown by 5GB in the past day, which implies it would reach ~90GB in about 15 days, well before the listed ETA of May 19. Here's hoping...
- Dan
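A rough way to script that kind of size check, purely as a sketch: DUMP_URL is a hypothetical stand-in for the (truncated) pages-meta-history URL above, and the hourly polling interval and growth arithmetic are assumptions:

  #!/bin/sh
  # Poll the dump file's Content-Length once an hour and report how much it grew.
  DUMP_URL="$1"   # hypothetical placeholder for the dump file URL
  prev=0
  while true; do
    size=$(curl -sI "$DUMP_URL" | tr -d '\r' | awk 'tolower($1)=="content-length:" {print $2}')
    echo "$(date -u)  size=${size} bytes  (+$((size - prev)) since last check)"
    prev=$size
    sleep 3600
  done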