Hi Brian, Brion once explained to me that the post processing of the dump is the main bottleneck.
Compressing articles with tens of thousands of revisions is a major resource drain. Right now every dump is even compressed twice, into bzip2 format (for wider platform compatibility) and 7zip format (for roughly 20 times smaller downloads). This may no longer be needed, as 7zip has presumably gained better support on major platforms over the years. Apart from that, the job could benefit from parallelization and better error recovery.
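As a rough sketch of what that parallelization might look like (the chunk layout and file names here are hypothetical, and Python's lzma module produces .xz rather than true .7z archives, though it uses the same LZMA compression family as 7zip):

import bz2
import lzma
from multiprocessing import Pool
from pathlib import Path

# Hypothetical layout: the XML dump already split into per-chunk files.
CHUNKS = sorted(Path("dump-chunks").glob("*.xml"))

def compress_chunk(path: Path) -> None:
    data = path.read_bytes()
    # bzip2 copy for wider platform compatibility
    path.with_name(path.name + ".bz2").write_bytes(bz2.compress(data, 9))
    # LZMA copy for much smaller downloads
    path.with_name(path.name + ".xz").write_bytes(lzma.compress(data, preset=9))

if __name__ == "__main__":
    with Pool() as pool:  # one worker per CPU core by default
        pool.map(compress_chunk, CHUNKS)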
Erik Zachte
________________________________________
I am still quite shocked at the amount of time the English Wikipedia takes to dump, especially since we seem to have close links to folks who work at MySQL. To me it seems that one of two things must be the case:
1. Wikipedia has outgrown MySQL, in the sense that, while we can put data in, we cannot get it all back out.
2. Despite aggressive hardware purchases over the years, the correct hardware has still not been purchased.
I wonder which of these is the case. Presumably #2?
Cheers, Brian
Interesting. I realize that the dump is extremely large, but if 7zip is really the bottleneck then to me the solutions are straightforward:
1. Offer an uncompressed version of the dump for download. Bandwidth is cheap and downloads can be resumed, unlike this dump process.
2. The WMF offers a service whereby they mail the uncompressed dump to you on a hard drive. You pay for the drive and a service charge.
Cheers,
Also, I wonder if these folks have been consulted for their expertise in compressing Wikipedia data: http://prize.hutter1.net/
On Wed, Dec 24, 2008 at 4:09 PM, Brian Brian.Mingus@colorado.edu wrote:
Interesting. I realize that the dump is extremely large, but if 7zip is really the bottleneck then to me the solutions are straightforward:
1. Offer an uncompressed version of the dump for download. Bandwidth is cheap and downloads can be resumed, unlike this dump process.
2. The WMF offers a service whereby they mail the uncompressed dump to you on a hard drive. You pay for the drive and a service charge.
I would estimate a complete, uncompressed enwiki dump in the present format at ~3 TB in size. ruwiki, which has about 5% as many revisions as enwiki, has a 187 GB uncompressed dump.
At 3 TB, virtually any mechanism of distributing an uncompressed dump would be very problematic.
7zip currently achieves greater than 99% size reduction.
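A quick check of that extrapolation (assuming, as a rough approximation, that dump size scales linearly with revision count):

# Rough extrapolation from ruwiki to enwiki, assuming size scales
# roughly linearly with revision count.
ruwiki_gb = 187
ruwiki_share_of_enwiki_revisions = 0.05

enwiki_tb = ruwiki_gb / ruwiki_share_of_enwiki_revisions / 1000
print(f"~{enwiki_tb:.1f} TB uncompressed")  # ~3.7 TB, i.e. on the order of 3 TB

# A >99% reduction would put the 7zipped history somewhere under ~40 GB.
print(f"<{enwiki_tb * 1000 * 0.01:.0f} GB after 7zip")  # <37 GB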
-Robert Rohde
Hi Robert,
I'm not sure I agree with you...
3 terabytes / (10 megabytes per second) ≈ 3.64 days
That is, on my university connection I could download the dump in just a few days. The only cost is bandwidth.
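For what it's worth, that figure works out if you take binary units (3 TiB at a sustained 10 MiB/s; decimal units give about 3.47 days instead):

# Download time for a ~3 TB dump at a sustained 10 MB/s.
# Binary units (TiB / MiB) reproduce the 3.64-day figure above.
size_bytes = 3 * 2**40           # 3 TiB
rate_bytes_per_s = 10 * 2**20    # 10 MiB/s
days = size_bytes / rate_bytes_per_s / 86_400
print(f"{days:.2f} days")        # 3.64 days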
While you might be correct, most connections are reported in megaBITS per second. For example, AT&T's highest grade of residential DSL service is 6 Mbps, which would result in a 46-day download. Comcast goes up to 16 Mbps, which is 17 days.
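The same arithmetic in bits rather than bytes (assuming the ~3 TB uncompressed size from above and a fully saturated link):

# Download time for ~3 TB when link speeds are quoted in megabits per second.
# Assumes the connection stays fully saturated the whole time.
size_bits = 3 * 10**12 * 8  # ~3 TB in bits

for name, mbps in [("6 Mbps DSL", 6), ("16 Mbps cable", 16)]:
    days = size_bits / (mbps * 10**6) / 86_400
    print(f"{name}: {days:.0f} days")  # ~46 days and ~17 days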
-Robert Rohde
But at least this would allow Erik, researchers and archivers to get the dump faster than they can get the compressed version. The number of people who want this can't be > 100, can it? It would need to be metered by an API I guess.
Cheers, Brian
I'm also curious: what is the estimated amount of time to decompress this thing?
Somewhere around 1 week, I'd guesstimate.
-Robert Rohde
2008/12/25 Brian Brian.Mingus@colorado.edu:
But at least this would allow Erik, researchers and archivers to get the dump faster than they can get the compressed version. The number of people who want this can't be > 100, can it? It would need to be metered by an API I guess.
Maybe we can run a sneakernet of DLTs. The Florida sysadmins run off a stack of tapes, they send those to someone to run off copies and distribute to the next layer, and so on ...
- d.
I'd more be thinking of handing over a stack of hard drives to Wikimedia chapter reps at Wikimania.
2008/12/25 geni geniice@gmail.com:
I'd more be thinking of handing over a stack of hard drives to Wikimedia chapter reps at Wikimania.
2 TB external hard disk, gzip on the fly (gzipping is faster than the network - remember, Wikimedia gzips data going between internal servers in the same rack because CPU is cheaper than network!) - USB 2.0 is 480 Mbit/sec, that's 60 MB/sec, which makes a gzipped dump about 9 hours 20 minutes per terabyte assuming a near-perfect USB interface. Long-winded, but all we need is to custom-build a hard disk duplicator for I/O efficiency. A simple matter of hardware design. Then it should be no more inherently painful than duplicating VHS video tapes.
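The 9 hours 20 minutes per terabyte figure works out if the source read and destination write are assumed to share the same USB 2.0 bus (an assumption; a duplicator with independent channels would be roughly twice as fast):

# Time to copy 1 TB over USB 2.0, assuming source reads and destination
# writes share the same ~60 MB/s bus; independent channels would halve this.
bus_bytes_per_s = 480e6 / 8           # 480 Mbit/s ~= 60 MB/s
effective_rate = bus_bytes_per_s / 2  # read + write share the bus
hours = 1e12 / effective_rate / 3600
print(f"{hours:.1f} hours per terabyte")  # ~9.3 hours, about 9h20m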
- d.
On Wed, Dec 24, 2008 at 7:09 PM, Brian Brian.Mingus@colorado.edu wrote:
Interesting. I realize that the dump is extremely large, but if 7zip is really the bottleneck then to me the solutions are straightforward:
1. Offer an uncompressed version of the dump for download. Bandwidth is cheap and downloads can be resumed, unlike this dump process.
2. The WMF offers a service whereby they mail the uncompressed dump to you on a hard drive. You pay for the drive and a service charge.
I'm pretty sure that both of those would be less straightforward than actually fixing the dump process properly. That something isn't getting done doesn't necessarily mean that new plans for fixing it are needed -- it could be that the current plan is best but needs more resources.
On Wed, Dec 24, 2008 at 9:46 PM, David Gerard dgerard@gmail.com wrote:
Long-winded, but all we need is to custom-build a hard disk duplicator for I/O efficiency.
Such things already exist. They're called "hot-swappable RAID 1".
2008/12/25 Erik Zachte erikzachte@infodisiac.com:
Right now every dump is even compressed twice, into bzip2 (for wider platform compatibility) and 7zip format (for 20 times smaller downloads). This may no longer be needed as 7zip presumably gained better support on major platforms over the years.
7zip is readily available as free software for Unix-like platforms, though it's pretty much never installed by default.
- d.