[Xmldatadumps-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
Felipe Ortega
glimmer_phoenix at yahoo.es
Wed Mar 17 23:50:04 UTC 2010
--- On Tue, 16/3/10, Kevin Webb <kpwebb at gmail.com> wrote:
> From: Kevin Webb <kpwebb at gmail.com>
> Subject: Re: [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
> To: "Tomasz Finc" <tfinc at wikimedia.org>
> CC: "Wikimedia developers" <wikitech-l at lists.wikimedia.org>, xmldatadumps-admin-l at lists.wikimedia.org, Xmldatadumps-l at lists.wikimedia.org
> Date: Tuesday, March 16, 2010 21:10
> I just managed to finish decompression. That took about 54 hours on an
> EC2 2.5x unit CPU. The final data size is 5469GB.
>
> As the process just finished, I haven't been able to check the
> integrity of the XML; however, the bzip stream itself appears to be
> good.
>
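A minimal sketch of how the bzip stream can be checked without keeping the 5469GB of output on disk: decompress in fixed-size chunks, discard the data, and count the bytes. This uses Python's standard bz2 module; the function name and the chunk size are illustrative assumptions, and this is not the tool Kevin used. It only validates the compressed stream, not the XML inside it.

```python
import bz2


def check_bz2_stream(path, chunk_size=1 << 20):
    """Decompress `path` chunk by chunk and discard the output.

    Returns (ok, decompressed_bytes). `ok` is False when the stream is
    truncated (EOFError) or contains invalid data (OSError).
    The name and the 1 MiB default chunk size are hypothetical choices.
    """
    total = 0
    try:
        with bz2.open(path, "rb") as stream:
            while True:
                chunk = stream.read(chunk_size)
                if not chunk:
                    break
                total += len(chunk)
    except (EOFError, OSError):
        return False, total
    return True, total
```

Calling check_bz2_stream("enwiki-20100130-pages-meta-history.xml.bz2") on a local copy would report whether the whole stream decoded and how large the decompressed data is. It is single-threaded, so on a file this size it would take on the order of the decompression time reported above.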
> As was mentioned previously, it would be great if you could compress
> future archives using pbzip2 to allow for parallel decompression. As I
> understand it, the pbzip2 files are backward compatible with all
> existing bzip2 utilities.
>
Yes, they are :-).
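A rough illustration of why the compatibility holds, assuming (as pbzip2 is generally understood to work) that a parallel compressor writes each input block as an independent bzip2 stream and concatenates the results. The helper name and chunk size below are hypothetical; only the standard Python bz2 module is used, not pbzip2 itself.

```python
import bz2


def compress_as_independent_streams(data, chunk_size):
    """Compress each chunk of `data` as its own complete bzip2 stream and
    concatenate them. Each stream could be produced (or decompressed) by a
    separate worker, which is what makes parallel processing possible.
    Hypothetical helper for illustration only.
    """
    return b"".join(
        bz2.compress(data[start:start + chunk_size])
        for start in range(0, len(data), chunk_size)
    )


if __name__ == "__main__":
    original = b"<revision>some wiki text</revision>\n" * 50_000
    multi_stream = compress_as_independent_streams(original, 100_000)
    # A standard multi-stream-aware bzip2 decoder reads the streams back to
    # back and returns the original data unchanged.
    assert bz2.decompress(multi_stream) == original
```

The trade-off of this layout is a slightly larger file, since every stream carries its own header and trailer, in exchange for blocks that can be handled independently.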
Regards,
F.
> Thanks again for all your work on this!
> Kevin
>
>
> On Tue, Mar 16, 2010 at 4:05 PM, Tomasz Finc <tfinc at wikimedia.org> wrote:
> > Tomasz Finc wrote:
> >> New full history en wiki snapshot is hot off the presses!
> >>
> >> It's currently being checksummed, which will take a while for 280GB+ of
> >> compressed data, but for those brave souls willing to test, please grab it
> >> from
> >>
> >> http://download.wikipedia.org/enwiki/20100130/enwiki-20100130-pages-meta-history.xml.bz2
> >>
> >> and give us feedback about its quality. This run took just over a month
> >> and gained a huge speed-up after Tim's work on re-compressing ES. If we
> >> see no hiccups with this data snapshot, I'll start mirroring it to other
> >> locations (Internet Archive, Amazon Public Data Sets, etc.).
> >>
> >> For those not familiar, the last successful run that we've seen of this
> >> data goes all the way back to 2008-10-03. That's over 1.5 years of
> >> people waiting to get access to these data bits.
> >>
> >> I'm excited to say that we seem to have it :)
> >
> > So now that we've had it for a couple of days .. can I get a status
> > report from someone about its quality?
> >
> > Even if you had no issues, please let us know so that we can start
> > mirroring.
> >
> > --tomasz
> >
> > _______________________________________________
> > Xmldatadumps-admin-l mailing list
> > Xmldatadumps-admin-l at lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-admin-l
> >
>