In general we don't recombine the pieces; it is extremely easy for the end user to do so if a single file is really needed. I probably have a shell (bash) script around here that would do it. But people have expressed a preference for more, smaller files, either so that they can process a piece that contains the pages they like, or so that they can process the data in parallel.
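A minimal sketch of what such a recombination could look like, in Python rather than bash. It assumes each piece is a complete XML document of its own (an opening `<mediawiki ...>` line, a `<siteinfo>` block, then `<page>` elements, then a closing `</mediawiki>` line on its own line) and that the pieces are bz2-compressed; the function names are made up for illustration and this is not the script mentioned above.

```python
import bz2
from typing import Iterable, Iterator


def recombine(pieces: Iterable[Iterable[str]]) -> Iterator[str]:
    """Yield the lines of one combined dump from several pieces.

    Assumption: every piece carries its own header (everything before
    the first <page> line) and its own </mediawiki> footer line.
    """
    first = True
    for piece in pieces:
        # Keep the header of the first piece only.
        in_header = not first
        for line in piece:
            stripped = line.strip()
            if stripped == "</mediawiki>":
                continue  # per-piece footer; a single one is emitted at the end
            if in_header:
                if stripped.startswith("<page>"):
                    in_header = False
                else:
                    continue  # skip the repeated <mediawiki>/<siteinfo> header
            yield line
        first = False
    yield "</mediawiki>\n"


def read_piece(path: str) -> Iterator[str]:
    """Stream the lines of one bz2-compressed piece as text."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        yield from handle


def recombine_files(paths: Iterable[str], out_path: str) -> None:
    """Write the pieces at `paths`, in the given order, to one bz2 file."""
    with bz2.open(out_path, "wt", encoding="utf-8") as out:
        out.writelines(recombine(read_piece(p) for p in paths))
```

The pieces have to be passed in page-id order, and everything is streamed line by line, so memory use stays small even though the run itself will take a long time on the full history.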
Which brings up a point: a few months back I mentioned that I'd like to produce a large number (~125) of small files for the en wikipedia history dumps, rather than the 30 larger ones we produce now. These files would have the first and last page id of their contents embedded in the filename. Once again I would not plan to recombine these files; it adds extra days to the run after the data has already been made available for download. I'd like people's comments on this.
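To illustrate how page ids in the filename could be used, here is a rough sketch that picks out the piece(s) covering a given page id. The naming pattern (`-p<first>p<last>.` near the end of the name) and the example filenames are purely hypothetical; the actual naming scheme has not been decided.

```python
import re
from typing import Iterable, List, Optional, Tuple

# Hypothetical pattern: "...xml-p000000010p000002289.bz2"
_RANGE = re.compile(r"-p(\d+)p(\d+)\.")


def page_range(filename: str) -> Optional[Tuple[int, int]]:
    """Return (first_page_id, last_page_id) parsed from the filename, or None."""
    match = _RANGE.search(filename)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def files_containing(filenames: Iterable[str], page_id: int) -> List[str]:
    """Return the filenames whose embedded page-id range includes page_id."""
    hits = []
    for name in filenames:
        bounds = page_range(name)
        if bounds is not None and bounds[0] <= page_id <= bounds[1]:
            hits.append(name)
    return hits


if __name__ == "__main__":
    # Made-up filenames, for illustration only.
    names = [
        "enwiki-20110722-pages-meta-history1.xml-p000000010p000002289.bz2",
        "enwiki-20110722-pages-meta-history2.xml-p000002290p000005000.bz2",
    ]
    print(files_containing(names, 3000))
```

With ranges in the names, someone interested in a handful of pages would only need to download and decompress the matching pieces.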
Ariel
On Wed, 03-08-2011, at 18:12 +0200, Oliver Ferschke wrote:
Glad I could help. And YES, we would love to have volunteer contributions to the JWPL documentation. Any help is greatly appreciated. We also try to improve the documentation, but there is not always the time.
Thanks, Oliver
-----Original Message----- From: xmldatadumps-l-bounces@lists.wikimedia.org [mailto:xmldatadumps-l-bounces@lists.wikimedia.org] On Behalf Of Napolitano, Diane Sent: Wednesday, 3 August 2011 17:58 To: xmldatadumps-l@lists.wikimedia.org Subject: Re: [Xmldatadumps-l] 7/22 enwiki dump pages-meta-history
Hi Oliver, thanks for your response. That answers my question and in that case, the 27 individual files (!) will work just fine.
On a side note, would you welcome any volunteer effort for documentation contributions to JWPL? ;)
Thanks, Diane
-----Original Message----- From: xmldatadumps-l-bounces@lists.wikimedia.org [mailto:xmldatadumps-l-bounces@lists.wikimedia.org] On Behalf Of Oliver Ferschke Sent: Wednesday, August 03, 2011 11:56 AM To: xmldatadumps-l@lists.wikimedia.org Subject: Re: [Xmldatadumps-l] 7/22 enwiki dump pages-meta-history
Dear Diane, I cannot give you an answer on your original question, but maybe I can still help. For what exactly do you need the data?
For the JWPL DataMachine, you won't need the pages-meta-history files - only meta-current, which is available as a single file. For the RevisionMachine, you can define multiple input files. Consequently, there is no problem using the archives without recombining them.
Only if you want to recreate an old/historic dump (or a series of old dumps) from the current history dump using the TimeMachine will you need the pages-meta-history files recombined. Is this the case?
Best, Oliver
--
Oliver Ferschke, M.A. Doctoral Researcher Ubiquitous Knowledge Processing Lab FB 20 Computer Science Department Technische Universität Darmstadt Hochschulstr. 10, D-64289 Darmstadt, Germany phone [+49] (0)6151 16-6227, fax -5455, room S2/02/B111 ferschke@tk.informatik.tu-darmstadt.de www.ukp.tu-darmstadt.de Web Research at TU Darmstadt (WeRC) www.werc.tu-darmstadt.de
-----Original Message----- From: xmldatadumps-l-bounces@lists.wikimedia.org [mailto:xmldatadumps-l-bounces@lists.wikimedia.org] On Behalf Of Napolitano, Diane Sent: Wednesday, 3 August 2011 17:36 To: xmldatadumps-l@lists.wikimedia.org Subject: [Xmldatadumps-l] 7/22 enwiki dump pages-meta-history
Hello, are there any plans to combine all of the pages-meta-history XML dumps from the 7/22 dump into one file? This is useful for importing into JWPL.
Thanks,
Diane M. Napolitano Associate Research Engineer Educational Testing Service Turnbull Hall R-239 Princeton, New Jersey 08540
Xmldatadumps-l mailing list Xmldatadumps-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l