Hi,
don't know if this issue came up already - in case it did and has been
dismissed, I beg your pardon. In case it didn't...
I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used
to compress the XML dumps instead of bzip2. Why? Because its sibling
(pbunzip2) has a bug that bunzip2 doesn't have. :-)
Strange? Read on.
A few hours ago, I filed a bug report for pbzip2 (see
https://bugs.launchpad.net/pbzip2/+bug/922804) together with some test
results obtained a few hours before that.
The results indicate that:
bzip2 and pbzip2 are compatible in both directions: each one can create
archives that the other one can read. But when it comes to decompressing,
only pbzip2-compressed archives are handled well by pbunzip2.
I propose compressing the archives with pbzip2 for the following
reasons:
1) If your archiving machines are SMP systems, this could lead to better
usage of system resources (i.e. faster compression).
2) Compression with pbzip2 is harmless for regular users of bunzip2, so
everything should run as usual for these people (see the sketch after
this list).
3) pbzip2-compressed archives can be decompressed with pbunzip2 with a
speedup that scales nearly linearly with the number of CPUs in the host.
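To illustrate point 2 from the consumer side, here is a minimal sketch,
assuming a pbzip2-compressed dump named pages-articles.xml.bz2 (the
filename is only an example). pbzip2 writes a series of concatenated
bzip2 streams, and a standard decompressor such as Python's bz2 module
reads them transparently, just like plain bunzip2 does:

    # Minimal sketch: read a pbzip2-compressed dump with a standard
    # bzip2 decompressor. The filename is an example, not a real dump name.
    # Python's bz2 module (3.3+) handles the multi-stream files that
    # pbzip2 produces, just like plain bunzip2 does.
    import bz2

    with bz2.open("pages-articles.xml.bz2", "rt", encoding="utf-8") as dump:
        for line in dump:
            if "<title>" in line:
                print(line.strip())
                break  # only a demo, stop at the first title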
So to sum up: it's a no-lose, two-win situation if you migrate to
pbzip2. And that just because pbunzip2 is slightly buggy. Isn't that
interesting? :-)
cheers,
--
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com
Human Language Technology Experts
69216618 Mind Units
Managing Director: Richard Jelinek
Registered office: Fürth
Commercial register: AG Fürth, HRB-9201
We were shuffling db passwords around this morning, and that change has
not made it out completely to the dump hosts. I'm taking the
opportunity to clean up the way that configuration is handled in the
scripts and we'll be back in business a little bit later today.
Ariel
On Sep 23, 2013 9:25 AM, "Mihai Chintoanu" <mihai.chintoanu(a)skobbler.com>
wrote:
> I have a list of about 1.8 million images which I have to download from
> commons.wikimedia.org. Is there any simple way to do this which doesn't
> involve an individual HTTP hit for each image?
You mean full size originals, not thumbs scaled to a certain size, right?
You should rsync from a mirror[0] (rsync allows specifying a list of files
to copy) and then fill in the missing images from upload.wikimedia.org;
for upload.wikimedia.org I'd say you should throttle yourself to 1 cache
miss per second (you can check the headers on a response to see if it was
a hit or a miss, and back off when you get a miss), and you shouldn't use more than
one or two simultaneous HTTP connections. In any case, make sure you have
an accurate UA string with contact info (email address) so ops can contact
you if there's an issue.
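A minimal sketch of the fill-in step, under some assumptions of mine: the
requests library, the X-Cache response header as the hit/miss indicator,
and arbitrary backoff times. (For the mirror step itself, rsync's
--files-from option takes the list of files.)

    # Sketch: fetch missing originals one at a time from upload.wikimedia.org,
    # backing off when the response looks like a cache miss. The X-Cache header
    # check and the sleep times are assumptions for illustration.
    import time
    import requests  # third-party: pip install requests

    HEADERS = {
        # Accurate UA string with contact info so ops can reach you.
        "User-Agent": "example-image-fetcher/0.1 (mailto:you@example.org)",
    }

    def fetch(urls):
        with requests.Session() as session:
            session.headers.update(HEADERS)
            for url in urls:
                resp = session.get(url, timeout=60)
                resp.raise_for_status()
                yield url, resp.content
                x_cache = resp.headers.get("X-Cache", "")
                time.sleep(2.0 if "miss" in x_cache.lower() else 0.2)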
At the moment there's only one mirror and it's ~6-12 months out of date, so
there may be a substantial amount to fill in. And of course you should be
getting checksums from somewhere (the API?) and verifying them. If your
images are all missing from the mirror then it should take around 40 days
at 0.5 img/sec, but I guess you could probably do it in less than 10 days if
you have a fast enough pipe (it depends on whether you get a lot of misses or hits).
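For the checksum part, one possibility (an assumption on my part, not the
only way) is to ask the API for each file's SHA-1 via prop=imageinfo and
compare it against a hash of the local file, roughly like this:

    # Sketch: compare a downloaded original against the SHA-1 reported by
    # the Commons API. File names and error handling are illustrative only.
    import hashlib
    import requests  # third-party: pip install requests

    API = "https://commons.wikimedia.org/w/api.php"

    def api_sha1(filename):
        """SHA-1 (hex) that the API reports for File:<filename>."""
        params = {
            "action": "query",
            "titles": "File:" + filename,
            "prop": "imageinfo",
            "iiprop": "sha1",
            "format": "json",
        }
        resp = requests.get(API, params=params, timeout=60)
        resp.raise_for_status()
        page = next(iter(resp.json()["query"]["pages"].values()))
        return page["imageinfo"][0]["sha1"]

    def verify(local_path, filename):
        with open(local_path, "rb") as f:
            local_sha1 = hashlib.sha1(f.read()).hexdigest()
        return local_sha1 == api_sha1(filename)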
See also [1], but not all of that applies because upload.wikimedia.org isn't
MediaWiki, so e.g. there's no maxlag param.
-Jeremy
[0]
https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Media
[1] https://www.mediawiki.org/wiki/API:Etiquette
Dear all,
I'd like to extract information from the infobox and textual definition,
but I realize that some entries are not fully contained in the French
dump, compared with the interactive pages.
For instance, the entry "Los Angeles" (the city in California, not
the one in Chile) is incomplete.
Only the first two paragraphs are there. The infobox and the rest of the
page are missing.
I noticed a call to the "q" template, like this: {{q|Los Angeles}}, which
states that this entry is of good quality.
My questions are:
* do you think that it is a bug?
* is there a relation between the "q" call and the fact that the entry
is partial?
* where is it possible to get the infobox for Los Angeles?
Thanks in advance,
Gil Francopoulo
Tagmatica/Spotter/CNRS
Hi,
Are the values in the pr_id and log_id columns equivalent? I'm trying to
select all changes in edit protection status for Wikipedia articles, but
the page_restrictions table doesn't contain a timestamp, and the logging
table doesn't specify the kind of protection, so I'm trying to join them
somehow...
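For illustration, the kind of join I'm attempting looks roughly like this
(pymysql and the join condition pr_page = log_page are assumptions of mine;
that condition is exactly what I'm unsure about):

    # Rough illustration of the join I'm attempting (pymysql: pip install pymysql).
    # Connection details and the join condition (pr_page = log_page) are
    # placeholders/assumptions, not something I know to be correct.
    import pymysql

    conn = pymysql.connect(host="localhost", user="wiki", password="secret",
                           database="wikidb", charset="utf8mb4")
    try:
        with conn.cursor() as cur:
            cur.execute("""
                SELECT pr.pr_page, pr.pr_type, pr.pr_level, l.log_timestamp
                FROM page_restrictions AS pr
                JOIN logging AS l ON l.log_page = pr.pr_page
                WHERE l.log_type = 'protect'
            """)
            for row in cur.fetchall():
                print(row)
    finally:
        conn.close()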
Thanks!
--
Xavi
Hi,
after a month of work on my GSoC project, Incremental Dumps [1], I think I
now have something worth sharing and talking about, though it's still far
from complete.
What the code can do now is to read a pages-history XML dump and create the
various kinds of dumps (pages/stub, current/history) in the new format from
that.
It can then convert a dump in the new format back to XML.
The XML output is almost the same as existing XML dumps, but there are some
differences [2].
The new format also now has a detailed specification [3] (it describes the
current version; the format is still in flux and can change daily).
If you want, you can also try running the code. [4]
It's not production-quality yet (e.g. it doesn't report errors properly),
but it should work.
Compilation instructions are in the README file.
Any comments or questions are welcome.
Petr Onderka
User:Svick
[1]: http://www.mediawiki.org/wiki/User:Svick/Incremental_dumps
[2]:
http://www.mediawiki.org/wiki/User:Svick/Incremental_dumps/File_format/XML_…
[3]:
http://www.mediawiki.org/wiki/User:Svick/Incremental_dumps/File_format/Spec…
[4]: https://github.com/wikimedia/operations-dumps-incremental/tree/gsoc
Hi all,
Back in March, I remember there was some experimentation with releasing dumps in
TSV format. I downloaded a bunch of files and just recently imported the
pagelinks table without any problem. Are there any plans to continue
releasing TSV dumps?
Best,
--
Giovanni Luca Ciampaglia
Postdoctoral fellow
Center for Complex Networks and Systems Research
Indiana University
✎ 910 E 10th St ∙ Bloomington ∙ IN 47408
☞ http://cnets.indiana.edu/
✉ gciampag(a)indiana.edu
I need the Wikipedia dump from 2011-07-22 (from which DBpedia 3.7 was
extracted). It is no longer available from the official Wikipedia
dumps page. Can you please point me to a place to download it from?
Preferably a non-torrent version.
Thanks,
Mohamed
Hi, it seems like all dump processes failed and stopped sometime
yesterday. What happened, and is there any prognosis for when the
dumping will resume?
Regards,
- Byrial