I don't know if this issue has come up already - in case it did and has been
dismissed, I beg your pardon. In case it didn't...
I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used
to compress the XML dumps instead of bzip2. Why? Because its sibling
(pbunzip2) has a bug that bunzip2 doesn't have. :-)
Strange? Read on.
A few hours ago I filed a bug report for pbzip2 (see
https://bugs.launchpad.net/pbzip2/+bug/922804), together with some test
results done a few hours before that.
The results indicate that bzip2 and pbzip2 are mutually compatible: each
one can create archives the other can read. But when it comes to
decompressing, only pbzip2-compressed archives are a good match for
pbunzip2.
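Roughly, the kind of round trip the tests covered (a sketch only; the file
name is a placeholder):

  # compress the same file with both tools
  bzip2  -c sample.xml > sample-bzip2.xml.bz2
  pbzip2 -c sample.xml > sample-pbzip2.xml.bz2

  # classic bunzip2 happily reads both archives
  bunzip2 -tv sample-bzip2.xml.bz2
  bunzip2 -tv sample-pbzip2.xml.bz2

  # pbunzip2 also reads both, but only the pbzip2-made archive really
  # suits it (see the bug report above)
  pbunzip2 -c sample-bzip2.xml.bz2  > /dev/null
  pbunzip2 -c sample-pbzip2.xml.bz2 > /dev/null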
I propose compressing the archives with pbzip2 for the following reasons:
1) If your archiving machines are SMP systems, this could lead to
better use of system resources (i.e. faster compression).
2) Compression with pbzip2 is harmless for regular users of bunzip2,
so everything should run for these people as usual.
3) pbzip2-compressed archives can be uncompressed with pbunzip2 with a
speedup that scales nearly linearly with the number of CPUs in the system.
So to sum up: it's a no-lose, two-win situation if you migrate to
pbzip2. And that just because pbunzip2 is slightly buggy. Isn't that
ironic?
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com - Human Language Technology Experts
69216618 Mind Units
Geschäftsführer: Richard Jelinek, Sitz der Gesellschaft: Fürth
Registergericht: AG Fürth, HRB-9201
A discussion on the toolserver list brought up the question, once again,
of what would be needed to fork the projects. Because some data is
private we aren't going to be able to provide data for perfect copies,
but the content can be preserved. The question is how close we can get.
In particular I would like folks to think about how we can manage the
user account issue.
It would be very nice indeed if users could reclaim their accounts on a
copy of the project, and yet we cannot provide any outside project a
copy of the user table (which has email addresses and other useful bits
in it). And many many users don't give an email address anyways.
I'd like to hear proposals for how this could be handled. Wouldn't it
be awesome if this could be done today, and Wikipedia editors could have
editing privileges on copies of the project around the globe that
provided different experimental features? Assuming of course that there
were groups or organizations that wanted to run such copies of the projects.
I am planning to decompress an XML-formatted bzip2 file by downloading the
file using the Java URL class. Then I plan to decompress it on the fly using
Apache Ant, and then parse it and store it in a MySQL database. I am not sure
if there are better ways to do that. Also, is there a way to update my
database without having to go through the whole process every time?
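For what it's worth, here is a rough shell sketch of one existing route to
the same end; the URL, file names, database name and credentials are all
placeholders, and mwdumper is just one example of an importer that turns
the XML dump into SQL:

  # one-time, resumable download of the dump
  wget -c https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

  # mwdumper reads the .bz2 directly, converts pages to SQL and pipes it
  # into MySQL (the target database must already have the MediaWiki tables)
  java -jar mwdumper.jar --format=sql:1.5 enwiki-latest-pages-articles.xml.bz2 \
    | mysql -u wikiuser -p wikidb

That still reprocesses the whole dump each time, though, so it doesn't by
itself answer the incremental-update question.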
On Sun, Sep 9, 2012 at 6:34 PM, Roberto Flores <f.roberto.isc(a)gmail.com> wrote:
> I have developed an offline Wikipedia, Wikibooks, Wiktionary, etc. app for
> the iPhone, which does a somewhat decent job at interpreting the wiki
> markup into HTML.
> However, there are too many templates for me to program (not to mention,
> it's a moving target).
> Without converting these templates, many articles are simply unreadable and
Templates are dumped just like all other pages are. Have you found
them in the dumps? Which dump are you looking at right now?
> Could you please provide HTML dumps (I mean, with the templates
> pre-processed into HTML, everything else the same as now) every 3 or 4
> months?
A 3- or 4-month frequency seems unlikely to be useful to many people.
Otherwise no comment.
> Or alternatively, could you make the template API available so I could
> import it in my program?
How would this template API function? What does import mean?
Can anybody with access execute these 3 SQL queries and provide me with the
results?
-- 1
SELECT user_id, user_name, user_registration
FROM user
INNER JOIN logging ON log_user = user_id
WHERE LEFT(user_registration, 4) = 2012
  AND user_id NOT IN (SELECT ipb_user FROM ipblocks)
  AND log_type = 'newusers'
  AND log_action = 'create';

-- 2
SELECT page_id, page_title, page_namespace, page_is_redirect
FROM page;

-- 3
INSERT INTO u_hoo.dbq189
SELECT user_name
FROM user
INNER JOIN logging ON log_user = user_id
WHERE LEFT(user_registration, 4) = 2012
  AND user_id NOT IN (SELECT ipb_user FROM ipblocks)
  AND log_type = 'newusers'
  AND log_action = 'create';

SELECT rev_id, rev_page, rev_comment, rev_user_text, rev_user, rev_timestamp
FROM revision
INNER JOIN u_hoo.dbq189 ON rev_user_text = dbq189.user_name
WHERE rev_deleted = 0
  AND rev_user != 0;
I need it for a study by a friend of mine; I really appreciate your help.
I also made a ticket, but nobody has reacted so far and it's a bit urgent:
Hi dump producers,
I know there's more to the choice of compression format than the size of
the resulting dumps (e.g. time, memory, portability, existing code
investment), and I read that you looked at LZMA and found it to be of
insignificant benefit [1], but I noticed over at the Large Text
Compression Benchmark site that they use 7-zip in PPMd mode, so I did
some experiments myself.
The bzip2 dumps use 900k blocks, and according to the bzip2.org
implementation's manual that takes around 7600k of memory while compressing
and around 3700k while decompressing. Like LZMA, PPMd apparently uses the
same amount of memory for decompression as it used during compression,
so I recompressed the XML dump with various amounts of memory so you can
make your own comparisons.
Specifically, using 7zip 9.20 from Ubuntu Precise's p7zip-full, I ran:
for MEM in 3700k 7600k 16m 512m; do
    bzcat enwiki-20120802-pages-articles.xml.bz2 \
      | 7z a -si -m0=PPMd:mem=$MEM \
        enwiki-20120802-pages-articles.xml.PPMd-$MEM.7z
done
bzcat enwiki-20120802-pages-articles.xml.bz2 \
  | 7z a -si enwiki-20120802-pages-articles.xml.LZMA.7z
for the following resulting file sizes in bytes (% of .bz2 version):
original bz2: 9143865996
$MEM=3700k : 8648303296 (94.6%)
$MEM=7600k : 8043626528 (88.0%)
$MEM=16m : 7910637814 (86.5%) (the default for both PPMd & LZMA)
LZMA: 7705327210 (84.3%)
$MEM=512m : 7076755355 (77.4%)
I wasn't looking to compare running times, and absolute values wouldn't
compare to your servers anyway, but for what it's worth I noticed that LZMA
took over twice as long as any PPMd run. I was expecting PPMd to beat LZMA,
hence the several PPMd runs.
There's probably some value in experimenting with PPMd's "model order"
too, which I didn't try. Google "model order for PPMd" or see
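If anyone wants to try it, 7zip's PPMd codec takes the model order as an
"o" parameter alongside "mem"; for example (the output file name is just
illustrative):

  bzcat enwiki-20120802-pages-articles.xml.bz2 \
    | 7z a -si -m0=PPMd:mem=512m:o=16 \
      enwiki-20120802-pages-articles.xml.PPMd-512m-o16.7z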
As the dump servers only have to do it once to save that bandwidth for
every download from every mirror that month, perhaps it's worth giving
7zip more memory than bzip or even more than the default, although I
appreciate that you drive some users out of the market if compression
memory requirements equal decompression requirements and you start using
a few gig to compress. Also while you can (with a little effort) seek
around bz2s and extract individual blocks, PPMd's seekability isn't
something I've explored.
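For example, bzip2recover from the bzip2 package will split an archive into
one small .bz2 per 900k block, each of which decompresses on its own (shown
here on a placeholder file rather than the full dump):

  # writes one rec*-prefixed .bz2 file per block of the input
  bzip2recover sample.xml.bz2
  bzcat rec*sample.xml.bz2 | head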
Just some thoughts.
: "7-Zip's LZMA compression produces significantly smaller files for
the full-history dumps, but doesn't do better than bzip2 for our other
files." --- http://meta.wikimedia.org/wiki/Data_dumps