Xmldatadumps-l March 2013

xmldatadumps-l@lists.wikimedia.org

15 participants
15 discussions

pbzip2 proposal
by Richard Jelinek 16 Jan '16

Hi,

I don't know if this issue has come up already. If it has and was dismissed, I beg your pardon. If it hasn't, I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used to compress the XML dumps instead of bzip2.

Why? Because its sibling (pbunzip2) has a bug bunzip2 hasn't. :-) Strange? Read on.

A few hours ago, I filed a bug report for pbzip2 (see https://bugs.launchpad.net/pbzip2/+bug/922804), together with some test results obtained a few hours before that. The results indicate that bzip2 and pbzip2 are mutually compatible: each can create archives that the other can read. But when it comes to decompressing, only pbzip2-compressed archives are good for pbunzip2.

I propose compressing the archives with pbzip2 for the following reasons:

1) If your archiving machines are SMP systems, this could lead to better use of system resources (i.e. faster compression).
2) Compression with pbzip2 is harmless for regular users of bunzip2, so everything should run as usual for them.
3) pbzip2-compressed archives can be decompressed with pbunzip2 with a speedup that scales nearly linearly with the number of CPUs in the host.

So to sum up: it's a no-lose and two-win situation if you migrate to pbzip2. And that just because pbunzip2 is slightly buggy. Isn't that interesting? :-)

cheers,
--
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com
Human Language Technology Experts
69216618 Mind Units
Geschäftsführer: Richard Jelinek
Sitz der Gesellschaft: Fürth
Registergericht: AG Fürth, HRB-9201
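A minimal sketch of why point 2 above can hold, assuming (as far as I understand pbzip2) that its output is simply a series of independently compressed bzip2 streams written one after another. The function name `compress_parallel`, the chunk size, and the worker count are illustrative choices, not anything taken from pbzip2 itself; the sketch uses only Python's standard `bz2` module.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor


def compress_parallel(data: bytes, chunk_size: int = 900_000, workers: int = 4) -> bytes:
    """Compress `data` as several independent bzip2 streams, concatenated.

    Hypothetical helper: it mimics the multi-stream layout described above,
    it is not pbzip2's actual implementation.
    """
    # Split the input into fixed-size pieces.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    # Compress every piece on its own; bz2 releases the GIL while compressing,
    # so threads are enough to keep several cores busy.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        streams = list(pool.map(bz2.compress, chunks))
    # A multi-stream archive is just the streams back to back.
    return b"".join(streams)


if __name__ == "__main__":
    original = b"wiki dump line\n" * 300_000
    archive = compress_parallel(original)
    # The ordinary decompressor reads the concatenated streams in order.
    restored = bz2.decompress(archive)
    print(restored == original)
```

An ordinary decompressor that supports concatenated streams (Python's `bz2.decompress` does since 3.3) reads such a file like any other archive, while a parallel decompressor can hand each stream to a different CPU.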

6 11

[Fwd: Re: possible gsoc idea, comments?]
by Ariel T. Glenn 10 May '13

Ok, my 'reply all' is failing me in this mail user agent. Anyways, third time's a charm...

3 8

Is Chinese Variants dump available
by Jiang BIAN 16 Apr '13

Hi,

Chinese Wikipedia supports a few variants (zh-cn, zh-tw, zh-hk), and the same wikitext is rendered differently under each of them, e.g. "software" in zh-cn [1] and "software" in zh-tw [2]. But it seems that no HTML is included in the zhwiki dump files. Do you know where I can get the HTML version of articles on Chinese Wikipedia?

Thanks

[1] http://zh.wikipedia.org/zh-cn/%E8%BD%AF%E4%BB%B6
[2] http://zh.wikipedia.org/zh-tw/%E8%BD%AF%E4%BB%B6

--
Jiang BIAN

This email may be confidential or privileged. If you received this communication by mistake, please don't forward it to anyone else, please erase all copies and attachments, and please let me know that it went to the wrong person. Thanks.

4 9

Housekeeping categories?
by Robert Crowe 31 Mar '13

Is there any way to distinguish between categories like History, or Literature for example, and what I would think of as categories that are used for internal housekeeping like "Unprintworthy_redirects" or "Nonindexed_pages"? They're not hidden categories, but conceptually there is a clear difference between housekeeping categories and categories that define fields of knowledge. But is there anything in the tables that distinguishes them? Thanks, Robert

4 9

Missing data in
by Giovanni Luca Ciampaglia 27 Mar '13

Hi all,

I am wondering if the NOTE on the manual page of the redirect table [1] still applies and, if so, how much data is missing and what would be the best way to incorporate the pagelinks table, since it seems to lack the information about whether a link is a redirect or not.

[1] https://www.mediawiki.org/wiki/Manual:Redirect_table

Cheers

--
Giovanni Luca Ciampaglia
☞ http://www.inf.usi.ch/phd/ciampaglia/
✆ (812) 287-3471
✉ glciampagl(a)gmail.com

2 3

Processing french dump
by Benoit Lelong 27 Mar '13

Hi all,

I am currently planning to process the latest French dump. I would like to ask if somebody has already found or used a good OpenNLP French sentence detection model. If so, please let me know where to find one.

Thanks in advance,
Best regards,
Benoit.

3 2

possible gsoc idea, comments?
by Ariel T. Glenn 26 Mar '13

So I was thinking about things I can't undertake, and one of those things is the 'dumps 2.0' which has been rolling around in the back of my mind. The TL;DR version is: sparse compressed archive format that allows folks to add/subtract changes to it random-access (including during generation). See here: https://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects#XML_du… What do folks think? Workable? Nuts? Low priority? Interested? Ariel

2 2

Embedded malware in media
by Kevin Day 21 Mar '13

We've once again been notified that our mirror of the Wikimedia images is "hosting malware". A quick check suggests these are mostly newly uploaded PDFs with one or more exploits in them, but there are also a few other media types that seem to be similarly damaged.

I'm personally okay with ignoring it, since it's not hurting us any, but ideally I'd like to see things like this get removed. Many of the infected PDFs appear to be Arabic-language documents that would be of interest to people critical of their government, so the implications of what's going on here are probably bigger than just random viruses getting added to files.

I'm happy to scan everything again and post a list of things. I'm also willing to automate this if it would help (periodic scans and uploading a list of all questionable images to a wiki page somewhere?).

Anyone have any suggestions on what to do here?

-- Kevin

4 5

de wikipedia dumps in progress
by Ariel T. Glenn 20 Mar '13

Folks will have noticed that the de wikipedia dumps failed after getting about 2/3 of the way through the meta history dump step (100GB written). I'm in the process of setting up for the completion of that job. It will take a few days, and there won't be a progress report visible on the regular HTML page.

By the way, the cause of the breakage was that the database server at the other end of the connection went away in the middle of the run, and no new connection to a server could be obtained for long enough that the program ran out of retries and gave up.

Ariel

1 0

more import-related stuff
by Ariel T. Glenn 18 Mar '13

In my continued quest to Make Imports Suck Less (tm), I've written a little Perl script to shovel data from a tab-delimited escaped file to a fifo in pieces while forking off mysql to LOAD DATA INFILE from the fifo for each chunk. It's only been tested on Linux, specifically my laptop, but I did run it using current article content dumps and all the auxiliary tables for a wiki of a few hundred thousand articles, and it worked ok. You can find it in the xmlfileutils directory of my branch of the git dumps repo: https://gerrit.wikimedia.org/r/gitweb?p=operations/dumps.git;a=blob;f=xmlfi…

You'll notice that all my tools are Linux + MySQL, and that's because that is what I use. If folks want similar tools for other platforms they'll have to write them; I don't have the expertise for that.

Also, the docs on Meta about dumps have been reorganized and rewritten. They are neither error-free nor complete, but they should be in much better shape now: http://meta.wikimedia.org/wiki/Data_dumps

And lastly, the uncompressed en wp meta history dumps are now over 10T. Yay?

As always, feedback, edits, patches welcome.

Ariel

P.S. Sorry Platonides, but if you were going to rework a script of yours you were too slow ;-) (However, if you have such a script with different/better features I'll still take it.)
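For readers who want a feel for the chunk-through-a-fifo approach described above, here is a rough Python sketch. It is not the Perl script from the repo: the names `load_in_chunks` and `mysql_command`, the database `wikidb`, the table `page`, and the chunk size are all assumptions made for illustration. Like the original, it assumes Linux (it relies on `os.mkfifo`), and it has no timeout, so a consumer that never opens the fifo would leave the writer blocked.

```python
import itertools
import os
import subprocess
import tempfile


def load_in_chunks(tsv_path, make_command, lines_per_chunk=100_000):
    """Feed a tab-delimited file to one consumer process per chunk via a fifo.

    `make_command(fifo_path)` must return the argv list of a process that
    reads the fifo until EOF. Returns the number of chunks sent.
    """
    fifo_dir = tempfile.mkdtemp()
    fifo = os.path.join(fifo_dir, "chunk.fifo")
    os.mkfifo(fifo)
    chunks = 0
    try:
        with open(tsv_path, "rb") as src:
            while True:
                lines = list(itertools.islice(src, lines_per_chunk))
                if not lines:
                    break
                # Start the reader first, then open the fifo for writing;
                # the open blocks until the reader has opened its end.
                consumer = subprocess.Popen(make_command(fifo))
                with open(fifo, "wb") as pipe:
                    pipe.writelines(lines)
                # Closing the fifo signals EOF; wait for this chunk to finish.
                if consumer.wait() != 0:
                    raise RuntimeError(f"consumer failed on chunk {chunks + 1}")
                chunks += 1
    finally:
        os.remove(fifo)
        os.rmdir(fifo_dir)
    return chunks


def mysql_command(fifo, database="wikidb", table="page"):
    """Hypothetical consumer: database and table names are placeholders."""
    sql = f"LOAD DATA INFILE '{fifo}' INTO TABLE {table}"
    return ["mysql", database, "-e", sql]
```

A call such as `load_in_chunks("page.tsv", mysql_command)` (with a placeholder file name) would then run one mysql process per chunk. Loading in pieces keeps each LOAD DATA statement, and so each transaction, to a bounded size, which seems to be the point of chunking in the first place.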

1 0
