I don't know if this issue came up already - in case it did and has been
dismissed, I beg your pardon. In case it didn't...
I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used
to compress the XML dumps instead of bzip2. Why? Because its sibling
(pbunzip2) has a bug that bunzip2 doesn't have. :-)
Strange? Read on.
A few hours ago, I filed a bug report for pbzip2 (see
https://bugs.launchpad.net/pbzip2/+bug/922804) together with some test
results obtained a few hours before that.
The results indicate that:
bzip2 and pbzip2 are mutually compatible: each one can create
archives the other one can read. But when it comes to decompressing,
only pbzip2-compressed archives are good for pbunzip2.
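For anyone who wants to verify these compatibility claims locally, here
is a minimal round-trip check sketched in Python. Purely illustrative:
it assumes bzip2, pbzip2, bunzip2 and pbunzip2 are all on the PATH, and
'sample.xml' is a hypothetical test file, not part of the dumps.

    import shutil
    import subprocess

    SAMPLE = "sample.xml"  # hypothetical test file; any largish file will do

    # Compress a copy of the sample with each tool, then try to read the
    # result back with both the single-threaded and the parallel decompressor.
    for compressor in ("bzip2", "pbzip2"):
        name = compressor + "-test.xml"
        shutil.copy(SAMPLE, name)
        subprocess.run([compressor, "-f", name], check=True)  # leaves name + ".bz2"
        for decompressor in ("bunzip2", "pbunzip2"):
            # Decompress to stdout and throw the output away; the exit code
            # tells us whether the archive was readable.
            result = subprocess.run([decompressor, "-c", name + ".bz2"],
                                    stdout=subprocess.DEVNULL)
            print("%s -> %s: %s" % (compressor, decompressor,
                                    "ok" if result.returncode == 0 else "FAILED"))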
I propose compressing the archives with pbzip2 for the following reasons:
1) If your archiving machines are SMP systems, this could lead to
better usage of system resources (i.e. faster compression).
2) Compression with pbzip2 is harmless for regular users of bunzip2,
so everything should run for these people as usual.
3) pbzip2-compressed archives can be decompressed with pbunzip2 with a
speedup that scales nearly linearly with the number of CPUs in the
machine.
So to sum up: it's a no-lose, two-win situation if you migrate to
pbzip2. And that just because pbunzip2 is slightly buggy. Isn't that
ironic?
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com            Managing director: Richard Jelinek
Human Language Technology Experts         Registered office: Fürth
69216618 Mind Units                       Register court: AG Fürth, HRB-9201
Is there any way to distinguish between categories like History or
Literature, for example, and what I would think of as categories that are
used for internal housekeeping, like "Unprintworthy_redirects" or
"Nonindexed_pages"? They're not hidden categories, but conceptually there
is a clear difference between housekeeping categories and categories that
define fields of knowledge. But is there anything in the tables that
would let me tell them apart?
I am currently planning to process the latest French dump. I would like
to ask whether somebody has already found or used a good OpenNLP French
sentence detection model. If yes, please let me know where to find one.
Thanks in advance,
I have the files from the February run for en wikipedia converted here:
In the sqlfiles directory are the page, revision and text tables in SQL
format for MediaWiki 1.20, and in the tabfiles directory are all the
tables needed for a mirror (I omitted images, oldimages and a couple
others) converted to tab delimited format for use with LOAD DATA INFILE.
The contents may be garbage etc. etc. so be forewarned. Please check
them out and let me know how they are.
While I'll leave the files there for a while, they won't be there
forever, so don't be surprised if they disappear in the future.
Missing is a script to write the tab delimited files to a fifo in
reasonably sized chunks. I am told that Percona has something like this,
if it turns out to make a difference in import speed/memory.
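In the meantime, here is a rough sketch of what such a script might look
like in Python. This is not the Percona tool, just an illustration: the
chunk size and the file/fifo paths are made up, and the fifo is assumed
to already exist (e.g. created with mkfifo).

    import sys

    CHUNK_LINES = 100000  # arbitrary; tune for your memory/latency trade-off

    def feed_fifo(tabfile_path, fifo_path):
        # Copy a tab delimited dump file into a named pipe in bounded chunks,
        # so the MySQL side (LOAD DATA INFILE reading from the fifo) never
        # needs the whole table buffered at once.
        with open(tabfile_path, "rb") as src, open(fifo_path, "wb") as fifo:
            chunk = []
            for line in src:
                chunk.append(line)
                if len(chunk) >= CHUNK_LINES:
                    fifo.writelines(chunk)
                    fifo.flush()
                    chunk = []
            if chunk:
                fifo.writelines(chunk)

    if __name__ == "__main__":
        # usage: python feed_fifo.py page.tab /tmp/page.fifo
        feed_fifo(sys.argv[1], sys.argv[2])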
Don't forget to make sure your client and server character sets are set
up correctly, and that you've disabled foreign key checks etc. etc. before
attempting to shovel the data in.
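To make that setup concrete, a sketch of the session settings using
PyMySQL. The host, credentials, database, table and file names here are
placeholders, and whether you want utf8mb4 or binary as the connection
character set depends on how your tables were created.

    import pymysql

    # The server must also allow LOAD DATA LOCAL INFILE (local_infile=1).
    conn = pymysql.connect(host="localhost", user="wikiadmin", password="secret",
                           database="wikimirror", local_infile=True)
    with conn.cursor() as cur:
        # Match client and server character sets, and switch off the checks
        # that slow a bulk import to a crawl.
        cur.execute("SET NAMES utf8mb4")   # or 'binary', depending on your schema
        cur.execute("SET foreign_key_checks = 0")
        cur.execute("SET unique_checks = 0")
        cur.execute("SET autocommit = 0")
        cur.execute("LOAD DATA LOCAL INFILE 'page.tab' INTO TABLE page")
        conn.commit()
    conn.close()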
Forwarding in case there are folks on this list interested. What I
really want is something sourceforge-like: it knows which mirror has a
copy of the file and has the most bandwidth available for the user.
So partly due to recent work by folks like Kent on creating local WP
mirrors using the import process, and partly from helping walk someone
through the process for the zillionth time, I have come to the
realization that This Process Sucks (TM). I am not taking on the whole
stack, but I am trying to make a dent in at least part of it. To that
end:
1) mwdumper available from download.wikimedia.org is now the current
version and should run without a bunch of fancy tricks. Thanks to Chad
for fixing up the Jenkins build. I tried it on a recent en wikipedia
current pages dump and it seemed to work, though I did not test
importing the output.
2) I have a couple of tools for *nix users importing into a MySQL
database:
* Somewhat equivalent to mwdumper is 'mwxml2sql', a name chosen before I
saw that there was a long abandoned xml2sql tool available in the wild.
Input: stubs and page content XML files; output: SQL files for each of
the page, revision and text tables, reading 0.4 xsd through 0.7 and
writing MW 1.5 through 1.20 output, as specified by the user. Many specific
combinations are untested (e.g. I spent most work on 0.7 xsd to MW
1.20).
* Converting an SQL dump file to a tab delimited format suitable for
'LOAD DATA INFILE' is now possible via 'sql2txt' (also *nix platforms).
I tested these on a smallish non-Latin-character-set wiki dump; a test
on en wikipedia is in the works, but loading all those other tables, even
via LOAD DATA INFILE, takes some time.
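Not one of the tools above, but a sanity check I find handy: count pages
and revisions straight from a stubs file and compare the numbers with
SELECT COUNT(*) on the imported page and revision tables. A rough sketch
in Python (the dump file name is just an example):

    import bz2
    from xml.etree.ElementTree import iterparse

    DUMP = "enwiki-latest-stub-meta-history.xml.bz2"  # example file name

    pages = revisions = 0
    with bz2.open(DUMP, "rb") as stream:
        for event, elem in iterparse(stream, events=("end",)):
            tag = elem.tag.rsplit("}", 1)[-1]  # strip the export-format namespace
            if tag == "revision":
                revisions += 1
            elif tag == "page":
                pages += 1
                elem.clear()  # keep memory bounded on big dumps

    print("%d pages, %d revisions" % (pages, revisions))
    # compare with: SELECT COUNT(*) FROM page; SELECT COUNT(*) FROM revision;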
Link to source:
So what I would love from folks is:
Test, find bugs, ask for features, tell me where other pain points are
in the import process. If you find bugs/want features and write a
patch, and you have a Gerrit account, feel free to submit a changeset
right there and add me as a reviewer. If you have a patch and don't have
an account, get one :-)
Once I know these are actually useful, I will try to make a dent in the
pages on Meta and elsewhere that describe, sometimes referring to
information several years old, how to import the dumps. Ah yeah and I'll
put up static binaries for linux/freebsd then too.
Good morning folks,
People monitoring the dumps progress page will have noticed that dumps
from last night and this morning are broken due to a fatal error from
within MediaWiki. Dumps will be off-line until this is resolved,
hopefully late today.
Maybe you should use regular expressions that detect a long series of
numbers without spaces between them.
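Something along those lines in Python. Note that these garbled titles
are hex strings, so the pattern allows a-f as well as digits; the
minimum length of 20 characters is an arbitrary threshold.

    import re

    # An even-length run of nothing but lowercase hex digits, at least 10 bytes long.
    HEXISH = re.compile(r"^(?:[0-9a-f]{2}){10,}$")

    def looks_like_hex_title(title):
        return bool(HEXISH.match(title))

    print(looks_like_hex_title("4567797074e280934d6f726f63636f5f72656c6174696f6e73"))  # True
    print(looks_like_hex_title("Egypt–Morocco_relations"))                              # False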
On Monday, February 11, 2013, wrote:
> Today's Topics:
> 1. Weird page titles in page table (Robert Crowe)
> 2. Re: Weird page titles in page table (Ariel T. Glenn)
> Message: 1
> Date: Sun, 10 Feb 2013 14:08:56 -0800
> Subject: [Xmldatadumps-l] Weird page titles in page table
> Message-ID: <010301ce07db$39477660$abd66320$@com>
> Content-Type: text/plain; charset="us-ascii"
> I'm seeing rows in the page table that have weird titles, and I'd like to be
> able to identify and filter them out, but I don't see properties that seem
> to identify them. For example:
> page.page_id = 21441554
> page.page_title = 4567797074e280934d6f726f63636f5f72656c6174696f6e73
> What should I look for to identify pages like that?
> Message: 2
> Date: Mon, 11 Feb 2013 07:51:03 +0200
> Subject: Re: [Xmldatadumps-l] Weird page titles in page table
> Message-ID: <1360561863.18140.5.camel(a)trouble.localdomain>
> Content-Type: text/plain; charset="UTF-8"
> On Sun, 10-02-2013 at 14:08 -0800, Robert Crowe wrote:
> > I'm seeing rows in the page table that have weird titles, and I'd like to be
> > able to identify and filter them out, but I don't see properties that seem
> > to identify them. For example:
> > page.page_id = 21441554
> > page.page_title = 4567797074e280934d6f726f63636f5f72656c6174696f6e73
> > What should I look for to identify pages like that?
> Which dump is this from?
On 11/02/13 00:58, Robert Crowe wrote:
> Weird. Why is it that only some of the titles display as hex? I'm using phpMyAdmin, and the column is varbinary(255).
Maybe it's only doing so when it contains non-ASCII chars?
(in the case of “Egypt–Morocco_relations”, that would be the en-dash)
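And for what it's worth, that hex string from the earlier message
decodes straight back to the title; a one-liner in Python:

    # The page_title bytes rendered as hex by phpMyAdmin, decoded back to UTF-8.
    raw = "4567797074e280934d6f726f63636f5f72656c6174696f6e73"
    print(bytes.fromhex(raw).decode("utf-8"))  # Egypt–Morocco_relations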