I don't know if this issue has come up already; in case it did and was
dismissed, I beg your pardon. In case it didn't...
I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used
to compress the XML dumps instead of bzip2. Why? Because its sibling
(pbunzip2) has a bug that bunzip2 doesn't. :-)
Strange? Read on.
A few hours ago, I filed a bug report for pbzip2 (see
https://bugs.launchpad.net/pbzip2/+bug/922804), together with some test
results done a few hours before that.
The results indicate that bzip2 and pbzip2 are compatible in both
directions: each one can create archives the other can read. But when it
comes to decompressing, only pbzip2-compressed archives are good for
pbunzip2.
I propose compressing the archives with pbzip2 for the following reasons:
1) If your archiving machines are SMP systems, this could lead to
better usage of system resources (i.e., faster compression).
2) Compression with pbzip2 is harmless for regular users of bunzip2,
so everything should keep working for these people as usual (see the
short sketch below).
3) pbzip2-compressed archives can be decompressed with pbunzip2 with a
speedup that scales nearly linearly with the number of CPUs in the
machine.
So to sum up: it's a no-lose and two-win situation if you migrate to
pbzip2. And that just because pbunzip2 is slightly buggy. Isn't that ...
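In case it helps, here is a quick sanity check of point 2 (a minimal
sketch, nothing official): pbzip2 writes its output as multiple
concatenated bzip2 streams, which ordinary bzip2 tools read
transparently, and Python's stdlib bz2 module (3.3+) does too, so
downstream consumers of the dumps shouldn't notice the switch. The
dump filename below is just a placeholder.

    import bz2

    def count_lines(path):
        # Reads the archive the same way whether it was written by bzip2
        # or pbzip2 (multi-stream .bz2 files are handled transparently).
        with bz2.open(path, "rt", encoding="utf-8") as f:
            return sum(1 for _ in f)

    # count_lines("enwiki-latest-pages-articles.xml.bz2")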
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com - Human Language Technology Experts
69216618 Mind Units
Geschäftsführer: Richard Jelinek | Sitz der Gesellschaft: Fürth
Registergericht: AG Fürth, HRB-9201
Hi everybody,
I'm French; please excuse my English.
Please correct me if this is not the right place for my help message.
I have a problem with my French Wikipedia XML dump: I'm having a
problem with accented characters.
When I installed MediaWiki, I chose InnoDB. This is my MySQL configuration:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 179
Server version: 5.5.8-log MySQL Community Server (GPL)
mysql > status
c:/wamp/bin/mysql/mysql5.5.8/bin/mysql.exe Ver 14.14 Distrib 5.5.8, for Win32 (x86)
Connection id: 179
Current user: root@localhost
SSL: Not in use
Using delimiter: ;
Server version: 5.5.8-log MySQL Community Server (GPL)
Protocol version: 10
Connection: localhost via TCP/IP
Server characterset: latin1
Db characterset: latin1
Client characterset: cp850
Conn. characterset: cp850
TCP port: 3306
Uptime: 3 hours 47 min 6 sec
Slow queries: 3
Flush tables: 1
Queries per second avg: 2.616
I'm using MWDumper; this is my command:
java -client -classpath %class% org.mediawiki.dumper.Dumper
I don't know the Java language, but with this the transfer to the SQL
database works; however, the accented characters are not right when I try
to retrieve articles. What can I do? Thanks a lot.
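One thing worth checking, given the latin1/cp850 character sets in the
status output above, is whether the UTF-8 text got double-encoded on its
way into the database. This is just a hypothetical check in Python, not
a MWDumper feature:

    def fix_mojibake(s):
        # If UTF-8 bytes were stored over a latin1 connection, reversing
        # that round trip often recovers the original text.
        try:
            return s.encode("latin-1").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            return s  # not double-encoded, or not repairable this way

    print(fix_mojibake("Ã©tÃ©"))  # -> "été"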
My name is Wyatt, and I would like to present to you the first draft of my
GSoC proposal, available here:
and on the official Melange. On Melange, should I clean out the
MediaWiki syntax and convert it to look nice in their formatting, or is
leaving it wiki-fied OK?
I am not particularly familiar with mailing lists and their specific
etiquette, so please correct me if I do anything too outrageous.
I look forward to your feedback, and hopefully working with you in the
future, whether I am accepted for GSoC or not!
Hello dumps users and developers,
You may have noticed that the Wikidata pages-logging XML dump step has
taken days for the last couple of runs. In fact, for the most recent
run it did not complete properly, as the database handling the query
was upgraded to MariaDB in the middle of the run.
So the short version is, if you are using that file, go get a new copy:
If I don't have a patch in by next run, I have a workaround I will run
by hand that takes 2 hours or less, as opposed to 4 days.
The long version is that the pages-logging file is already about half
the size of en wp's table, and that the number of edits per minute is
much larger, see:
There's a lot of deletion and a lot of churn too, due to the dispatch
process. Also, they apparently have RCPatrol enabled and a pile of bots,
which means that the log consists of 99% entries of the form 'bot X
editing Y marked it as patrolled'.
These things in combination turn out to be the perfect storm for my
simple select query, causing it to start at normal speed and then get
ever slower. I suppose in another couple of months it would take so long
to run it would never finish...
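For anyone curious about the general shape of such a workaround, here is
only an illustration of the batching idea, not the actual fix I'll run
by hand: walking the logging table in primary-key ranges keeps each
query cheap, instead of letting one big SELECT slow down as it goes. The
connection object is an assumption (any DB-API connection with
MySQL-style '%s' placeholders), and the column list is just a sample.

    def iter_logging_rows(conn, batch_size=50000):
        # conn: a DB-API connection to the wiki database (hypothetical).
        last_id = 0
        cur = conn.cursor()
        while True:
            cur.execute(
                "SELECT log_id, log_type, log_action, log_timestamp"
                " FROM logging WHERE log_id > %s"
                " ORDER BY log_id LIMIT %s",
                (last_id, batch_size),
            )
            rows = cur.fetchall()
            if not rows:
                break
            for row in rows:
                yield row
            last_id = rows[-1][0]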
Sorry for my English; I usually speak French.
I have a problem with encodings in the French dump. I use the MediaWiki
API to retrieve article text, but there are still sequences like %C3%A8,
\u00e8, and \ufffd in place of accented characters (é, è, ...).
How can I resolve this problem?
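For reference (with made-up sample strings), the first two forms above
are just different encodings of the same characters and can be decoded
like this in Python; only \ufffd is a real loss:

    from urllib.parse import unquote
    import json

    print(unquote("%C3%A8"))        # "è" (percent-encoded UTF-8, as in URLs)
    print(json.loads('"\\u00e8"'))  # "è" (\uXXXX escape in the API's JSON)
    # "\ufffd" is the Unicode replacement character: the original byte was
    # already replaced upstream and cannot be recovered from the string.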
I hope that I posted this in the right place.
Chinese Wikipedia supports a few variants (zh-cn, zh-tw, zh-hk); the same
wikitext is rendered differently under these variants, e.g. "software" is
软件 in zh-cn and 軟體 in zh-tw.
But it seems no HTML is included in the zhwiki dump files.
Do you know where I can get the HTML version of articles on Chinese
Wikipedia?
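Until someone points to an HTML dump, one fallback is to fetch the
rendered HTML of individual pages per variant over the web. This is only
a sketch, assuming the per-variant URL prefixes on zh.wikipedia.org, and
it is not a dump-provided file:

    from urllib.parse import quote
    from urllib.request import Request, urlopen

    def fetch_variant(title, variant):
        # e.g. variant = "zh-cn" or "zh-tw"
        url = "https://zh.wikipedia.org/%s/%s" % (variant, quote(title))
        req = Request(url, headers={"User-Agent": "variant-html-demo/0.1"})
        with urlopen(req) as resp:
            return resp.read().decode("utf-8")

    # html_cn = fetch_variant("软件", "zh-cn")
    # html_tw = fetch_variant("软件", "zh-tw")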
Hi, I've downloaded the latest set of Wikimedia dumps. I'm trying to
understand where to find images within these dumps. I've studied the
database schema and it seems to make sense, but then I take a single
example such as:
And I grep the 'image', 'imagelinks', and 'page' dumps looking for
'Carrizo_2a.JPG', and it's not found. I tried this on both the SQL and
XML dumps.
Are these dumps not complete? Am I misunderstanding the structure?
Thanks in advance,
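In case it's useful while debugging, here is a small sketch (not part of
the dump tooling) for stream-searching a gzipped SQL dump without
unpacking it. Two things to keep in mind: titles in the dumps use
underscores, and a file hosted on Commons appears in commonswiki's image
table rather than the local wiki's. The filename below is a placeholder.

    import gzip

    def title_in_dump(path, title):
        # Scan the compressed SQL dump for the raw title bytes.
        needle = title.encode("utf-8")
        with gzip.open(path, "rb") as f:
            for line in f:
                if needle in line:
                    return True
        return False

    # title_in_dump("enwiki-latest-imagelinks.sql.gz", "Carrizo_2a.JPG")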
I'm doing a research project on Wikipedia, so I need the Wikipedia data.
I decided to use the Wikipedia database dumps for this purpose, but there
are so many files there that I don't know which file populates which table.
Would you please provide some information about how the dump files map
to the exact DB tables?
Your prompt response is much appreciated.
Hello folks, it's time for more alpha code around making imports suck
less. The point of these tools, which augment the last ones published,
is to allow folks to generate SQL from a subset of page content, using
the SQL table dumps we provide and a downloaded (by one of these scripts
or some other means) XML file of page content for import.
I wanted a way to take importDump.php out of the loop, if the user finds
that the script is too slow, too picky, too whatever. So this is one of
those.
The idea here is to get people thinking about how we can make small (or
large) chunks of content more available to people. These are really
meant to be demos of an idea, with the hope that others (you!) will find
better ways to implement it, or even better ideas.
Even so, please play, test, report bugs, submit patches, write new
tools, etc. See the code below: