Hi,
for a few weeks now I have been trying to set up a local Wikipedia server. For research purposes we need to import the German Wikipedia with its complete history. I downloaded the compressed XML dump and used mwimport to convert the XML data into a file of SQL commands. The resulting file is about 500 MB and contains about 50 million lines of SQL.
I piped this file into the mysql client. The MySQL server runs on the same machine, which has 2 GB of memory, and I configured MySQL to use 1.5 GB as the InnoDB buffer pool (innodb_buffer_pool_size).
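(For concreteness, a sketch of what the import amounts to from the mysql client's point of view; "wikidb" and "dewiki.sql" are only placeholder names, the real ones differ:)

  USE wikidb;          -- placeholder database name
  SOURCE dewiki.sql;   -- placeholder name for the SQL file produced by mwimport

  -- the buffer setting mentioned above, for reference (it has to be set in my.cnf):
  SHOW VARIABLES LIKE 'innodb_buffer_pool_size';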
The import has now been running for two weeks and has loaded about 20 million of the 29 million article revisions. That seems extremely slow to me. I suspect I am doing something wrong, but I cannot find the mistake.
A simple count query takes between 3 and 7 hours:
mysql> select count(rev_id) from revision;
+---------------+
| count(rev_id) |
+---------------+
|      20923026 |
+---------------+
1 row in set (7 hours 5 min 53.40 sec)
How can that be? Any ideas how to improve the performance? Thanks a lot in advance!
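(In case an approximate figure is enough to track progress: InnoDB keeps a row estimate that comes back immediately, although it can be quite inaccurate. A sketch, assuming the standard table name and a placeholder database name:)

  SHOW TABLE STATUS LIKE 'revision';   -- the Rows column is only an estimate for InnoDB
  -- or, on MySQL 5.0 and later:
  SELECT table_rows FROM information_schema.tables
   WHERE table_schema = 'wikidb' AND table_name = 'revision';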
On 7/5/07, Christoph Litauer litauer@uni-koblenz.de wrote:
How can that be? Any ideas how to improve the performance? Thanks a lot in advance!
There is a nice tool called mwdumper, which can be found in the SVN repository; it is written in Java and imports pages much faster.

MinuteElectron.
Minute Electron wrote:
There is a nice tool called mwdumper, which can be found in the SVN repository; it is written in Java and imports pages much faster.
Thanks, but I have already looked at mwdumper: "Future versions of mwdumper will include support for creating a database and configuring a MediaWiki installation directly, but currently it just produces raw SQL which can be piped to MySQL." I have already produced raw SQL (using mwimport), so the XML-to-SQL conversion is not the bottleneck. As far as I can tell, mwdumper only improves that step, not the import of the data into the database.
Christoph Litauer wrote:
Thanks, but I have already looked at mwdumper: "Future versions of mwdumper will include support for creating a database and configuring a MediaWiki installation directly, but currently it just produces raw SQL which can be piped to MySQL."
Yes, you have to run tables.sql into your database as well. Painful, I know. ;)
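Roughly like this, as a sketch (the path assumes the mysql client is started from the MediaWiki directory, and "wikidb" stands in for whatever database you created):

  USE wikidb;
  SOURCE maintenance/tables.sql;   -- creates the empty MediaWiki schema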
I have already produced raw SQL (using mwimport), so the XML-to-SQL conversion is not the bottleneck. As far as I can tell, mwdumper only improves that step, not the import of the data into the database.
I don't know anything about this "mwimport" tool, but mwdumper uses batch inserts and the README includes a number of tips about speeding up the SQL end. You might want to check if you're doing this already.
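For example, the usual knobs for a bulk InnoDB load look roughly like this (a sketch, not necessarily verbatim from the README; "dewiki.sql" is a placeholder name):

  SET autocommit = 0;
  SET unique_checks = 0;
  SET foreign_key_checks = 0;
  SOURCE dewiki.sql;        -- or pipe the dump in, same effect
  COMMIT;
  SET unique_checks = 1;
  SET foreign_key_checks = 1;
  -- very large batched INSERTs may also need a bigger max_allowed_packet on the server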
-- brion vibber (brion @ wikimedia.org)
Brion Vibber wrote:
I don't know anything about this "mwimport" tool, but mwdumper uses batch inserts and the README includes a number of tips about speeding up the SQL end. You might want to check if you're doing this already.
He probably means this:
http://meta.wikimedia.org/wiki/Data_dumps/mwimport
It claims to be faster than mwdumper due to lower CPU usage during XML parsing. I suspect you could get the same speedup by putting "bfr" in the pipeline, since I very much doubt you'd max out the CPU while piping into MySQL, if the whole thing was properly multithreaded.
The problem in Christoph Litauer's case is most probably insufficient memory and disk resources, possibly coupled with a poorly tuned MySQL server. Fixing that is probably a better topic for a MySQL support query than a wikitech-l mailing list thread.
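For what it's worth, the usual suspects to check on the MySQL side would be these (a sketch; innodb_log_file_size can only be changed in my.cnf and requires rebuilding the log files):

  SHOW VARIABLES LIKE 'innodb_log_file_size';            -- a too-small redo log throttles bulk loads
  SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';  -- 0 or 2 is much faster than 1 during an import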
-- Tim Starling
Tim Starling wrote:
The problem in Christoph Litauer's case is most probably insufficient memory and disk resources, possibly coupled with a poorly tuned MySQL server. Fixing that is probably a better topic for a MySQL support query than a wikitech-l mailing list thread.
I totally agree! I had hoped to get responses like "same here" or "things run about 20 times faster for me" -- but it's true that I didn't actually ask for that. I could not find any indication of how fast such imports "normally" run, and therefore whether it is worth spending time optimizing my MySQL server. It seems it is, so I will look into that. Thank you all for the answers and hints.
I completely agree with Tim's answer. Proper hard disk resources (in terms of speed) are critical.
The MySQL INSERTs themselves do not weigh very heavily on the whole process. I usually run the final inserts into MySQL as a separate step, and in any case they take no more than 5-10% of the total processing time.
I believe the Perl parser is somewhat faster than mwdumper (and definitely faster than the research version of my new WikiXRay parser; I still have to benchmark them against the next standard version of my parser).
However, you should take into account that, depending on the language you are processing, certain revisions can be *very large*, and any parser, no matter how optimized or multithreaded, will inevitably spend considerable time on them (I mean, looking at the aggregate amount of work the parser has to get through).
It usually takes a week or so for me to load the whole dump of the English version back into MySQL, and that is with a big server: two dual-core 2 GHz Opterons, a lot of fast memory and a RAID 6 array of 8 fast SATA-II disks.
MySQL configuration will be critical later, when you start to "play with your data". For that I recommend www.mysqlperformanceblog.com; you will find very useful hints there.
Good luck.
Felipe.
Felipe Ortega wrote:
It usually takes a week or so for me to load the whole dump of the English version back into MySQL, and that is with a big server: two dual-core 2 GHz Opterons, a lot of fast memory and a RAID 6 array of 8 fast SATA-II disks.
Thanks a lot, Felipe. It seems my two weeks of import time are not unusual after all.