On Sat, Nov 21, 2009 at 9:48 PM, Jona Christopher Sahnwaldt <jcsahnwaldt@gmail.com> wrote:
The one thing that is slow is building the indexes after the data has been imported (eight hours or so). Maybe we could omit some indexes that are not used in our application, but I haven't really looked into that.
MyISAM should be able to build keys quite quickly; if all goes well, it can do so by sorting, which is commonly ten times as fast as rebuilding by keycache, or more. I don't know offhand how mwdumper works, but you might want to try increasing myisam_max_sort_file_size to something larger than all your indexes, if you have enough disk space. Or you could do something like this: import into a table with no keys at all, create a table with the same definition but with keys and no data, shut down MySQL, copy the new table's .frm and .MYI over the old table's, run myisamchk --sort-recover on the old table (which now has the data plus the copied .frm and .MYI), and restart MySQL. That's at your own risk, of course. :)
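For the first option, a rough sketch might look like the following. The sizes are only placeholders, and the DISABLE KEYS/ENABLE KEYS pair is an extra step I'd add so the non-unique keys get built in one sorted pass at the end rather than row by row:

-- Sketch only: let MyISAM build keys by filesort instead of through the key cache.
-- Pick sizes larger than your combined indexes; these numbers are made up.
SET GLOBAL myisam_max_sort_file_size = 200 * 1024 * 1024 * 1024;  -- allow a 200 GB temp sort file
SET GLOBAL myisam_sort_buffer_size   = 1024 * 1024 * 1024;        -- 1 GB sort buffer (new sessions)

ALTER TABLE page DISABLE KEYS;  -- skip non-unique key maintenance during the load
-- ... bulk import here (mwdumper output, LOAD DATA INFILE, etc.) ...
ALTER TABLE page ENABLE KEYS;   -- rebuilds the disabled keys, by sorting if the limits allow it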
On Sat, Nov 21, 2009 at 10:36 PM, Anthony <wikimail@inbox.org> wrote:
Quite a lot? The theoretical max would be about twice as fast, as (page_is_redirect, page_namespace, page_title) is going to take up at least half as much space as the whole page table.
Yes, that's true. Twice as fast is still a pretty good improvement, though. :)
I know it wouldn't be reached by PostgreSQL, which would still do a sequential scan through the table.
MySQL skips the table and just looks at the index, in both InnoDB and MyISAM.
mysql> EXPLAIN SELECT page_namespace, page_title FROM page WHERE page_is_redirect=0;
+----+-------------+-------+------+---------------+------+---------+------+-------+-------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows  | Extra       |
+----+-------------+-------+------+---------------+------+---------+------+-------+-------------+
|  1 | SIMPLE      | page  | ALL  | NULL          | NULL | NULL    | NULL | 43603 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+-------+-------------+
1 row in set (0.00 sec)

mysql> EXPLAIN SELECT page_namespace, page_title FROM page2 WHERE page_is_redirect=0;
+----+-------------+-------+------+------------------+------------------+---------+-------+-------+-------------+
| id | select_type | table | type | possible_keys    | key              | key_len | ref   | rows  | Extra       |
+----+-------------+-------+------+------------------+------------------+---------+-------+-------+-------------+
|  1 | SIMPLE      | page2 | ref  | page_is_redirect | page_is_redirect | 1       | const | 22048 | Using index |
+----+-------------+-------+------+------------------+------------------+---------+-------+-------+-------------+
1 row in set (0.00 sec)
Note "Using index" in the second query; it never looks at the table data, only the index. (Those queries are on a local copy of Simple, not enwiki, thus the small row counts.)
I'm actually in the process of trying to import enwiki into a PostgreSQL database right now. Attempt 1 was to import everything into a MySQL database (with no indexes), export it as a TSV file, then import from the TSV file into PostgreSQL. I hit a snag with some data that Postgres says isn't valid UTF-8, which is probably due to something I did wrong with the import, but I can't figure out what it is.
The most likely problem is that it's not actually valid UTF-8. There are some places where we end up truncating things bytewise rather than characterwise, like edit summaries, leaving only the first byte or two of a multibyte character. Last I checked, edit summaries were just passed to the database with no hard length check, so MySQL in non-strict mode with a binary schema (like Wikipedia) will just truncate them to fit. (In strict mode it will raise an error, and if we used utf8 rather than binary it would presumably truncate by characters.)
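To make the failure mode concrete, here's a tiny hypothetical demo (not the real schema, just the mechanism) of how byte-wise truncation into a binary column leaves invalid UTF-8 behind:

-- Hypothetical demo: non-strict truncation of a binary column can cut a
-- multi-byte UTF-8 character in half.
SET SESSION sql_mode = '';             -- non-strict: truncate with a warning instead of erroring
CREATE TEMPORARY TABLE trunc_demo (c VARBINARY(2));
INSERT INTO trunc_demo VALUES ('aé');  -- 3 bytes over a utf8 connection: 0x61 0xC3 0xA9
SELECT HEX(c) FROM trunc_demo;         -- 61C3: a dangling lead byte that PostgreSQL will reject

If that's what's in your dump, one blunt workaround is to run the TSV through iconv -f UTF-8 -t UTF-8 -c, which silently drops the invalid sequences before PostgreSQL ever sees them.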
On Sat, Nov 21, 2009 at 10:40 PM, Ryan Chan <ryanchan404@gmail.com> wrote:
Is the source available in SVN? Can you point me in the right direction?