On Sat, Nov 21, 2009 at 9:48 PM, Jona Christopher Sahnwaldt <jcsahnwaldt@gmail.com> wrote:
The one thing that is slow is building the indexes after the data has been imported (eight hours or so). Maybe we could omit some indexes that are not used in our application, but I haven't really looked into that.
MyISAM should be able to build keys quite quickly; if all goes well, it can do so by sorting, which is commonly ten times as fast as rebuilding by keycache, or more. I don't know offhand how mwdumper works, but you might want to try increasing myisam_max_sort_file_size to something larger than all your indexes, if you have enough disk space. Or you could do something like this: import into a table with no keys at all, create a table with the same definition but with keys and no data, shut down MySQL, copy the new table's .frm and .MYI over the old table's, run myisamchk --sort-recover on the old table (which now has the data plus the copied .frm and .MYI), and restart MySQL. That's at your own risk, of course. :)
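For the first option, a rough sketch might look like the following. The sizes are only placeholders, and the DISABLE KEYS/ENABLE KEYS pair is an extra step I'd add so the non-unique keys get built in one sorted pass at the end rather than row by row:

-- Sketch only: let MyISAM build keys by filesort instead of through the key cache.
-- Pick sizes larger than your combined indexes; these numbers are made up.
SET GLOBAL myisam_max_sort_file_size = 200 * 1024 * 1024 * 1024;  -- allow a 200 GB temp sort file
SET GLOBAL myisam_sort_buffer_size   = 1024 * 1024 * 1024;        -- 1 GB sort buffer (new sessions)

ALTER TABLE page DISABLE KEYS;  -- skip non-unique key maintenance during the load
-- ... bulk import here (mwdumper output, LOAD DATA INFILE, etc.) ...
ALTER TABLE page ENABLE KEYS;   -- rebuilds the disabled keys, by sorting if the limits allow it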
On Sat, Nov 21, 2009 at 10:36 PM, Anthony <wikimail@inbox.org> wrote:
Quite a lot? The theoretical max would be about twice as fast, as (page_is_redirect, page_namespace, page_title) is going to take up at least half as much space as the whole page table.
Yes, that's true. Twice as fast is still a pretty good improvement, though. :)
I know it wouldn't be reached by PostgreSQL, which would still do a sequential scan through the table.
MySQL skips the table and just looks at the index, in both InnoDB and MyISAM.
mysql> EXPLAIN SELECT page_namespace, page_title FROM page WHERE page_is_redirect=0;
+----+-------------+-------+------+---------------+------+---------+------+-------+-------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows  | Extra       |
+----+-------------+-------+------+---------------+------+---------+------+-------+-------------+
|  1 | SIMPLE      | page  | ALL  | NULL          | NULL | NULL    | NULL | 43603 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+-------+-------------+
1 row in set (0.00 sec)

mysql> EXPLAIN SELECT page_namespace, page_title FROM page2 WHERE page_is_redirect=0;
+----+-------------+-------+------+------------------+------------------+---------+-------+-------+-------------+
| id | select_type | table | type | possible_keys    | key              | key_len | ref   | rows  | Extra       |
+----+-------------+-------+------+------------------+------------------+---------+-------+-------+-------------+
|  1 | SIMPLE      | page2 | ref  | page_is_redirect | page_is_redirect | 1       | const | 22048 | Using index |
+----+-------------+-------+------+------------------+------------------+---------+-------+-------+-------------+
1 row in set (0.00 sec)
Note "Using index" in the second query; it never looks at the table data, only the index. (Those queries are on a local copy of Simple, not enwiki, thus the small row counts.)
I'm actually in the process of trying to import enwiki into a PostgreSQL database right now. Attempt 1 was to import everything into a MySQL database (with no indexes), export it as a TSV file, then import from the TSV file into PostgreSQL. I hit a snag with some data that Postgres says isn't valid UTF-8, which is probably due to something I did wrong with the import, but I can't figure out what it is.
The most likely problem is that it's not actually valid UTF-8. There are some places where we end up truncating things bytewise rather than characterwise, like edit summaries, leaving only the first byte or two of a multibyte character. Last I checked, edit summaries were just passed to the database with no hard length check, so MySQL in non-strict mode with a binary schema (like Wikipedia) will just truncate them to fit. (In strict mode it will raise an error, and if we used utf8 rather than binary it would presumably truncate by characters.)
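To make the failure mode concrete, here's a tiny hypothetical demo (not the real schema, just the mechanism) of how byte-wise truncation into a binary column leaves invalid UTF-8 behind:

-- Hypothetical demo: non-strict truncation of a binary column can cut a
-- multi-byte UTF-8 character in half.
SET SESSION sql_mode = '';             -- non-strict: truncate with a warning instead of erroring
CREATE TEMPORARY TABLE trunc_demo (c VARBINARY(2));
INSERT INTO trunc_demo VALUES ('aé');  -- 3 bytes over a utf8 connection: 0x61 0xC3 0xA9
SELECT HEX(c) FROM trunc_demo;         -- 61C3: a dangling lead byte that PostgreSQL will reject

If that's what's in your dump, one blunt workaround is to run the TSV through iconv -f UTF-8 -t UTF-8 -c, which silently drops the invalid sequences before PostgreSQL ever sees them.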
On Sat, Nov 21, 2009 at 10:40 PM, Ryan Chan <ryanchan404@gmail.com> wrote:
Is the source available in SVN? Can you point me in the right direction?