Hi!
I was traveling around a bit and missed some of the threads entirely!
Ryan writes:
> What I would like to ask is: why not use PostgreSQL?
Any reason we should?
> Seems MySQL is not suitable for handling large tables (e.g. over a
> few GB); I just wonder why Wikipedia doesn't use PostgreSQL?
Is PG more suitable? Last time I looked at it, both engines were using
B+-Trees.
> It should provide better performance.
Do you have any benchmarks on that?
Simetrical writes:
> Heck, Wikipedia hasn't even upgraded to MySQL 4.1, let alone a whole
> different DBMS.
We have had 5.1 servers running in production for quite a while (e.g.
dewiki's lomaria :-). We were running enwiki slaves on 5.0 a few times
too :) It is not like there are any showstoppers for migration at the
moment.
Antony writes:
> If it is, you'd probably want to use partitioning
Partitioning makes selects faster only when there's parallel execution
on multiple partitions at once. PG doesn't have that, MySQL doesn't
have that; some commercial PG derivatives (Greenplum?) have it.
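
To illustrate the point (a hypothetical sketch, table and columns made
up): 5.1's partitioning gives you pruning, i.e. a query touches fewer
partitions, but whatever partitions remain are still scanned one at a
time by a single thread:

  -- Hypothetical table, partitioned by year (MySQL 5.1 syntax):
  CREATE TABLE revision_archive (
      rev_id   INT UNSIGNED NOT NULL,
      rev_page INT UNSIGNED NOT NULL,
      rev_ts   DATETIME NOT NULL
  ) ENGINE=InnoDB
  PARTITION BY RANGE (YEAR(rev_ts)) (
      PARTITION p2007 VALUES LESS THAN (2008),
      PARTITION p2008 VALUES LESS THAN (2009),
      PARTITION pmax  VALUES LESS THAN MAXVALUE
  );

  -- EXPLAIN PARTITIONS shows which partitions the optimizer prunes;
  -- the surviving ones are read serially, not in parallel:
  EXPLAIN PARTITIONS
  SELECT * FROM revision_archive WHERE rev_ts >= '2008-01-01';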
Simetrical writes again:
> Let's not have a DBMS flame war here, please.
Oh come on, it has been a while. Nowadays we also need people from the
NoSQL camp, telling us we should migrate to ultimately scalable Erlang-
based key/value/document stores, with lots of JavaScript map/reduce.
Jona writes:
> The one thing that is slow is building the indexes after the data
> has been imported (eight hours or so).
People without enough RAM for efficient B-tree builds can use either
the InnoDB Plugin's fast index creation or Tokutek's fractal tree
storage (which is commercial software, but has a free license up to
50 GB, or for developers, IIRC).
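
The usual trick, roughly (a sketch; table, columns, and file path are
all hypothetical): create the table without secondary indexes,
bulk-load, then add the indexes and let the plugin do a sorted build:

  -- 1. Create the table with only its primary key, no secondary
  --    indexes:
  CREATE TABLE page_dump (
      page_id    INT UNSIGNED NOT NULL,
      page_title VARBINARY(255) NOT NULL,
      PRIMARY KEY (page_id)
  ) ENGINE=InnoDB;

  -- 2. Bulk-load the dump while no secondary indexes exist:
  LOAD DATA INFILE '/tmp/page.tsv' INTO TABLE page_dump;

  -- 3. Add secondary indexes afterwards; with the InnoDB Plugin this
  --    is a sorted build of just the index, not a rebuild of the
  --    whole table:
  ALTER TABLE page_dump ADD INDEX page_title_idx (page_title);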
Ryan asks:
> May I ask why you are still using the 4.0 version?
Because it does what we need it to do, is rock-solid, and is fast
enough. Also because someone was lazy with 5.1 build engineering, but
now there's a build nearly ready for production at lp:~wikimedia
> Seems 5.1 and above did provide many performance enhancements?
Yes, though some of them are the same ones we have had in our 4.0
builds for years, and others are ones we don't really need. We're
read-I/O-constrained, and we're doing quite well at that with our
current builds.
> I don't know if it was supposed to be taken as sarcasm or not, but
> Tim Starling recently commented that "it seems that I'm the only
> staff member who knows MySQL." That was a joke, right?
Tim is indeed the only one on the staff who knows really well how to
handle replicated MySQL setups, as well as other advanced MySQL
topics. Apparently it wasn't only WMF staff who were running
Wikipedia's databases.
Jona comments on utf8:
> Yes. The main problem with using UTF-8 for the tables is that
> MySQL only supports Unicode characters U+0000 .. U+FFFF. Other
> characters are silently removed, which leads to problems with
> duplicate page titles etc.
Actually the main problem with using utf8 is that most of the
language-specific collations are case-insensitive, which would mean
lots of pain in a case-sensitive -> case-insensitive transition (due
to how indexes work, it is relatively difficult to have an efficient
sorting order different from the equality rules).
And yes, characters outside the BMP can be an issue, but we would be
hitting that as a problem only in a few page titles.
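
For the record, a sketch of what Jona describes (hypothetical table;
the exact behavior depends somewhat on server version and SQL mode):
MySQL's utf8 character set stores at most three bytes per character,
so a four-byte supplementary-plane character doesn't survive:

  CREATE TABLE title_test (title VARCHAR(32) CHARACTER SET utf8);

  -- 0xF0908CB0 is the UTF-8 encoding of U+10330 (GOTHIC LETTER AHSA);
  -- it cannot be represented in MySQL's three-byte utf8, so the
  -- conversion drops it (with a warning at best):
  INSERT INTO title_test
  VALUES (CONCAT('x', CONVERT(0xF0908CB0 USING utf8), 'y'));

  SELECT title FROM title_test;  -- 'xy': the character is gone, so
                                 -- titles differing only by it collide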
Dmitry suggests:
> There was a message at the mysql.com site that the Google
> performance enhancements were incorporated into version 5.4.
Google performance enhancements were also incorporated into version
4.0.40. Not all of them, but most of the ones we'd need (the I/O-
related ones; with our datasets we're not really at the point where
we would care about SMP performance ;-)
BR,
Domas