Hi!
I was traveling around a bit and missed some of the threads entirely!
Ryan writes:
What I would like to ask is: why not use PostgreSQL?
Any reason we should?
It seems MySQL is not suitable for handling large tables (e.g. over a few GB); I just wonder why Wikipedia doesn't use PostgreSQL?
Is PG more suitable? Last time I looked at it, both engines were using B+-Trees.
It should provide better performance.
Do you have any benchmarks on that?
Simetrical writes:
Heck, Wikipedia hasn't even upgraded to MySQL 4.1, let alone a whole different DBMS.
We have had 5.1 servers running in production for quite a while (e.g. dewiki's lomaria :-). We were running enwiki slaves on 5.0 a few times too :) It is not like there are any showstoppers for migration at the moment.
Antony writes:
If it is, you'd probably want to use partitioning
Partitioning makes selects faster only when there's parallel execution on multiple partitions at once. PG doesn't have that, MySQL doesn't have that; some commercial PG offshoots (Greenplum?) have it.
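For reference, a rough sketch of what range partitioning looks like in MySQL 5.1 (table and column names are made up for illustration):

  -- hypothetical table, partitioned by rev_id ranges
  CREATE TABLE revision_archive (
    rev_id BIGINT UNSIGNED NOT NULL,
    rev_page INT UNSIGNED NOT NULL,
    rev_timestamp BINARY(14) NOT NULL,
    PRIMARY KEY (rev_id)
  ) ENGINE=InnoDB
  PARTITION BY RANGE (rev_id) (
    PARTITION p0 VALUES LESS THAN (100000000),
    PARTITION p1 VALUES LESS THAN (200000000),
    PARTITION pmax VALUES LESS THAN MAXVALUE
  );

  -- the optimizer can prune partitions for a query like
  --   SELECT ... WHERE rev_id BETWEEN 150000000 AND 160000000;
  -- but it still scans the matching partitions one after another,
  -- there is no parallel scan across partitions.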
Simetrical writes again:
Let's not have a DBMS flame war here, please.
Oh come on, it has been a while. Nowadays we also need people from the NoSQL camp, telling us we should migrate to ultimately scalable Erlang-based key/value/document storages, with lots of JavaScript map/reduce.
Jona writes:
The one thing that is slow is building the indexes after the data has been imported (eight hours or so).
People without enough RAM for efficient B-tree builds can use either the InnoDB Plugin's fast index creation or Tokutek's fractal tree storage (which is commercial software, but has a free license up to 50GB, or for developers, IIRC).
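For illustration, the usual load-first, index-later pattern that the InnoDB Plugin's fast index creation speeds up (table, column and file names are just placeholders):

  -- create the table with only the primary key, load the dump,
  -- then build secondary indexes afterwards; with the InnoDB Plugin
  -- ADD INDEX does a sorted build instead of row-by-row insertion
  CREATE TABLE page_import (
    page_id INT UNSIGNED NOT NULL,
    page_namespace INT NOT NULL,
    page_title VARBINARY(255) NOT NULL,
    PRIMARY KEY (page_id)
  ) ENGINE=InnoDB;

  LOAD DATA INFILE '/tmp/page_import.txt' INTO TABLE page_import;

  ALTER TABLE page_import ADD INDEX ns_title (page_namespace, page_title);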
Ryan asks:
May I ask why you are still using the 4.0 version?
Because it does what we need it to do, is rock-solid and fast enough. Also because someone was lazy with 5.1 build engineering, but now there's one nearly ready for production at lp:~wikimedia
Seems 5.1 and above did provide many performance enhancements?
Yes, some of them are the same ones we have had in our 4.0 builds for years; others are ones we don't really need. We're read-I/O-constrained, and we're doing quite well at that with our current builds.
I don't know if it was supposed to be taken as sarcasm or not, but Tim Starling recently commented that "it seems that I'm the only staff member who knows MySQL." That was a joke, right?
Tim is indeed the only one on the staff who knows really well how to handle replicated MySQL setups, as well as other advanced MySQL topics. Apparently it wasn't only WMF staff running Wikipedia's databases.
Jona comments on utf8:
Yes. The main problem with using UTF-8 for the tables is that MySQL only supports Unicode characters U+0000 .. U+FFFF. Other characters are silently removed, which leads to problems with duplicate page titles etc.
Actually the main problem with using utf8 is that most of the language-specific collations are case-insensitive, which would mean lots of pain with the case-sensitive -> case-insensitive transition (due to how indexes work, it is relatively difficult to have an efficient sorting order different from the equality rules). And yes, characters outside the BMP can be an issue, but we would be hitting that as a problem only in a few page titles.
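A quick sketch of the collation problem with a made-up table, assuming MySQL's utf8 charset and its default case-insensitive collation:

  CREATE TABLE title_test (
    title VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
    UNIQUE KEY (title)
  ) ENGINE=InnoDB;

  INSERT INTO title_test VALUES ('Foo');
  INSERT INTO title_test VALUES ('foo');
  -- the second insert fails with a duplicate-key error, because the
  -- case-insensitive collation treats 'Foo' and 'foo' as equal, while
  -- page titles are case sensitive past the first character; a binary
  -- collation keeps them distinct, but then ORDER BY follows raw code
  -- points rather than language-specific rules.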
Dmitry suggests:
There was a message on the mysql.com site that the Google performance enhancements were incorporated into version 5.4.
Google performance enhancements were also incorporated into version 4.0.40. Not all, but most of the ones we'd need (I/O related; we're not really at a point with our datasets where we would care about SMP performance ;-)
BR, Domas