Hi!
I was traveling around a bit and missed some of the threads entirely!
Ryan writes:
What I would like to ask is: why not use PostgreSQL?
Any reason we should?
It seems MySQL is not suitable for handling large tables (e.g. over a few GB); I just wonder why Wikipedia doesn't use PostgreSQL?
Is PG more suitable? Last time I looked at it, both engines were using B+-Trees.
It should provide better performance.
Do you have any benchmarks on that?
Simetrical writes:
Heck, Wikipedia hasn't even upgraded to MySQL 4.1, let alone a whole different DBMS.
We have had 5.1 servers running in production for quite a while (e.g. dewiki's lomaria :-). We were running enwiki slaves on 5.0 a few times too :) It is not like there are any showstoppers for migration at the moment.
Antony writes:
If it is, you'd probably want to use partitioning
Partitioning makes selects faster only when there's parallel execution on multiple partitions at once. PG doesn't have that, MySQL doesn't have that; some commercial PG offshoots (Greenplum?) have it.
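For reference, a rough sketch of what range partitioning looks like in MySQL 5.1 (table and column names are made up for illustration):

  -- hypothetical table, partitioned by rev_id ranges
  CREATE TABLE revision_archive (
    rev_id BIGINT UNSIGNED NOT NULL,
    rev_page INT UNSIGNED NOT NULL,
    rev_timestamp BINARY(14) NOT NULL,
    PRIMARY KEY (rev_id)
  ) ENGINE=InnoDB
  PARTITION BY RANGE (rev_id) (
    PARTITION p0 VALUES LESS THAN (100000000),
    PARTITION p1 VALUES LESS THAN (200000000),
    PARTITION pmax VALUES LESS THAN MAXVALUE
  );

  -- the optimizer can prune partitions for a query like
  --   SELECT ... WHERE rev_id BETWEEN 150000000 AND 160000000;
  -- but it still scans the matching partitions one after another,
  -- there is no parallel scan across partitions.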
Simetrical writes again:
Let's not have a DBMS flame war here, please.
Oh come on, it has been a while. Nowadays we also need people from the NoSQL camp, telling us we should migrate to ultimately scalable Erlang-based key/value/document storages, with lots of JavaScript map/reduce.
Jona writes:
The one thing that is slow is building the indexes after the data has been imported (eight hours or so).
People without enough RAM for efficient B-tree builds can use either the InnoDB Plugin's fast index creation or Tokutek's fractal tree storage (which is commercial software, but has a free license up to 50GB, or for developers, IIRC).
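For illustration, the usual load-first, index-later pattern that the InnoDB Plugin's fast index creation speeds up (table, column and file names are just placeholders):

  -- create the table with only the primary key, load the dump,
  -- then build secondary indexes afterwards; with the InnoDB Plugin
  -- ADD INDEX does a sorted build instead of row-by-row insertion
  CREATE TABLE page_import (
    page_id INT UNSIGNED NOT NULL,
    page_namespace INT NOT NULL,
    page_title VARBINARY(255) NOT NULL,
    PRIMARY KEY (page_id)
  ) ENGINE=InnoDB;

  LOAD DATA INFILE '/tmp/page_import.txt' INTO TABLE page_import;

  ALTER TABLE page_import ADD INDEX ns_title (page_namespace, page_title);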
Ryan asks:
May I ask why you are still using the 4.0 version?
Because it does what we need it to do, is rock-solid and fast enough. Also because someone was lazy with 5.1 build engineering, but now there's one nearly ready for production at lp:~wikimedia
Seems 5.1 and above did provide many performance enhancements?
Yes, some of them are the same ones we have had in our 4.0 builds for years; others are ones we don't really need. We're read-I/O-constrained, and we're doing quite well at that with our current builds.
I don't know if it was supposed to be taken as sarcasm or not, but Tim Starling recently commented that "it seems that I'm the only staff member who knows MySQL." That was a joke, right?
Tim is indeed the only one on the staff who knows really well how to handle replicated MySQL setups, as well as other advanced MySQL topics. Apparently it wasn't only WMF staff running Wikipedia's databases.
Jona comments on utf8:
Yes. The main problem with using UTF-8 for the tables is that MySQL only supports Unicode characters U+0000 .. U+FFFF. Other characters are silently removed, which leads to problems with duplicate page titles etc.
Actually the main problem with using utf8 is that most of the language-specific collations are case-insensitive, which would mean lots of pain with the case-sensitive -> case-insensitive transition (due to how indexes work, it is relatively difficult to have an efficient sorting order different from the equality rules). And yes, characters outside the BMP can be an issue, but we would be hitting that as a problem only in a few page titles.
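A quick sketch of the collation problem with a made-up table, assuming MySQL's utf8 charset and its default case-insensitive collation:

  CREATE TABLE title_test (
    title VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
    UNIQUE KEY (title)
  ) ENGINE=InnoDB;

  INSERT INTO title_test VALUES ('Foo');
  INSERT INTO title_test VALUES ('foo');
  -- the second insert fails with a duplicate-key error, because the
  -- case-insensitive collation treats 'Foo' and 'foo' as equal, while
  -- page titles are case sensitive past the first character; a binary
  -- collation keeps them distinct, but then ORDER BY follows raw code
  -- points rather than language-specific rules.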
Dmitry suggests:
There was a message on the mysql.com site that the Google performance enhancements were incorporated into version 5.4.
Google performance enhancements were also incorporated into version 4.0.40. Not all, but most of the ones we'd need (I/O related; we're not really at a point with our datasets where we would care about SMP performance ;-)
BR, Domas