Re: [Mediawiki-l] Alternative implementations

30 Jul 2008

Hello,
...
  I am one of the co-authors of the IEEE Scale paper.

Nice work! My hats are off :)

...
  According to the Wikipedia statistics, 95% of the requests are handled
  by the squids. And scaling out the squids is not a hard problem. That is
  why we only looked at the render farm and the databases.
That's for anonymous users. Logged-in users still hit the cluster.

Actually, 0.2% of the requests hitting the backend are saves (surprise!!!!)
There's lots of other stuff being done, like previews, searches, watchlists,
actual browsing, various meta stuff, etc.

...
  As I understand the MySQL setup, you are running on a replicated MySQL
  database. Read requests can be answered by any replica and write
  requests go to all replicas. -> Adding more nodes does not increase your
  write capacity.
That's exactly right. We definitely have a scaling bottleneck there. Our
easiest way to scale writes is splitting off languages, and it will
take quite some time to hit trouble on any non-English language (and
English is doing fine at the moment too, with relatively low-grade
database hardware).
Actually, we used to hit a scaling bottleneck once upon a time, back when
we were saving revision texts in the core database (and actually in the
main tables, in pre-1.5 times). It was a single-day (or 15-minute, to add
some dramatic effect) hack to move that stuff out of the core databases.
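
The idea of moving revision text out of the core database can be sketched
roughly as below. This is only an illustration of the pattern, not the
actual MediaWiki code: the class names (BlobStore, CoreDB), the pointer
format and the field names are all made up for the example.

```python
class BlobStore:
    """Toy key-value store standing in for a separate text-storage cluster.

    Hypothetical: a real deployment would be a set of dedicated servers.
    """

    def __init__(self):
        self._blobs = {}
        self._next_id = 1

    def put(self, text):
        # Hand out an opaque key; the caller stores only this pointer.
        key = f"blob:{self._next_id}"
        self._next_id += 1
        self._blobs[key] = text
        return key

    def get(self, key):
        return self._blobs[key]


class CoreDB:
    """Core database keeps small metadata rows plus a pointer to the text."""

    def __init__(self, blobs):
        self.blobs = blobs
        self.revisions = {}

    def save_revision(self, rev_id, page_id, text):
        # The bulky text goes to the blob store, not into the core row.
        pointer = self.blobs.put(text)
        self.revisions[rev_id] = {
            "page_id": page_id,
            "text_pointer": pointer,
            "size": len(text),
        }

    def load_text(self, rev_id):
        return self.blobs.get(self.revisions[rev_id]["text_pointer"])
```

The core rows stay small, so the core database only carries metadata
writes, and the text store can be scaled on its own.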

...
  In our setup the replication degree is fixed. Every item is stored k
  times, no matter how many nodes you are using. So the write capacity is
  increasing with the number of database nodes.
That's indeed nice for anything that requires simple key-value storage
(and we definitely have such components, such as revision text
storage, or various other simple metadata).
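
To put rough numbers on the two replication models discussed so far, here
is a back-of-the-envelope sketch. It assumes uniform load and ignores
coordination overhead; the per-node write rate and the function names are
hypothetical, picked only for the illustration.

```python
def full_replication_write_capacity(nodes, per_node_writes):
    """Every write goes to every replica, so extra nodes add no write capacity."""
    if nodes < 1:
        raise ValueError("need at least one node")
    return per_node_writes


def fixed_degree_write_capacity(nodes, per_node_writes, k):
    """Every item is stored k times, so each logical write costs k node writes."""
    if not 1 <= k <= nodes:
        raise ValueError("k must be between 1 and the number of nodes")
    return nodes * per_node_writes / k


if __name__ == "__main__":
    # Assumed figure: each node sustains 1000 writes per second.
    for n in (4, 16, 64):
        print(n,
              full_replication_write_capacity(n, 1000.0),
              fixed_degree_write_capacity(n, 1000.0, k=4))
```

Under these assumptions the fully replicated setup stays flat as nodes
are added, while the fixed-degree setup grows linearly with the node count.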

...
  When you update a page you have to update several of these maps. But
  that is what the transactions are for.
Well, in this case, we end up with plenty of maps:

Pages by unique ID, pages by name and title, pages by random value
(ha!), pages by length (just in case).
Every page then has multiple revisions, which are saved by page, by  
id, by timestamp, by timestamp-per-page, by timestamp-per-user, and by  
timestamp-per-usertext (for anons, mostly).
Every page then links to other pages, and is linked from other pages.
Every page then is embedded as a template somewhere else, or embeds  
templates.
Every page is in a category, or is an actual category.
Every page has broken links, which are tracked too.
Every page has images that have to be tracked.
Every page has external links.
Every page has ...
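
A toy sketch of what a single save means once all of that lives in
key-value maps. The map names and the MapStore class are hypothetical and
cover only a handful of the maps listed above; a real system would also
wrap all the puts in one transaction.

```python
from collections import defaultdict


class MapStore:
    """Toy stand-in for a key-value store holding several named maps."""

    def __init__(self):
        self.maps = defaultdict(dict)
        self.writes = 0

    def put(self, map_name, key, value):
        self.maps[map_name][key] = value
        self.writes += 1


def save_revision(store, page_id, title, rev_id, timestamp, user, links):
    """Record one new revision and update every map that refers to it.

    Returns the total number of map writes performed on the store so far.
    """
    store.put("page_by_id", page_id, title)
    store.put("page_by_title", title, page_id)
    store.put("rev_by_id", rev_id, page_id)
    store.put("rev_by_page_timestamp", (page_id, timestamp), rev_id)
    store.put("rev_by_user_timestamp", (user, timestamp), rev_id)
    for target in links:
        # Links are tracked in both directions.
        store.put("links_from", (page_id, target), True)
        store.put("links_to", (target, page_id), True)
    return store.writes
```

Even this cut-down version needs five writes plus two per link for one
page save, and each of those maps may live on a different node.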

And the biggest problem is that each of these maps needs range scans or
multiple reference reads. This leads to reading from 100 nodes (unless
lots of clever data clustering is employed) for every read done.
Of course, the complexity of writes is much bigger too, compared to when
you don't go after infinite scaling.
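
The range scan problem can be illustrated with a small sketch. It assumes
plain hash partitioning over 100 nodes and integer keys (think
timestamps), and compares it with a layout where adjacent keys are stored
together. The function names and the node count are made up for the
example.

```python
import hashlib

NODES = 100  # hypothetical cluster size


def node_for(key, nodes=NODES):
    """Hash partitioning: spreads keys evenly but ignores their order."""
    digest = hashlib.sha1(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % nodes


def nodes_touched_hashed(keys, nodes=NODES):
    """Distinct nodes a scan over the given keys has to contact."""
    return len({node_for(k, nodes) for k in keys})


def nodes_touched_clustered(keys, keys_per_node):
    """Same scan if adjacent integer keys were stored on the same node."""
    return len({k // keys_per_node for k in keys})


if __name__ == "__main__":
    scan = range(1000, 1500)  # 500 consecutive keys
    print("hashed:", nodes_touched_hashed(scan))
    print("clustered:", nodes_touched_clustered(scan, keys_per_node=1000))
```

With hashing, a scan over a few hundred consecutive keys ends up talking
to nearly every node, while the clustered layout keeps it on one or two.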

...
  We looked at several scenarios here. You could run a p2p system within
  your data center because it scales better, is easier to maintain, etc.
  You could run one p2p overlay over several datacenters (here: Florida,
  Amsterdam, South Korea). Then you have to take care of data placement
  and network partitioning. Or you could run the p2p overlay over the
  users' pcs. But then you run into trust issues.
Nah, users' PCs are out of the question. We'd really want to see nice
scalable stores for some of our data, which works great with key-value
stores.
As well, we can probably offload some of the biggest maps to somewhere
'out there'.

Another problem is that servers die in batches. It is much easier to
put one database server on one power feed and another database server
on another, or place them in separate datacenters.
Once you have 1000 nodes that have to have HA characteristics as a
whole, a lack of understanding of which data goes where (think
availability zones) can lead to data loss, either temporary or
permanent.
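
A minimal sketch of what placement with failure domains in mind could
look like, assuming nodes are grouped by power feed or datacenter. The
zone and node names, and the round-robin choice, are hypothetical; this
is not how any particular store does it.

```python
def place_replicas(item, nodes_by_zone, k):
    """Pick k nodes for an item, each from a different zone."""
    zones = sorted(nodes_by_zone)
    if k > len(zones):
        raise ValueError("not enough zones for k independent replicas")
    start = item % len(zones)
    chosen = []
    for i in range(k):
        # Walk the zones round-robin so no zone is used twice per item.
        zone = zones[(start + i) % len(zones)]
        members = nodes_by_zone[zone]
        chosen.append(members[item % len(members)])
    return chosen


def survives_zone_failure(replicas, nodes_by_zone, failed_zone):
    """True if at least one replica lives outside the failed zone."""
    dead = set(nodes_by_zone[failed_zone])
    return any(r not in dead for r in replicas)


if __name__ == "__main__":
    zones = {
        "feed-a": ["db1", "db2"],
        "feed-b": ["db3", "db4"],
        "feed-c": ["db5", "db6"],
    }
    replicas = place_replicas(42, zones, k=2)
    print(replicas, survives_zone_failure(replicas, zones, "feed-a"))
```

If both copies of an item had landed on db1 and db2, losing feed-a would
take the item offline, which is exactly the batch failure case.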

Of course, it is a matter of engineering, but many such problems are
not resolved at the research stage.

-- 
Domas Mituzas -- http://dammit.lt/ -- [[user:midom]]
