Re: [Wikitech-l] Distributed content hosting

19 Feb 2007


      Hi!
...
The meat of the idea seems to be to use distributed hash tables to
allow the main database to be moved onto multiple mostly-independent
computers (i.e. break away from the inefficient MySQL
replication/cluster model).
DHTs aren't a holy grail either. Google somehow uses InnoDB too for
their critical apps, as do other major shops (even though everybody
knows about BigTable!)
If our only data access method were getByKey(), we'd think about
other types of storage, but it is not.
MySQL has the "cluster" product, which allows distributing data over
multiple boxes, but it makes joins, sorts, etc. considerably less
efficient.
Of course, right now we do have multiple mostly-independent computers
for revision text storage (as it, obviously, needs only getByKey()
access) ;-)
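
A minimal sketch of why getByKey()-only data spreads so easily over
independent boxes. Everything here is hypothetical (the class name, the
"rev:12345" key format, dicts standing in for storage nodes), and it uses
plain hash-modulo placement rather than a real DHT:

```python
import hashlib


class ShardedTextStore:
    """Toy key-value store spread over independent nodes (dicts here)."""

    def __init__(self, node_count):
        # Each dict stands in for one independent storage box.
        self.nodes = [{} for _ in range(node_count)]

    def _node_for(self, key):
        # The key alone decides which node holds the value,
        # so no node ever needs to talk to another one.
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.nodes)
        return self.nodes[index]

    def put(self, key, text):
        self._node_for(key)[key] = text

    def get_by_key(self, key):
        return self._node_for(key).get(key)


store = ShardedTextStore(node_count=4)
store.put("rev:12345", "Some revision text")
print(store.get_by_key("rev:12345"))
```

A join or a sort has no single key to route on, so it would have to touch
every node, which is where this kind of layout stops being cheap.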
...
This is absolutely something which should
be done.  Wikipedia's data model screams for the adoption of this
solution.
Wikipedia's data model could always use more appropriate tools, if
they existed. ;-)
...
I question the benefit of then allowing untrusted third parties to run
the servers, though, because at the end of the paper you acknowledge
that all the data is going to have to pass back through trusted
parties anyway.
Trust is a fairly complex issue: if the incoming request is HTTP, it
contains private information in the form of (source IP, destination page).
That means someone could set up a network of wiki@home extended clients
and find out who is browsing the questionable articles. With
geo-proximity routing, that may be an issue.
...
Once you've achieved an approximately linear scaling of the
database servers, which the appropriate use of DHTs will do, it seems
to me that the costs of downloading the data from untrusted third
parties (doubling the bandwidth) and checking the signatures (eating
up CPU) is going to be nearly as great as the cost of simply adding
another database server.
Scaling databases with the current dataset and access patterns means
adding another database server.
The only issue is the enwiki master, which is not a bottleneck [yet]. I'm
not against adding more efficiency, though.
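
To make the quoted cost argument concrete, here is a rough sketch of
checking content fetched from an untrusted host. It assumes the trusted
side publishes a SHA-256 digest per page; a plain hash comparison stands
in for a real public-key signature, and all names are hypothetical:

```python
import hashlib
import hmac


def digest_of(content: bytes) -> str:
    # Hashing reads every byte, so CPU cost grows with page size.
    return hashlib.sha256(content).hexdigest()


def verify_untrusted(content: bytes, trusted_digest: str) -> bool:
    # The full content has to be downloaded before it can be checked.
    return hmac.compare_digest(digest_of(content), trusted_digest)


page = b"Article text as stored on the trusted servers"
trusted = digest_of(page)  # published by the trusted party

print(verify_untrusted(page, trusted))
print(verify_untrusted(page + b" tampered", trusted))
```

Whoever runs this check pays for both the download and the hashing, which
is the overhead the quoted paragraph compares against adding a server.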
...
Let the end-user software check the signatures.
Most of our requests come from anonymous internet users. End-user
software is out of the question.
Best regards,
-- 
Domas Mituzas -- http://dammit.lt/ -- [[user:midom]]
