Hi all,
I am one of the co-authors of the IEEE Scale paper.
On Wednesday 23 July 2008, Dirk Riehle wrote:
> Domas, thanks for your insights!
>
> There is a fair amount of work on putting p2p architectures under wiki
> engines, but this work was the first to gain broader recognition, i.e.
> win a prize at the IEEE Scale 2008 conference. So I'm assuming the work
> is technically sound, even if it may not consider all the various
> aspects of a real application. I asked one of the original authors to
> comment on which of the issues you mention won't work well with their
> architecture or whether they could easily be tacked on. Let's see whether
> they'll show up.
>> So, for now we have the task not to scale out writes, but to scale
>> reads (and read functionality) and maintain writes :)
According to the Wikipedia statistics, 95% of the requests are handled by the
squids, and scaling out the squids is not a hard problem. That is why we only
looked at the render farm and the databases.
As I understand the MySQL setup, you are running on a replicated MySQL
database: read requests can be answered by any replica, but write requests go
to all replicas, so adding more nodes does not increase your write capacity.
In our setup the replication degree is fixed: every item is stored k times, no
matter how many nodes you are using. So the write capacity increases with the
number of database nodes.
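
To make the difference concrete, here is a toy capacity model in Python. The
per-node throughput and the degree k=3 are made-up illustration values, not
measurements from either system:

def write_capacity(nodes, replicas_per_item, node_ops=1000):
    # Every logical write must be applied on each of its replicas,
    # so the cluster sustains nodes * node_ops / replicas_per_item
    # logical writes per second in aggregate.
    return nodes * node_ops / replicas_per_item

for n in (4, 8, 16):
    full = write_capacity(n, replicas_per_item=n)  # MySQL-style: every node stores everything
    fixed = write_capacity(n, replicas_per_item=3)  # fixed replication degree, e.g. k=3
    print(f"{n:2d} nodes: full replication {full:6.0f} w/s, fixed k {fixed:6.0f} w/s")

With full replication the aggregate stays flat no matter how many nodes you
add; with a fixed k it grows roughly linearly.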
>>
>> P2P designs work great for isolated data, our data is very
>> interdependent (media, templates, links, categories, etc). It is
>> difficult to establish data clustering easily, as there're multiple
>> views from multiple directions.
>> Now, once the P2P architecture has to maintain all that, I'd like to
>> see what performs better in reasonable scaling requirements...
That is indeed a problem. Our store only supports key-value pairs. You can see
it as a large map/dictionary. We basically denormalized the SQL schema.
So we have one map that maps title names to their content (a list of
versions):
"title name" -> [page_content]
Another map stores the pages belonging to a category:
"category name" -> [title names]
You can add most features in this way.
When you update a page, you have to update several of these maps. But that is
what the transactions are for.
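
To sketch what such an update looks like (plain Python dicts standing in for
the distributed store, and a made-up transaction() helper rather than our
real API):

from contextlib import contextmanager

# Plain dicts standing in for the replicated key-value store.
pages = {}       # "title name" -> [page_content]  (list of versions)
categories = {}  # "category name" -> [title names]

@contextmanager
def transaction():
    # Placeholder: the real store commits atomically and aborts/retries
    # on conflicts; this sketch just runs the body.
    yield

def save_page(title, content, category):
    # One logical edit touches several denormalized maps, so it has to
    # run inside a single transaction to keep them consistent.
    with transaction():
        pages.setdefault(title, []).append(content)
        members = categories.setdefault(category, [])
        if title not in members:
            members.append(title)

save_page("Alan Turing", "Alan Turing was a mathematician ...",
          "Computer scientists")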
>>> This is a research project, but if their numbers are right, they are an
>>> order of magnitude faster and leaner. Organizational and legal
>>> implications aside, a p2p architecture like the Internet itself is
>>> really what you would want for a next generation MediaWiki.
We looked at several scenarios here. You could run a p2p system within your
data center because it scales better, is easier to maintain, etc. You could
run one p2p overlay over several data centers (here: Florida, Amsterdam, South
Korea); then you have to take care of data placement and network
partitioning. Or you could run the p2p overlay over the users' PCs, but then
you run into trust issues.
Thorsten