Hi!
The meat of the idea seems to be to use distributed hash tables to allow the main database to be moved onto multiple mostly-independent computers (i.e. break away from the inefficient MySQL replication/cluster model).
DHTs aren't a holy grail either. Google uses InnoDB for some of its critical apps too, as do other major shops (though everybody knows about BigTable!). If our only data access method were getByKey(), we'd think about other types of storage, but it is not.
MySQL has the "Cluster" product, which allows distributing data over multiple boxes, but its methods for doing joins, sorts, etc. are not that efficient.
Of course, right now we already have multiple mostly-independent computers for revision text storage (as it, obviously, needs only getByKey() access ;-)).
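To make the getByKey() point concrete, here is a rough sketch of a key-sharded store. It is not our actual revision text storage code: the `ShardedStore` class, the `rev:` key format and the node count are all made up for illustration. A key lookup touches exactly one node, while anything else has to visit all of them.

```python
import hashlib


class ShardedStore:
    """Toy getByKey()-only store spread over several independent nodes.

    Hypothetical illustration only; each dict stands in for one storage box.
    """

    def __init__(self, node_count):
        self.nodes = [{} for _ in range(node_count)]

    def _node_for(self, key):
        # Hash the key and pick a node from the hash alone, so no node
        # needs to know anything about the others.
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.nodes)
        return self.nodes[index]

    def put(self, key, value):
        self._node_for(key)[key] = value

    def get_by_key(self, key):
        # A key lookup goes to exactly one node.
        return self._node_for(key).get(key)

    def scan(self, predicate):
        # Anything that is not a key lookup (filters, sorts, joins)
        # has to visit every node and merge the results.
        return [v for node in self.nodes for v in node.values() if predicate(v)]


store = ShardedStore(node_count=4)
store.put("rev:1001", "text of revision 1001")
store.put("rev:1002", "text of revision 1002")
print(store.get_by_key("rev:1001"))
```

The `scan()` method is the part that hurts: it is what every join or sort would turn into, which is why this layout fits revision text and not the main database.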
This is absolutely something which should be done. Wikipedia's data model screams for the adoption of this solution.
Wikipedia's data model can always use more appropriate tools, if they existed. ;-)
I question the benefit of then allowing untrusted third parties to run the servers, though, because at the end of the paper you acknowledge that all the data is going to have to pass back through trusted parties anyway.
Trust is a fairly complex issue: if the incoming request is HTTP, it carries private information in the form of (source IP, destination page). That means someone could set up a network of wiki@home extended clients and find out who is browsing the questionable articles. In the case of geo-proximity, that may be an issue.
Once you've achieved an approximately linear scaling of the database servers, which the appropriate use of DHTs will do, it seems to me that the costs of downloading the data from untrusted third parties (doubling the bandwidth) and checking the signatures (eating up CPU) are going to be nearly as great as the cost of simply adding another database server.
Scaling databases with the current dataset and access patterns means adding another database server. The only issue is the enwiki master, which is not a bottleneck [yet]. I'm not against more efficiency, though.
Let the end-user software check the signatures.
Most of our requests come from anonymous internet users. End-user software is out of the question.
Best regards,