Hi.
I know ideas for a distributed Wikipedia have been discussed here before, but I haven't seen the following angle before (this may just be because I don't read my mail carefully enough of course):
Let's say you have 10 Wikipedia servers. All of them dispense articles for reading directly. When an article is about to be edited, the title is hashed, and the corresponding server is contacted.
That way, each article "belongs" to one of the X servers, so a lot of consistency problems disappear. That server can in turn notify the others about changes in its "own" articles.
If a server goes down, it's not the end of the world, an Xth of the articles aren't editable for a while. The downed server's "lease" can be revoked after an hour, perhaps.
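To make the idea concrete, here's a rough sketch of how the title-to-server mapping could work. The server names and the lease bookkeeping are placeholders I made up, not anything in the current code:

  <?php
  # Rough sketch only: route an edit to the server that "owns" the article title.
  # The server list and the $downSince lease table are invented placeholders.
  $editServers = array( "wp0.example.org", "wp1.example.org", "wp2.example.org" );

  function ownerOf( $title, $servers ) {
      # Hash the normalized title and map it onto one of the X servers.
      $h = abs( crc32( strtolower( trim( $title ) ) ) );
      return $servers[ $h % count( $servers ) ];
  }

  function serverForEdit( $title, $servers, $downSince ) {
      $owner = ownerOf( $title, $servers );
      # If the owner has been unreachable for over an hour, its "lease" lapses
      # and the next server in the list takes over its articles.
      if ( isset( $downSince[$owner] ) && time() - $downSince[$owner] > 3600 ) {
          $next = ( array_search( $owner, $servers ) + 1 ) % count( $servers );
          return $servers[$next];
      }
      return $owner;
  }
  ?>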
While I don't have the capacity to implement this myself (just fending off the "send patches"), it doesn't appear to be such a gigantic departure from the existing system. (Yes, big. But not rewrite.)
Wouldn't it be worth looking into? It seems increased demand eats up every regular measure taken (kind of like congested highways becoming even more congested when they're expanded).
-- Daniel
Hr. Daniel Mikkelsen wrote:
Hi.
I know ideas for a distributed Wikipedia have been discussed here before, but I haven't seen the following angle before (this may just be because I don't read my mail carefully enough of course):
Let's say you have 10 Wikipedia servers. All of them dispense articles for reading directly. When an article is about to be edited, the title is hashed, and the corresponding server is contacted.
That way, each article "belongs" to one of the X servers, so a lot of consistency problems disappear. That server can in turn notify the others about changes in its "own" articles.
If a server goes down, it's not the end of the world, an Xth of the articles aren't editable for a while. The downed server's "lease" can be revoked after an hour, perhaps.
While I don't have the capacity to implement this myself (just fending off the "send patches"), it doesn't appear to be such a gigantic departure from the existing system. (Yes, big. But not rewrite.)
This sounds like a good idea, but we have two servers. Are you offering to donate the other 8?
Wouldn't it be worth looking into? It seems increased demand eats up every regular measure taken (kind of like congested highways becoming even more congested when they're expanded).
We have a dirt road. When we start building the highway, we can talk about onramps.
-- Tim Starling <../t/starling/physics/unimelb/edu/au>
On Fri, 12 Sep 2003, Tim Starling wrote:
Hr. Daniel Mikkelsen wrote:
I know ideas for a distributed Wikipedia have been discussed here before, but I haven't seen the following angle before (this may just be because I don't read my mail carefully enough of course):
[ Idea. ]
This sounds like a good idea, but we have two servers. Are you offering to donate the other 8?
I have already seen mails here from several people around the world volunteering bandwidth and server space.
Wikipedia is such a useful, likeable and interesting project that I can't imagine it would be hard to find lots of people with large servers and fast lines to help out.
Just think of the guys at universities running those gigantic FTP mirrors.
I don't think the lack of servers is the reason Wikipedia isn't a distributed system.
Wouldn't it be worth looking into? It seems increased demand eats up every regular measure taken (kind of like congested highways becoming even more congested when they're expanded).
We have a dirt road. When we start building the highway, we can talk about onramps.
Building capacity in small incremental steps hasn't given Wikipedia more than a month or two's respite before things are bogged down again. I think a distributed version could more or less do away with the capacity problem. (Since each article lives a separate life with its own history, Wikipedia is actually ideal for running across a bunch of servers.) I don't think clever caching or two new Opteron processors can.
-- Daniel
On Thu, 2003-09-11 at 17:44, Hr. Daniel Mikkelsen wrote:
Let's say you have 10 Wikipedia servers. All of them dispense articles for reading directly. When an article is about to be edited, the title is hashed, and the corresponding server is contacted.
That way, each article "belongs" to one of the X servers, so a lot of consistency problems disappear. That server can in turn notify the others about changes in its "own" articles.
We've got about 3-4 edits per minute. Distributing writes sounds like a lot more trouble than it's worth, possibly leading to all kinds of consistency troubles on shared resources (link tables...).
Tacking on replicated database server(s) to handle read-only requests (hundreds per minute) would be simpler and less fragile (slave dies, just take it out of rotation and -no- pages become inaccessible; master dies, just declare one of the slaves the new master).
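Roughly, the read/write split might look something like this in the PHP layer; the host names and credentials below are placeholders, not anything we actually run:

  <?php
  # Rough sketch of master/slave splitting; host names and credentials are
  # placeholders. All writes go to the one master, reads are spread over slaves.
  $master = "db-master.example.org";
  $slaves = array( "db-slave1.example.org", "db-slave2.example.org" );

  function getReadConnection( $master, $slaves ) {
      # Try the slaves in random order; a dead slave is simply skipped,
      # and if none answer we fall back to reading from the master.
      shuffle( $slaves );
      foreach ( $slaves as $host ) {
          $db = @mysql_connect( $host, "wikiuser", "password" );
          if ( $db ) {
              return $db;
          }
      }
      return mysql_connect( $master, "wikiuser", "password" );
  }

  function getWriteConnection( $master ) {
      # Edits always go to the master, so there is a single write ordering.
      return mysql_connect( $master, "wikiuser", "password" );
  }
  ?>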
Before we bother about anything like that, we just need a decently fast machine for the web server!
-- brion vibber (brion @ pobox.com)
On Fri, Sep 12, 2003 at 02:44:46AM +0200, Hr. Daniel Mikkelsen wrote:
Hi.
I know ideas for a distributed Wikipedia have been discussed here before, but I haven't seen the following angle before (this may just be because I don't read my mail carefully enough of course):
Let's say you have 10 Wikipedia servers. All of them dispense articles for reading directly. When an article is about to be edited, the title is hashed, and the corresponding server is contacted.
That way, each article "belongs" to one of the X servers, so a lot of consistency problems disappear. That server can in turn notify the others about changes in its "own" articles.
If a server goes down, it's not the end of the world, an Xth of the articles aren't editable for a while. The downed server's "lease" can be revoked after an hour, perhaps.
While I don't have the capacity to implement this myself (just fending off the "send patches"), it doesn't appear to be such a gigantic departure from the existing system. (Yes, big. But not rewrite.)
Wouldn't it be worth looking into? It seems increased demand eats up every regular measure taken (kind of like congested highways becoming even more congested when they're expanded).
Such a design has already been proposed (by me, among others), but there would still be a problem with the parts of Wikipedia that require information from all servers: RecentChanges, watchlists, etc. If you have some idea of how to solve that elegantly, please tell us.
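For example, one untested way to approximate a global RecentChanges would be to merge per-server change lists on demand. The sketch below assumes each server exposes an invented /recentchanges.txt feed (one timestamp and title per line); nothing like that exists today:

  <?php
  # Untested sketch: build a global RecentChanges by merging per-server lists.
  # The /recentchanges.txt feed (one "timestamp<TAB>title" per line) is invented.
  function globalRecentChanges( $servers, $limit = 50 ) {
      $rows = array();
      foreach ( $servers as $host ) {
          $lines = @file( "http://$host/recentchanges.txt" );
          if ( !$lines ) {
              continue;   # a dead server just drops out of the merged view
          }
          foreach ( $lines as $line ) {
              list( $ts, $title ) = explode( "\t", rtrim( $line ), 2 );
              $rows[] = array( (int)$ts, $title, $host );
          }
      }
      rsort( $rows );     # arrays compare element by element, so newest first
      return array_slice( $rows, 0, $limit );
  }
  ?>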
Hr. Daniel Mikkelsen wrote:
I know ideas for a distributed Wikipedia have been discussed here before,
I hate to kill good ideas and discussion, and I'd love to work out the technical details of a distributed solution, but it's also extremely frustrating to work on a complicated solution when it turns out that nobody really needs it. So let's run through the old arguments again, like a checklist.
Yes, application-level distribution has been discussed, but even though nobody thinks it is outright impossible, it has never been implemented, because it doesn't solve any problem that we have. The first question that a proposal must answer is: what problem are you trying to solve?
Wikipedia itself is the solution to a problem: how to create a free encyclopedia, especially without getting bogged down in the editorial process of Nupedia. For this, standard components are used, such as Apache, PHP, and MySQL. We could invent our own programming language or database, but we don't, because it could get us sidetracked and make us lose focus on the original goal. However, Wikipedia has developed its own wiki software, abandoning the existing UseModWiki software, so there is a fine line between inventing the new and using the existing. The innovations in the Wikipedia PHP script that set it apart from UseModWiki are mainly aimed at supporting the editorial process (namespaces, statistics, uploads, etc.) and the user experience (skins), not at performance architecture.
As far as I know, there are no distributed wikis in the world. Wikipedia is the world's biggest wiki, and would be the first to need such a solution, but so far runs at a single site.
One "popular" problem to discuss is work load and response times. However, with the current software architecture (PHP + Apache + MySQL) it is possible to distribute the work load over a large number of computers at a single site without any application-level distribution. It is also possible to identify software bottle necks and improve the performance without adding hardware. Wikipedia has been "slow" before, when it had far less work load than today, and some people thought this was the end of the road, but software improvements showed that it was possible to handle the load.
In addition to the technical challenge, application-level distribution brings many new problems with contract law and administration: Who runs each site? What are the legal contracts between those responsible for each site? What to do if a site becomes unavailable or goes out of business? Etc. Those problems are avoided by sticking to a single site. If you propose a distributed solution, you would have to include answers to these administrative and legal questions.
Still, there is a possibility that new kinds of problems can be solved by a distributed architecture. What about the Russian users who have to pay extra for accessing international websites, but can use Russian websites within their flat-fee subscription plan? If this problem is to be addressed by a distributed server inside Russia, you would have to answer who could run it and what it would cost to keep it updated if Internet traffic across the border is not flat rate.