I believe architectural constraints are holding Wikipedia back, both in how many people can use it and in how it can grow.
The current architecture, in which one machine takes the entire burden of all searches, updates and web page delivery, inherently limits the rate at which Wikipedia can grow.
In order for Wikipedia to grow, it needs an architecture which can easily devolve work to other servers. A main database is still required to enforce administrative policy and maintain database consistency.
Work to improve the speed of the database and reduce lag will be of only limited benefit in the long run; at best it will reduce the lag users experience for a few days or weeks.
A method of easily implementing mirror servers with live, real-time updates is required. Each mirror server should provide all the functionality users expect from Wikipedia, except for handling form submissions of updates, which should be forwarded to the master wiki server.
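As an illustration of the forwarding idea, a mirror could hand edit submissions off to the master with something like the PHP sketch below. The endpoint URL and the assumption that edits arrive as a plain POST are placeholders of mine, not the real wiki script:

<?php
// Sketch: a mirror receives an edit form submission and forwards it,
// unchanged, to the master wiki server instead of writing to its own
// read-only copy of the database. The URL below is a placeholder.

$masterEditUrl = 'http://master.wikipedia.org/wiki.phtml?' . $_SERVER['QUERY_STRING'];

$ch = curl_init($masterEditUrl);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($_POST));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

$response = curl_exec($ch);
$status   = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

// Relay the master's answer (or an error) back to the user.
if ($response === false || $status >= 400) {
    http_response_code(502);
    echo 'The master server could not accept this edit right now.';
} else {
    echo $response;
}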
The main database server should be relieved of the burden of serving web pages and concentrate on running administrative code and on processing and posting database updates.
The update system can be achieved by either: 1) the main server creating SQL files of incremental changes to be emailed to mirror servers, signed with a key pair and sequentially numbered so that they are automatically processed in order. This way the server can run asynchronously with the mirrors, which is better for the reliability of the server: it will not need to wait for connection responses from the mirrors, and updates will be cached in the mail system if a mirror server is unavailable. (The main server then only needs to create one email per update; the mail system infrastructure takes care of sending the data to each mirror. In fact, a system such as the Pipermail setup used on this list would solve the problem wonderfully. Mirror admins simply subscribe to the list to get all updates sent to their machine, and can manually download any updates they are missing from the list archive!) A rough sketch of the mirror-side processing appears after option 2 below.
Or 2) the master server opening a connection directly to the SQL daemon on each remote machine, in which case the server will need to track which updates each mirror has and has not received, and will need to wait for time-outs on non-operational mirrors. (This approach may also open exploits on the server via the SQL interface.)
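For option 1, the mirror side could look roughly like this. The file naming scheme, paths, credentials and the choice of OpenSSL signatures are all assumptions made for illustration; the point is only the signature check and the strict sequence-number ordering:

<?php
// Sketch of a mirror-side processor for option 1: apply signed,
// sequentially numbered SQL update files in strict order.
// All paths, names and credentials are placeholders.

$updateDir = '/var/spool/wiki-updates';      // where incoming update mails are unpacked
$stateFile = $updateDir . '/last-applied';   // highest sequence number applied so far
$publicKey = file_get_contents('/etc/wiki-mirror/master-pub.pem');

$pdo  = new PDO('mysql:host=localhost;dbname=wikidb', 'mirror', 'secret',
                [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);
$last = (int) @file_get_contents($stateFile);

while (true) {
    $next    = $last + 1;
    $sqlFile = sprintf('%s/update-%08d.sql', $updateDir, $next);
    $sigFile = $sqlFile . '.sig';

    // Stop at the first gap: updates must be applied strictly in order.
    if (!file_exists($sqlFile) || !file_exists($sigFile)) {
        break;
    }

    $sql = file_get_contents($sqlFile);
    $sig = file_get_contents($sigFile);

    // Verify the master's signature before touching the database.
    if (openssl_verify($sql, $sig, $publicKey, OPENSSL_ALGO_SHA256) !== 1) {
        fwrite(STDERR, "Bad signature on update $next, refusing to apply it.\n");
        break;
    }

    // Apply the incremental changes (a real implementation would split
    // the file into individual statements and apply them in a transaction).
    $pdo->exec($sql);
    file_put_contents($stateFile, $next);
    $last = $next;
}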
Here's my take:
Split web servers and DB servers.
Use 1 master DB; only write queries go to this DB. As needed, add slave servers, which only handle read queries.
Add web servers as needed.
This is the easiest way to go about it, using MySQL's built-in replication feature. It makes the most sense in my book, too.
The only thing needed to make Wikipedia work like this is a DB connection library that looks at an SQL statement and routes it to where it's supposed to go. I wrote a DB library for MySQL in PHP once that did all this; it's pretty cool, if I may say so. If you are interested I'll send you the code. It's part of a much bigger project, but I figure any decent PHP programmer should be able to grasp the concept of it. It might not be super efficient, because I wrote it when I didn't know many tricks and was still learning, but it works. If anyone is interested, let me know. The core of the idea is sketched below.
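A minimal sketch of the routing concept, assuming a MySQL master and one read-only slave (the class name, hostnames, credentials and table names are made up for illustration; this is not the actual library code):

<?php
// Sketch: route read queries to a replication slave and everything else
// to the master. Hostnames and credentials are placeholders; a real
// library would also handle multiple slaves, failover and transactions.

class ReplicatedDB
{
    private $master;
    private $slave;

    public function __construct()
    {
        $opts = [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION];
        $this->master = new PDO('mysql:host=db-master;dbname=wikidb', 'wiki', 'secret', $opts);
        $this->slave  = new PDO('mysql:host=db-slave1;dbname=wikidb', 'wiki', 'secret', $opts);
    }

    // Look at the SQL statement and route it to the right server:
    // SELECT/SHOW/DESCRIBE/EXPLAIN go to the slave, everything else to the master.
    public function query($sql)
    {
        $isRead = preg_match('/^\s*(SELECT|SHOW|DESCRIBE|EXPLAIN)\b/i', $sql) === 1;
        $conn = $isRead ? $this->slave : $this->master;
        return $conn->query($sql);
    }
}

// Usage: web servers talk to one object and never care which server answers.
$db = new ReplicatedDB();
$db->query("UPDATE articles SET hits = hits + 1 WHERE id = 42");        // goes to the master
$rows = $db->query("SELECT title FROM articles LIMIT 10")->fetchAll();  // goes to the slave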
Lightning
----- Original Message -----
From: "Nick Hill" nick@nickhill.co.uk
To: wikitech-l@wikipedia.org
Sent: Monday, November 25, 2002 4:53 PM
Subject: [Wikitech-l] Long term plans for scalability