On Mon, 25 Nov 2002 19:21:57 -0500
"Poor, Edmund W" <Edmund.W.Poor(a)abc.com> wrote:
Nick,
Your idea assumes that the "lag" problem is due to overloading a single
machine, which plays a double role: database backend and web server. So,
if we divide the work among two or more machines, you expect faster
throughput. Right?
My idea is not to divide the roles of web server and backend. It is to
divide the workload of the one server between many servers. This includes
search queries and other functionality.
I envisage many Wikipedia servers around the world, supported by private
individuals, companies and universities, much like the existing system of
mirror FTP and mirror web sites. All these servers are updated in real
time from the core Wikipedia server. From the user's perspective, all are
equivalent.
Each of these servers can do everything the current Wikipedia server can
do except accept update submissions. Updates from users are accepted only
by the core wiki server.
Reasons for such an architecture:
1) Growth of bandwidth usage may put financial pressure on Wikipedia.
Usage may follow a non-linear growth curve.
2) The cost of implementing one very fast, reliable, redundant machine is
greater than the cost of farming out work to many quite fast, unreliable
systems, none of which is mission-critical. This is especially true where
there are people willing to donate part of their hard drive, CPU and net
connection (or even an entire system) to a good cause such as Wikipedia.
(Overall system reliability can be guaranteed by using DNS tricks to
ensure users and queries are directed only to working machines.)
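The "DNS tricks" could be as simple as a health-checking step that only
hands out the addresses of mirrors which answered a recent probe, so a dead
machine drops out of the rotation automatically. A minimal sketch (all
hostnames, addresses and the probe itself are hypothetical):

```python
# Hypothetical sketch: filter a mirror list down to machines that pass a
# health probe, so DNS round-robin never points users at a dead mirror.

def healthy_mirrors(mirrors, probe):
    """mirrors: {hostname: ip}; probe(host) returns True if host is alive."""
    return [ip for host, ip in sorted(mirrors.items()) if probe(host)]

# Stubbed probe results: only the 'up' hosts should survive filtering.
status = {"mirror1.example.org": True,
          "mirror2.example.org": False,
          "mirror3.example.org": True}
mirrors = {"mirror1.example.org": "10.0.0.1",
           "mirror2.example.org": "10.0.0.2",
           "mirror3.example.org": "10.0.0.3"}
alive = healthy_mirrors(mirrors, status.get)
```

In practice the probe would be an HTTP or ping check run periodically, and
the surviving list would be published as the A records for one service name.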
Given Wikipedia's limited technical resources, we need to choose between
making small-scale changes to the system (which may yield a 30-50%
improvement in availability) and making architectural changes which can
scale to improvements of 100x or 1000x magnitude.
The core server takes the update submissions. These are integrated into
the core database. Changes to the core database are reflected in all
mirror servers in real time by pushing each database update out to the
mirror servers.
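The push scheme above can be pictured as a toy model: the core accepts
edits, records them in an ordered log, and replays each committed edit to
every mirror; mirrors apply pushes but never originate edits. This is a
sketch of the idea only, not tied to Wikipedia's actual software or schema:

```python
# Toy model of one-way push replication: the core integrates each edit,
# appends it to a log, and pushes it to every mirror in (near) real time.

class Mirror:
    def __init__(self):
        self.pages = {}              # read-only copy of the core database

    def apply(self, update):         # mirrors apply pushes, never accept edits
        title, text = update
        self.pages[title] = text

class Core:
    def __init__(self, mirrors):
        self.pages = {}
        self.log = []                # ordered log of committed updates
        self.mirrors = mirrors

    def submit(self, title, text):
        self.pages[title] = text     # integrate into the core database
        self.log.append((title, text))
        for m in self.mirrors:       # push the update to all mirrors
            m.apply((title, text))

mirrors = [Mirror(), Mirror()]
core = Core(mirrors)
core.submit("Sandbox", "Hello, world")
```

The log matters: a mirror that was offline can be caught up later by
replaying the entries it missed, in order.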
The core server will implement access control, IP blocking and other such
administration and policy.
I accept that the technological suggestion I made is by no means the only
way to achieve this goal, although it may be a good one for its
scalability potential.
Before mirrors are implemented in the way I suggested, it would be wise
to introduce meta-fields and records into the database: fields which have
no current use but may be used in future Wikipedia software releases.
Future Wikipedia software releases for the mirror servers are guaranteed,
and extra database fields are almost certainly going to be required.
Adding meta-fields can help the forward compatibility of databases. This
would be necessary in advance, as not all mirror servers will update
their whole database and software simultaneously.
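The reserved-field idea can be illustrated with SQLite: create the article
table with spare columns that today's software simply leaves NULL, so an
older mirror can still store rows written by newer software without a
schema change. The table and column names here are purely illustrative,
not Wikipedia's real schema:

```python
# Illustrative only: a table with reserved "meta" columns that current
# software ignores, bought in advance for forward compatibility.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE article (
        title   TEXT PRIMARY KEY,
        text    TEXT,
        meta1   TEXT,    -- reserved for future use
        meta2   TEXT,    -- reserved for future use
        meta3   INTEGER  -- reserved for future use
    )
""")
# Today's software only knows about title and text; the meta columns
# stay NULL until some future release assigns them a meaning.
con.execute("INSERT INTO article (title, text) VALUES (?, ?)",
            ("Sandbox", "Hello, world"))
row = con.execute("SELECT title, text, meta1 FROM article").fetchone()
```

A mirror running this schema can accept pushed rows that populate meta1
without needing a simultaneous software upgrade.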