On Mon, 2003-04-28 at 14:14, Erik Moeller wrote:
> The problem with mailing list discussions is that they can die quickly,
> for many reasons, which can delay things unnecessarily. I've seen many
> situations where a mailing list was used to report a serious problem,
> but the post (in spite of hundreds of members) was ignored.
Reporting a serious problem is all well and good, but isn't the same as
_fixing_ it.
> We all know that the performance issue is one of our most pressing
> problems right now -- many people can't use the site anymore, and the
> international Wikipedians are getting a bit irritated. So I think the
> best way to address this *on time* is to sit down (virtually) and go
> through an agenda.
We don't need to sit and chat. We need *code* and we need a second
server to divide the "must-do-fast" web work and the "chug-chug-chug"
database labor.
Here are some things you can work on if you've got time to spend on
Wikipedia coding:
* Page viewing is still kinda inefficient. Rendering everything on every
view is not so good... Caching can save both processing time in
conversion to HTML, and in various database accesses (checking link
tables, etc) with its associated potential locking overhead.
We need to either be able to cache the HTML of entire pages (followed by
insertion of user-specific data/links or simple options through style
sheet selection or string replacement) or to cache just the generated
HTML of the wiki pages for insertion into the page structure (plus
associated data, like interlanguage links, need to be accessible without
parsing the page).
We need to tell which pages are or aren't cacheable (not a diff, not a
special page, not a history revision, not a user with really weird
display options -- or on the other hand, maybe we _could_ cache those,
if only we can distinguish them), we need to be able to generate and
save the cached material appropriately, we need to make sure it's
invalidated properly, and we need to be able to do mass invalidation
when, for instance, the software is upgraded. Cached pages may be kept
in files, rather than the database.
I should point out that while there are several possible choices here,
any of them is better than what we're running now. We need living,
running _code_, which can then be improved upon later.
* The page saving code is rather inefficient, particularly with how it
deals with the link tables (and potentially buggy -- sometimes pages end
up with their link table entries missing, possibly due to the system
timing out between the main save chunk and the link table update). If
someone would like to work on this, it would be very welcome. Nothing
that needs to be _discussed_, it just needs to be _done_ and changes
checked in.
* Various special pages are so slow they've been disabled. Most of them
could be made much more efficient with better queries and/or by
maintaining summary tables. Some remaining ones are also pretty
inefficient, like the Watchlist. Someone needs to look into these and
make the necessary adjustments to the code. Nothing to _chat_ about; if
you know how to make them more efficient, please rewrite them and check
in the _code_.
* Can MySQL 4 handle fulltext searches better under load? Is boolean
mode faster or slower? Someone needs to test this (Lee has a test rig
with mysql4 already, but as far as I know hasn't tested the fulltext
search with boolean mode yet), and if it's good news, we need to make an
upgrade a high priority. Not much to _chat_ about, it just needs to get
_done_.
* Alternately, would a completely separate search system (not using
MySQL) be more efficient? Or even just running searches on a dedicated
box with a replicated database to keep it from bogging down the main db?
Which leads us back to hardware...
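To make the caching item in the list above a bit more concrete, here is a minimal sketch of the file-based variant, written in Python purely for illustration. It is not the existing Wikipedia code. Every name in it (is_cacheable, FileCache, CACHE_EPOCH, the render callback) is hypothetical, and it only covers the "cache the generated HTML in files, invalidate on save, mass-invalidate on upgrade" part. Interlanguage links and user-specific insertion are left out.

```python
import os
import tempfile
from urllib.parse import quote

# Hypothetical names throughout; an illustration, not the real code base.
CACHE_EPOCH = 1  # bump on a software upgrade to invalidate everything at once


def is_cacheable(action="view", is_special=False, old_revision=False,
                 default_prefs=True):
    """True only for plain views of current pages with default display options."""
    return (action == "view" and not is_special
            and not old_revision and default_prefs)


class FileCache:
    def __init__(self, directory, epoch=CACHE_EPOCH):
        self.directory = directory
        self.epoch = epoch

    def _path(self, title):
        # The epoch is part of the file name, so files written under an
        # older epoch are simply never looked up again.
        name = "%d-%s.html" % (self.epoch, quote(title, safe=""))
        return os.path.join(self.directory, name)

    def get(self, title, render):
        """Return cached HTML for title, calling render(title) on a miss."""
        path = self._path(title)
        try:
            with open(path, encoding="utf-8") as f:
                return f.read()
        except FileNotFoundError:
            html = render(title)
            # Write to a temp file, then rename, so a concurrent reader
            # never sees a half-written page.
            fd, tmp = tempfile.mkstemp(dir=self.directory)
            with os.fdopen(fd, "w", encoding="utf-8") as f:
                f.write(html)
            os.replace(tmp, path)
            return html

    def invalidate(self, title):
        """Drop the cached copy, e.g. right after the page is saved."""
        try:
            os.remove(self._path(title))
        except FileNotFoundError:
            pass
```

The page view path would call get() only when is_cacheable() says yes, and the save path would call invalidate() for the edited page (and, presumably, for pages whose rendering depends on it). Old-epoch files still need to be cleaned off the disk at some point, but nothing will read them in the meantime.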
As for the server, I don't know what's going on here. What I do know is
that Jimbo posted this to wikitech-l in February:
-----Forwarded Message-----
From: Jimmy Wales <jwales(a)bomis.com>
To: wikitech-l(a)wikipedia.org
Subject: [Wikitech-l] Hardware inventory
Date: 07 Feb 2003 02:56:57 -0800
Jason and I are taking stock of our hardware, and I'm going to find a
secondary machine to devote exclusively to doing apache for wikipedia,
i.e. with no other websites on it or anything. I'll loan the machine
to the Wikipedia Foundation until the Foundation has money to buy a
new machine later on this year.
We'll keep the MYSQL where it is, on the powerful machine. The new
machine will be no slouch, either.
Today is Friday, and I think we'll have to wait for Jason to take a
trip to San Diego next week sometime (or the week following) to get
this all set up. (The machine I have in mind is actually in need of
minor repair right now.)
By having this new machine be exclusively wikipedia, I can give the
developers access to it, which is a good thing.
This will *not* involve a "failover to read-only" mechanism, I guess,
but then, it's still going to be a major improvement -- such a
mechanism is really a band-aid on a fundamental problem, anyway.
------
Lots of people think it's a good thing to set up mirror servers all
over the Internet. It's really not that simple. There are issues of
organizational trust with user data, issues with network latency, etc.
Some things should be decentralized, some things should be
centralized.
--- end forwarded message ---
and this to wikipedia-l in March:
-----Forwarded Message-----
From: Jimmy Wales <jwales(a)bomis.com>
To: wikipedia-l(a)wikipedia.org, wikien-l(a)wikipedia.org
Subject: [Wikipedia-l] Off today
Date: 19 Mar 2003 04:47:52 -0800
My wife and little girl are feeling ill today with a cold, so I'm
going to be taking off work to help out. I'm already a little behind
in wikipedia email, so I'll probably be slow for a few days as I dig
out.
We're getting a new (second) machine for wikipedia -- the parts have
been ordered and are being shipped to Jason, and then at some point
soon, he'll drive down to San Diego to install everything.
--Jimbo
--- end forwarded message ---
I e-mailed Jimbo and Jason the other day about this; I haven't heard
back from Jimbo, and Jason still doesn't know anything concrete about
the new server.
Jimbo, we really need some news on this front. If parts and/or a whole
machine really *is* on order and can be set up in the near future, we
need to know that. If it's *not*, then it may be time to pass around the
plate and have interested parties make sure one does get ordered, as had
begun to be discussed prior to the March 19 announcement.
-- brion vibber (brion @ pobox.com)