Daniel P.B.Smith wrote:
My personal observations/superstitions, which are NOT based on any technical insight. My mental model is that different Wikipedia operations require different capabilities, and perhaps in some cases different servers, and therefore degrade differently under heavy load.
a) For some reason, reloading a page after saving an edit is one of the LEAST reliable operations and quickly degrades under load. The failure can take three different forms. The most common is that the browser times out. Lately, a very common one has been an error message saying the database is not available. A third form is that the supposedly edited page was reloaded, but shows the original text rather than the edited text.
That tallies pretty well with the technical situation. We have a single master database server and a number of slave servers. The slave servers repeat every write query executed on the master, but only after the query has finished executing on the master. There is only a single thread writing to each slave, so a long-running write causes every slave to lag behind the master. Most of the time this doesn't matter, because most reads can be taken from a copy of the database which is a few seconds old. But after a save, the next article view is taken from a slave, and that slave must have replicated writes up to and including your last save. Apache will wait for about 10 seconds before giving up and using the out-of-date copy. There should be a warning in the page footer. This behaviour has changed in the last few days; the other two scenarios are quite plausible.
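To make the wait-then-fall-back behaviour concrete, here is a minimal sketch in Python. It is not the actual MediaWiki code: the function name, the idea of comparing numeric replication positions, and the polling interval are all assumptions made for illustration. Only the "wait about 10 seconds, then use the out-of-date copy" logic comes from the description above.

```python
import time


def wait_for_slave(slave_position, position_at_save, timeout=10.0,
                   poll_interval=0.5, clock=time.monotonic, sleep=time.sleep):
    """Return True if the slave caught up in time, False if we gave up.

    slave_position   -- hypothetical callable returning how far the slave
                        has replicated (any comparable value)
    position_at_save -- the master's position just after the user's save
    timeout          -- how long to wait before accepting a stale copy
    """
    deadline = clock() + timeout
    while True:
        if slave_position() >= position_at_save:
            return True   # slave has replicated the save; read is fresh
        if clock() >= deadline:
            return False  # give up; caller serves the out-of-date copy
        sleep(poll_interval)


def view_after_save(slave_position, position_at_save, **kwargs):
    """Describe which copy of the page the reader would get."""
    if wait_for_slave(slave_position, position_at_save, **kwargs):
        return "current text"
    return "out-of-date text (with a warning in the footer)"
```

A long-running write on the master delays every slave, so under load the second branch is the one a user is more likely to hit: the edit is saved, but the page shown afterwards is the old version.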
In ALL THREE cases, there's much better than a 50% chance that the edit actually took place.
Sometimes the save may be rejected, if the apaches are too heavily loaded to accept the connection, but usually it will go through.
In ALL THREE cases, the most sensible thing to do is save the edited Wiki-marked-up text of the whole article somewhere locally, wait about five minutes, then try to view the page and see whether the edit took.
b) A very common symptom under heavy load is that actions "take," but do not become "visible" for many minutes. For example, when preferences change spontaneously (an infuriating thing which hits me about every two weeks) and I change them back, my changes USUALLY take effect--but do not become visible for many minutes afterwards. Until I discovered this, I was utterly baffled: I would keep trying various things to "fix" the problem, including every possible combination of clearing caches I could think of. By the time the changes took effect several minutes later, I had tried several other things and naturally assumed that it was the last thing I'd tried, rather than the first, that had worked.
c) For reasons that baffle me, the "Go" button creates some kind of search string and does some kind of search. Under heavy load, all search operations degrade. This means that manually typing
http://en.wikipedia.org/wiki/Foobar
is much more reliable in retrieving the Foobar article than typing "Foobar" into the search box and pressing "Go."
When things are slow, I usually copy the http://en.wikipedia.org/wiki/ text out of the address box for handy pasting in later.
Go is quite a bit faster than search if it finds a match: it just checks for the existence of the article with a few different capitalisations. However, it's necessarily slower than just typing the URL.
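Roughly, the Go check can be pictured like the Python sketch below. This is an illustration only, not the real implementation: the function name, the `article_exists` callback, and the particular list of capitalisation variants are assumptions, since the exact variants tried are not spelled out here.

```python
def go_lookup(query, article_exists):
    """Return the first capitalisation variant that names an existing
    article, or None if none match (the caller would then fall back to
    the slower full search).

    article_exists -- hypothetical callable: title -> bool
    """
    # Assumed variants, for illustration only.
    candidates = [
        query,                             # exactly as typed
        query[:1].upper() + query[1:],     # first letter capitalised
        query.title(),                     # Each Word Capitalised
        query.lower(),                     # all lower case
        query.upper(),                     # ALL UPPER CASE
    ]
    seen = set()
    for title in candidates:
        if title in seen:
            continue  # skip duplicates so each title is checked once
        seen.add(title)
        if article_exists(title):
            return title
    return None
```

Each variant costs one existence check, which is why Go is cheap compared with a search but still does more work than fetching a typed URL, where the title is already known.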
d) I am trying to learn to welcome the slowdowns as nature's way of reminding me that I'm spending way too much time on Wikipedia. Of course, the most infuriating situations are the ones when I MEAN to say "this article should definitely not be," press SAVE, see that I accidentally typed "this article should definitely be", press EDIT, fix it, and get a database error. In these situations, I console myself with the thought that nobody really cares that much about my opinion.
It would be all too easy for me to take notice of complaints like these, and put in a lot of work. But I have PhD work to do; I have to be strong.
-- Tim Starling