Today, Friday, the front page of the English Wikipedia has been fast
all day.
Another page (I monitor http://www.wikipedia.com/wiki/Sweden) was slow
for one period of 30 minutes (09:30-10:00 am GMT) and another period
of two hours (11:40-13:50 GMT). Some other URLs on the international
Wikipedias were also affected at the same time. This might be due to
maintenance or work being done on the scripts.
Subtract 7 hours from GMT to get the server's local time zone
(PDT = GMT -0700).
Apart from these two limited intervals, every URL that I monitor has
been fast all day, including the recent changes pages.
I'm very happy with this, and hope Brion and Jimmy (and who else?)
will soon get the talk namespace links back without hurting
performance. (But hey, never make big fixes five minutes before you
leave for the weekend! Better just leave it as is if you have to go.)
And now for some more relaxed Friday reading, actually related to
performance problems. (The following analysis might be politically
slanted. Don't take it too seriously.) The Swedish parliament
elections are coming up in September, so the political parties are
starting up their campaigns. The problem is there are no big issues
to fight about. The four non-socialist parties have unusually boring
candidates (Dukakis style), and everybody expects the current
social-democratic government to win. The single issue that seems to
be coming up is the national sick leave insurance, which is paid for
with tax money and is far over budget. This is linked to the fact that
"burn-out" is now an accepted medical diagnosis for which you are
allowed to take a long sick leave at the taxpayers' expense. You
would expect such welfare excesses to be on the social democrat
agenda, and that non-socialists would call for tax cuts and a balanced
budget. However, the current s-d govt has been doing a great job
balancing the budget, and they will now have to deal with cutting back
this overgenerous sick leave compensation without hurting their
voters' feelings. Tough job. The Christian-democratic party's
candidate has already hurt a lot of feelings by claiming that "some"
of those receiving compensation are "cheating the system". That might
be true, but accusing "some" (who? me?) is obviously not the way to
attract voters. This issue now has media attention and some
interesting example cases are reported.
Like this one: Attorneys in Swedish district courts have been
right-sized in the past years, as part of balancing the budget. This
means that as soon as one gets sick, the rest get too much to do,
leading to stress and burn-out, which leads to more sick leaves.
Think of the court cases as HTTP requests arriving to Wikipedia.
There are some processes/attorneys there to handle the cases, but for
some reason one process gets blocked and cannot work. This leaves
more work for the remaining workers, but they are probably waiting for
the first process to get finished and unlock the resources (database
records?) that it is using. If processes are allowed to go to sleep
waiting for each other, the work will pile up. It will never end.
So, what is the solution? Throwing more attorneys at the problem?
Maybe, but more likely the work processes should be redesigned and
simplified. That allows the available attorneys to finish up a case
and take on the next one. Some of their tasks are more important than
others, but the performance or throughput of the system depends on
cutting away or redesigning the most time-consuming tasks. The high
degree of sick leave is an indicator of system design flaws (albeit an
indirect one), and thus not altogether bad.
In the same way, a high "load average" (as reported by the "uptime" or
"top" commands) is one indicator that the Wikipedia system is flawed.
The load average in a UNIX system is the number of processes that are
ready to run, waiting for the CPU to become available. Unfortunately,
most of them are just waiting to see whether the resource they want has
become available. If it has not (e.g. a database record is still
locked), they go back to the end of the line and wait again. Do
you remember those bread shop waiting lines in Soviet Russia?
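For reference, the load average can be inspected like this (one small caveat to the description above: on Linux the figure also counts processes in uninterruptible sleep, typically blocked on disk or other I/O, not only those waiting for the CPU):

```shell
# Print uptime together with the 1/5/15-minute load averages.
uptime

# On Linux the same numbers can be read directly from /proc/loadavg.
# The first three fields are the load averages; note that processes
# in uninterruptible sleep (blocked on I/O) are included as well.
cat /proc/loadavg
```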
Training new attorneys is in itself a time-consuming task, which
should be avoided if possible. Instead of paying sick leave (for how
long?) to the already trained attorneys, a "cure" for "burn-out"
should be found that can bring them back to work, thus relieving the
overload from their colleagues and saving taxpayers' money at the
same time.
I have no idea how a "cure" for burn-out can be found, but I think it
is a necessary political trick, and thus will happen. It will not
hurt voters' feelings, and it is my guess that the people who can
achieve this will work for the winners of the election.
This might be the weakest analogy in history, but I think we should
treat the Wikipedia processes with the same dignity and respect that
the Swedish voters would expect. After all, they're supposed to work
for us. The processes feel self-fulfillment when they can finish
their job on time, and get distressed when they get locked up. Any
uncalled-for delay will only result in more work piling up. That is a
flaw in the system design that has to be fixed, and we cannot go
around claiming that "some" of the workers are trying to cheat the
system. That will only lead to us losing their confidence.
--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik
Teknikringen 1e, SE-583 30 Linuxköping, Sweden
tel +46-70-7891609
http://aronsson.se/
http://elektrosmog.nu/
http://susning.nu/
The difflib code which I took from phpwiki can be used for stream
based diffs (in fact, we do that: we first use a line based diff to
figure out the lines to present, and then we do a word based diff on
those lines in order to mark the changed words red.)
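The same two-pass approach can be sketched with Python's standard difflib (this is a sketch for illustration, not the phpwiki PHP code being discussed; the function name and return format are made up):

```python
import difflib

def two_pass_diff(old_lines, new_lines):
    """First pass: line-based diff. Second pass: for replaced line
    spans, run a word-based diff and collect the changed words
    (these are the words the formatter would mark red)."""
    results = []
    sm = difflib.SequenceMatcher(None, old_lines, new_lines)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "replace":
            # Second pass: word-level diff on just the changed lines.
            old_words = " ".join(old_lines[i1:i2]).split()
            new_words = " ".join(new_lines[j1:j2]).split()
            wm = difflib.SequenceMatcher(None, old_words, new_words)
            changed = [new_words[a:b]
                       for t, _, _, a, b in wm.get_opcodes()
                       if t in ("replace", "insert")]
            results.append(("changed", [w for grp in changed for w in grp]))
        elif tag != "equal":
            # Pure insertions/deletions have no word-level counterpart.
            results.append((tag, new_lines[j1:j2] or old_lines[i1:i2]))
    return results
```

Running it on two lines that differ in one word yields only that word as changed, e.g. `two_pass_diff(["the quick brown fox"], ["the slow brown fox"])` reports `slow`.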
The code works as follows (see
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/wikipedia/phpwiki/fpw/diffli…):
The main workhorse is the Diff class. You construct it with two arrays
of strings that you want to compare. You then get a list $this->edits;
this is a list of _DiffOp's which describe how to get from the first
array of strings to the second array of strings with a minimal number
of changes. With that, you can do whatever you want. Normally, you'd
pass such a Diff object to a DiffFormatter (which you should extend),
which then looks at the edits list and produces some output.
Currently, we cut the two article versions into lines, pass those two
arrays to Diff, then to TableDiffFormatter for presentation in a table.
This TableDiffFormatter, when printing out changed lines, calls a
WordLevelDiff (which extends Diff) to color the changed words red.
It might be nice to change TableDiffFormatter to produce side-by-side
output, similar to what the sourceforge cvs viewer does.
Axel
The next major issue to tackle with the new code is diffs. What I'd
really like to find is a stream-oriented difference algorithm rather
than a line-by-line one. I'm not familiar enough with the existing
difflib to know if it could be used that way--perhaps its contributor
could point me to some documentation on it?
The new codebase running on http://www.piclab.com/newwiki/wiki.phtml
is now ready for some real testing. I intend to keep this one
"stable" as I add the remaining features--I have another experimental
setup to use during development.
It now has the basic features needed for viewing and editing articles,
including redirects, article histories, recent changes, and all the
user login functions. It has the full wikipedia database from 5/20.
So I'd appreciate it if you could take a minute to test the existing
features and tell me (1) if anything is broken, and (2) if the small
changes I've made aren't so small in your opinion. Don't bother
telling me what's missing--I know most of it still is.
After half an hour of trying to access Procellariiformes to see what someone
else added to it, I get this:
Warning: Supplied argument is not a valid MySQL result resource in
/home/wiki-newest/work-http/wikiUser.php on line 24
Warning: Supplied argument is not a valid MySQL result resource in
/home/wiki-newest/work-http/wikiUser.php on line 32
Warning: Cannot add header information - headers already sent by (output
started at /home/wiki-newest/work-http/wikiUser.php:24) in
/home/wiki-newest/work-http/wiki.phtml on line 48
Warning: Cannot add header information - headers already sent by (output
started at /home/wiki-newest/work-http/wikiUser.php:24) in
/home/wiki-newest/work-http/wiki.phtml on line 50
Warning: Supplied argument is not a valid MySQL result resource in
/home/wiki-newest/work-http/databaseFunctions.php on line 33
What happened?
phma
I've spent more time than I care to admit loading, dumping,
reloading, transforming, testing, reloading...various wikipedia
databases before settling on what I think the new format will
be, but I made a discovery along the way that might be useful:
The 05/20 database dump from wikipedia weighs in at close to
600 MB. It turns out that almost 200 MB of that is cache. In
the new system, I'll write a function specifically for doing
database dumps, but in the meantime I'd suggest that the next
time you dump a tarball, clear the cache first (and don't
forget to be careful of the timestamps when you do).
>replacing the odd linked-list mechanism of
>old page versions seems awkward. It could be replaced by simply
>using revision numbers that would be common to both the "cur" and
>"old" tables.
Yes, I think the linked-list approach should go. Especially since the
current codebase doesn't seem to use it. We saw a while ago that the
[[Pim Fortuyn]] article has a broken linked list, probably produced by
some database timeout, and yet the history page shows up fine, only
the diffs are computed incorrectly.
Every article should have a clean revision number, period. That way,
there would be an unambiguous way to refer to the n-th version of an
article.
The current mixture of oldid and version (look at the links on a
history page) is a mess.
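A minimal sketch of what shared revision numbers could look like, using Python's built-in sqlite3 (the table and column names here are hypothetical, not the actual Wikipedia schema):

```python
import sqlite3

# Hypothetical schema: both "cur" and "old" carry an explicit revision
# number, so fetching the n-th version of an article is a direct lookup
# instead of walking a linked list of old versions.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE cur (title TEXT, revision INTEGER, text TEXT);
CREATE TABLE old (title TEXT, revision INTEGER, text TEXT);
""")
con.executemany("INSERT INTO old VALUES (?, ?, ?)",
                [("Pim_Fortuyn", 1, "v1"), ("Pim_Fortuyn", 2, "v2")])
con.execute("INSERT INTO cur VALUES (?, ?, ?)", ("Pim_Fortuyn", 3, "v3"))

def get_revision(title, n):
    """Fetch revision n of an article, whether current or archived."""
    row = con.execute("SELECT text FROM cur WHERE title=? AND revision=?",
                      (title, n)).fetchone()
    if row is None:
        row = con.execute("SELECT text FROM old WHERE title=? AND revision=?",
                          (title, n)).fetchone()
    return row[0] if row else None
```

With this, a history page link only needs the title and a revision number, and a broken chain (as in the [[Pim Fortuyn]] case) cannot make diffs come out wrong.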
Axel
I have set up the test site at http://test-de.wikipedia.com/. I had to
remove an insert from newiki.sql because it was causing an error.
Of course, I lost the exact text of the error somewhere, but it
basically said that there was already an entry for Main_Page (the
entry I removed looked like it may have been a discussion forum for
the actual main page). However, it was a fairly large chunk. So,
take a look.
A diff to my modified newiki.sql is attached.
--
"Jason C. Richey" <jasonr(a)bomis.com>