In the past, I found it useful to be able to use "and not" in my search (in
particular to filter out the Rambot pages when looking for years). I recently
found that it does not work - or more precisely, that it does not work on the
English Wikipedia. It does work on the others (checked only Dutch and German).
What is going on, and why?
After getting back into wikiland catching up with wikipedia-l
was pretty easy, but catching up with the wikitech list took a
little longer. It seems you guys have had interesting times
lately (in the Chinese curse sense). Sorry I abandoned you,
but you guys do seem to have risen to the challenge.
Magnus did a great service by giving us code with features
that made Wikipedia usable and popular. When that code bogged
down to the point where the wiki became nearly unusable, there
wasn't much time to sit down and properly architect and develop
a solution, so I just reorganized the existing architecture for
better performance and hacked all the code. This got us over
the immediate crisis, but now my code is bogging down, and we
are having to remove useful features to keep performance up.
I think it's time for Phase IV. We need to sit down and design
an architecture that will allow us to grow without constantly
putting out fires, and that can become a stable base for a fast,
reliable Wikipedia in years to come. I'm now available and
equipped to help in this, but I thought I'd start out by asking
a few questions here and making a few suggestions.
* Question 1: How much time do we have?
Can we estimate how long we'll be able to limp along with
the current code, adding performance hacks and hardware to
keep us going? If it's a year, that will give us certain
opportunities and guide some choices; if it's only a month
or two, that will constrain a lot of those choices.
* Suggestion 1: The test suite.
I think the most critical piece of code to develop right now
is a comprehensive test suite. This will enable lots of
things. For example, if we have a performance question, I
can set up one set of wiki code on my test server, run the
suite to get timing data, tweak the code, then run the suite
again to get new timing. The success of the suite will tell
us if anything broke, and timing will tell us if we're on
the right track. This will be useful even during the
limp-along with current code phase. I have a three-machine
network at home, with one machine I plan to dedicate 100% to
wiki code testing, and my test server in San Antonio that we
can use. This will also allow us to safely refactor code.
I'd like to use something like Latka for the suite (see the Latka
pages under the Apache Jakarta project).
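To make this concrete, here is the sort of minimal harness I have in
mind (a sketch only; Latka itself drives tests from XML, and the URLs
and expected strings below are invented):

  <?php
  // Rough sketch of a timing/regression harness -- not Latka itself,
  // just the idea: fetch pages, check contents, report timings.
  $tests = array(
      // url => a string the page must contain to count as a pass
      'http://test.example/wiki.phtml?title=Main_Page' => 'Main Page',
      'http://test.example/wiki.phtml?title=Special:Recentchanges'
          => 'Recent changes',
  );

  $total = 0.0;
  $failures = 0;
  foreach ($tests as $url => $expect) {
      $start = microtime(true);
      $body = file_get_contents($url);
      $elapsed = microtime(true) - $start;
      $total += $elapsed;
      if ($body === false || strpos($body, $expect) === false) {
          $failures++;
          printf("FAIL %6.3fs %s\n", $elapsed, $url);
      } else {
          printf("ok   %6.3fs %s\n", $elapsed, $url);
      }
  }
  printf("%d tests, %d failures, %.3fs total\n",
         count($tests), $failures, $total);
  ?>

Running it before and after a change gives both a pass/fail regression
check and comparable timing numbers.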
* Question 2: How wedded are we to the current tools?
Apache/MySQL/PHP seems a good combo, and it probably would
be possible to scale them up further, but there certainly
are other options. Also, are we willing to take chances on
semi-production quality versions like Apache 2.X and MySQL 4.X?
I'd even like to revisit the decision of using a database
at all. After all, a good file system like ReiserFS (or to
a lesser extent, ext3) is itself a pretty well-optimized
database for storing pieces of free-form text, and there are
good tools available for text indexing, etc. Plus it's
easier to maintain and port.
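For illustration, here is roughly what a filesystem-backed store could
look like (a sketch only; the path layout and function names are
invented):

  <?php
  // Sketch: article storage on a plain filesystem instead of MySQL.
  // Each article gets a directory; each revision is a timestamped
  // file, so "history" is just a directory listing.
  function article_dir($title) {
      $safe = urlencode($title);
      $hash = substr(md5($safe), 0, 2); // spread dirs out for the fs
      return "/var/wiki/articles/$hash/$safe";
  }

  function save_revision($title, $text) {
      $dir = article_dir($title);
      if (!is_dir($dir)) mkdir($dir, 0755, true);
      // one file per revision, named by UTC timestamp
      file_put_contents($dir . '/' . gmdate('YmdHis') . '.txt', $text);
  }

  function current_revision($title) {
      $files = glob(article_dir($title) . '/*.txt');
      if (!$files) return false;
      sort($files); // YmdHis names sort chronologically
      return file_get_contents(end($files));
  }
  ?>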
* Suggestion 2: Use the current code for testing features.
In re-architecting the codebase, we will almost certainly
come to points where we think a minor feature change will
make a big performance difference that won't hurt usability,
or just features that we want to implement anyway. For
example, we could probably make it easier to cache page
requests if we made most of the article content HTML not
dependent on skin by tagging elements well and using CSS
appropriately. Also, we probably want to eventually render
valid XHTML. I propose that while we are building the
phase IV code, we add little features like this to the
existing code to gauge things like user reactions and performance
impact.
Other suggestions/questions/answers humbly requested
(including "Are you nuts? Let's stick with Phase III!" if
you have that opinion).
Lee Daniel Crocker <lee(a)piclab.com> <http://www.piclab.com/lee/>
"All inventions or works of authorship original to me, herein and past,
are placed irrevocably in the public domain, and may be used or modified
for any purpose, without permission, attribution, or notification."--LDC
A wikipedian has recently been trying to find a good way to cite
particular revisions of articles in the bibliography for a paper.
Currently we can give URLs for the _current_ version of an article
(current as of whenever it is visited), or for _previous_ versions (as
of when the citation was made) via their oldid numbers.
There are two main problems with this (aside from the ugliness of the
URLs):
* There is no way to reference the current version _as of the time of
citation_. Since that revision isn't in the old table, it has no oldid.
* oldid values sometimes can change, as when an article is deleted and
subsequently restored (done also when recombining histories of articles
that have been broken by crude renaming). Possible rearrangements of the
database (such as combining all languages into a single table) could
require reassigning oldids en masse. They are *not* reliable long-term
identifiers.
One possible solution would be to provide a way of citing articles as
of a particular timestamp, for instance via a version parameter in the
URL such as version=20030224161134, which would pull up either a cur
or old version with that timestamp. (It could also be prettified:
version=2003-02-24-16:11:34 etc.)
Pros:
* consistent, no fuss, no worries about rearrangement of db structure
* citation URL can be provided in a nice handy link at the bottom of
every page

Cons:
* timestamp has 1-second resolution. Generally this is going to be
unique (at least per article), but it may occasionally not be,
particularly in cases of recombined histories. Some articles had
multiple revisions' timestamps set to the same time due to bugs in the
rename code and other db tweaks in early '02.
* for this reason it's not suitable as the mainline URL for drawing up
old history revisions via the history list, so people have to remember
to find and use the citation URL separately.
Alternatively, we could supply _both_ timestamp and oldid in the URL,
and let timestamp have priority if an exact match on both is not found.
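For concreteness, the lookup could go something like this (a sketch
only; the function name is invented, and the column names follow the
cur/old tables as I understand them):

  <?php
  // Sketch: fetch article text as of an exact timestamp, trying the
  // current revision first and falling back to the old table.
  function fetch_by_timestamp($db, $title, $ts) {
      $title = mysql_real_escape_string($title, $db);
      $ts = mysql_real_escape_string($ts, $db); // e.g. '20030224161134'

      $res = mysql_query("SELECT cur_text FROM cur
          WHERE cur_title='$title' AND cur_timestamp='$ts'", $db);
      if ($row = mysql_fetch_row($res)) return $row[0];

      // LIMIT 1 papers over the rare duplicate-timestamp cases
      // mentioned above.
      $res = mysql_query("SELECT old_text FROM old
          WHERE old_title='$title' AND old_timestamp='$ts'
          LIMIT 1", $db);
      if ($row = mysql_fetch_row($res)) return $row[0];

      return false; // no revision with that exact timestamp
  }
  ?>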
-- brion vibber (brion @ pobox.com)
Observing the Wikipedia moving up to a very high level of load today
(load average 10.39, 12.58, 13.45), it occurs to me that a "load
shedding" function would be useful, where requests may be bounced with
an HTTP 503 "Service Unavailable" error. This would have the effect of
dropping load until the server returns to normal load levels,
preventing the backlog of requests from driving the server into the
ground.
A human-readable text should be added in the user's own language,
saying: "Wikipedia is experiencing very high load at the moment. We
are taking measures to control the load on the system. Please try your
request again in a few minutes, when load should be lower."
Well-behaved spiders should discard any pages sent with 503 errors.
To prevent a sudden turn-on of error 503 for all users, with the
possibility of load oscillation, we could make the 503 errors
progressively more probable as the load increases beyond a certain
point. This could also be used to ensure that logged-in users and
"important" transactions such as page edits maintain a higher QoS
during these periods, until load reaches the point at which even they
have to be refused.
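For instance, something along these lines (a sketch only; the
thresholds and weights are pulled out of the air):

  <?php
  // Sketch: probabilistic load shedding, ramping up smoothly to
  // avoid oscillation. Called before any real work is done.
  define('LOAD_SOFT', 10.0); // start shedding here
  define('LOAD_HARD', 20.0); // shed almost everything here

  function maybe_shed_load($isLoggedIn, $isEdit) {
      $fields = explode(' ', file_get_contents('/proc/loadavg'));
      $load = (float)$fields[0]; // 1-minute load average
      if ($load <= LOAD_SOFT) return;

      // probability ramps from 0 at the soft limit to 1 at the
      // hard limit -- no sudden all-or-nothing cutover
      $p = ($load - LOAD_SOFT) / (LOAD_HARD - LOAD_SOFT);
      if ($isLoggedIn) $p *= 0.5;  // favor logged-in users
      if ($isEdit)     $p *= 0.25; // favor edits even more

      if (mt_rand() / mt_getrandmax() < $p) {
          header('HTTP/1.0 503 Service Unavailable');
          header('Retry-After: 120');
          print "Wikipedia is experiencing very high load at the " .
                "moment. Please try your request again in a few " .
                "minutes, when load should be lower.";
          exit;
      }
  }
  ?>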
Brion Vibber wrote:
>On Sun, 2003-02-16 at 20:11, s wrote:
>>Would it make sense, as I've been experiencing bandwidth limitations, on
>>occasion, (assuming the problem may be general) to limit, by some ratio, the
>>server request speed to anon users, and thereby allowing logged in users
>>some degree of greater access?
>I'm afraid there's not much we can do about your bandwidth limitations;
>that's between you and your ISP.
>(No, actually there is -- we could compress sent pages. I'll consider
>trying this in the future, once apache/php are reinstalled with gzip
>support built in. However this will use some cpu power, and I don't know
>how this will affect server speed at this point.)
>-- brion vibber (brion @ pobox.com)
Both CPUs are typically running about 75% busy at the moment (50%
user, 25% system), and quite often peak intermittently to 100%.
However, system performance is nice and smooth nearly all the time, so
the worst database bottlenecks appear to have been ironed out.
When the system moves to separate machines for the database and
webserver, the load should fall dramatically: that might be the right
time to enable gzipping.
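For what it's worth, once PHP is rebuilt with zlib, turning compression
on should be a small change (a sketch; untested on our setup):

  <?php
  // Two ways to gzip output once PHP has zlib support built in.
  // (1) Globally, in php.ini -- compresses everything PHP emits:
  //       zlib.output_compression = On
  // (2) Per script, before any output is sent:
  ob_start('ob_gzhandler'); // honors the client's Accept-Encoding
  print "This page is compressed if the client supports it.\n";
  ?>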
Don Marti, editor of Linux Journal, contacted me about doing an
article on wikipedia for LJ. I wrote back with a couple of
proposals, and they chose the one described here:
> Another article might be a technical article about how we're handling
> our growth using all open source software, and the specific software
> challenges of an open collaborative writing project. We could use
> more developers, and obviously an interesting article about our
> software might bring us some wonderful new talent.
> I'm the guy to write the first article, but I'm not our best technical
> guy, so I'd want to co-author the second article with one of our
> developers, which might take a bit longer.
Would someone like to volunteer to co-author the second article?
It probably makes the most sense to have someone do this who has
really dirty hands from the code. My guess is that an LJ publication
is a nice resume enhancer.
Article names on special topics often contain the topic in parentheses
for disambiguation. The "pipe trick" automatically hides the stuff in
parentheses, but if I have a lot of links between pages on the same
topic I still have to type out the parenthesized part :-(
For instance in an article about "tree (graph theory)":
A '''tree''' is a [[graph (graph theory)|]]
that is [[connected (graph theory)|]] and has
no simple [[cycle (graph theory)|]]s.
I'd like a little hack that automatically inserts the same topic that
is used in the article title. This could be done via empty
parentheses, because no article has "()" in its title. In the example
above:
A '''tree''' is a [[graph()|]]
that is [[connected()|]] and has
no simple [[cycle()|]]s.
For multiple articles on one topic this would be really helpful, and
it would prevent accidentally adding links to disambiguation pages.
What do you think?
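In case it helps, here is roughly how the expansion might be done (a
sketch only; I haven't looked at how the real pipe trick is
implemented, and the function name is made up):

  <?php
  // Sketch: expand [[link()|]] using the parenthesized part of the
  // current article's title, then let the normal pipe trick hide it.
  function expand_empty_parens($wikitext, $articleTitle) {
      // pull "(graph theory)" out of "tree (graph theory)"
      if (!preg_match('/(\([^)]+\))\s*$/', $articleTitle, $m)) {
          return $wikitext; // this page has no disambiguator
      }
      // [[graph()|]] -> [[graph (graph theory)|]]
      return preg_replace('/\[\[([^\[\]|]+)\(\)\|\]\]/',
                          '[[$1 ' . $m[1] . '|]]', $wikitext);
  }
  ?>

So in an article titled "tree (graph theory)", [[graph()|]] would
become [[graph (graph theory)|]], which renders as "graph".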
thanks a lot,