I think we should get a decent stress testing setup going, so that we
don't have to try out all improvements on the live site.
I assume we are all running Unix/Apache/PHP/MySQL. Siege at
http://joedog.org/siege/ looks like a good choice for a free stress
testing utility. It can simulate n users hammering your site
simultaneously, even if you aren't connected to any network, and
reports response times and other stats.
We need a realistically sized database dump to start playing though.
We could then come up with a nice little Siege script that loads
RecentChanges, searches for and downloads a couple of articles, and
edits a couple of others (maybe modeled on real stats from Jimbo),
then share the script and use it as a benchmark to try out patches to
the PHP scripts and the database schema.
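For concreteness, a first cut at such a script might be a urls.txt along
these lines (every URL here, and the numbers below, are placeholder
assumptions to illustrate the idea, not the real site layout):

```
# urls.txt -- hypothetical page mix for Siege; all URLs are assumptions
http://localhost/wiki/Special:RecentChanges
http://localhost/wiki/Main_Page
http://localhost/wiki/Special:Search?search=stress+testing
```

We would then run something like "siege -c 25 -t 2M -f urls.txt"
(-c concurrent simulated users, -t run time, -f the URL file); at the end
Siege prints availability, average response time, and transaction rate.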
Axel
For proof of principle (and just for fun, of course!) I took a few hours to
write a simple single-pass parser for wikipedia pages in C++, using sql++
(the "official" MySQL API).
It supports external and internal links (with "does topic exist" check),
special line beginnings (:, *, #, and leading space), wiki-style bold and
italics, ---- things etc.
No == headings == (no beheadings either ;) and no namespaces, but those
are simple to implement.
Called from the shell (as I haven't connected it to Apache yet), it
renders the Main Page in 0:00.04 (without caching ;)
Should I continue to work on this, and eventually add it to the CVS, or is
it just a waste of time?
I will, of course, continue to work on the PHP script, no doubt about that!
Magnus
I think we should add the GPL license to the CVS tree and add a
copyright notice to the top of the scripts as soon as possible (just
take any GNU program as a template). Right now, the number of
contributors is still small and we can still make this change. Later
it may get ugly.
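Following the usual GNU template, the header for each PHP script would
look roughly like this (year and author line to be filled in per file):

```php
<?php
# Copyright (C) <year> <name of author>
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
```

The standard notice also has a closing "You should have received a copy
of the GNU General Public License..." paragraph, copied from whichever
GNU program we use as the template.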
Axel
> SELECT * FROM cur WHERE (cur_title NOT LIKE "S") AND (cur_text LIKE "S" )
> ORDER BY cur_title
>seems to be a big offender. That's the special_dosearch query, I think.
Yes. That scans linearly through the whole database looking for the
string. That's a killer. It's like our old UseMod search before Jimbo
wrote an indexed one. A FULLTEXT index, together with MATCH instead
of LIKE, should give a major improvement.
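A sketch of what that would look like, assuming the cur_title/cur_text
column names from wikipedia.sql (FULLTEXT needs a MyISAM table and a
reasonably recent MySQL 3.23; the index name and search words below are
made up):

```sql
-- one-time schema change; building the index takes a while on a big table
ALTER TABLE cur ADD FULLTEXT ft_cur (cur_title, cur_text);

-- the search then becomes an index lookup with relevance ranking:
SELECT cur_title
FROM cur
WHERE MATCH (cur_title, cur_text) AGAINST ('search words here');
```

Note that the column list in MATCH() has to match the column list of a
FULLTEXT index exactly, and MySQL's built-in stopword handling silently
drops very short and very common words from the query.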
Axel
When I check everything out of the CVS, I do it into a directory that has
nothing to do with the real site.
Then, I copy the files over to the proper location.
It seems that wikiText.php AND wikiTextEn.php are always different, and I have
to edit them... so really, I shouldn't be copying them unless there's a good reason,
right?
Here's my exact question:
wikiTextEn.php warns me:
# ATTENTION:
# To fit your local settings, PLEASE edit wikiText.php ONLY!
# Change settings here ONLY if they're to become global in all wikipedias!
But that seems a bit "opposite" to me... doesn't wikiTextEn mean "wikiText English"?
If so, then changes here should ONLY affect the English wikipedia, not "global in
all wikipedias"?
Also, whichever way it is supposed to be, I'm sure I should only have to edit one file.
But I have to edit two.
First, $wikiCurrentServer returns http://wikipedia.com in the default configuration, but
we prefer http://www.wikipedia.com/ (see line 12 of wikiTextEn.php, I always edit to
hardcode this.)
And on the next line, $wikiSQLServer is different locally: the database is named
"wiki" instead of "wikipedia".
So, should I just add those two things to wikiText.php? And that will override the
stuff in wikiTextEn.php?
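If the warning means what it says (wikiText.php is the per-site settings
file and its values win over the wikiTextEn.php defaults, presumably
because it is read last), then the local tweaks would just be two lines
there. This is a sketch based only on the two variables quoted above:

```php
<?php
# in wikiText.php -- local overrides for this installation
$wikiCurrentServer = "http://www.wikipedia.com";
$wikiSQLServer     = "wiki";   # local database is "wiki", not "wikipedia"
```

Whether this actually overrides the English defaults depends on the
include order of the two files, which the warning doesn't spell out.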
O.k., I've enabled this and glanced through it a bit. It's pretty
interesting, even though I only have a small handful of values in it.
I think what I'll do is leave it on overnight, and then gzip it for
everyone to download tomorrow. (Or perhaps I'll just gzip the output
of mysqldumpslow, which summarizes it a bit. Or perhaps I'll slap together
a quick perl script to boil it down to the essentials for our context.)
SELECT * FROM cur WHERE (cur_title NOT LIKE "S") AND (cur_text LIKE "S" )
ORDER BY cur_title
seems to be a big offender. That's the special_dosearch query, I think.
Also, here's a fun looking query:
SELECT * FROM cur WHERE (cur_title NOT LIKE "S") AND ( (cur_title LIKE "S" OR
cur_title LIKE "S") AND (cur_title LIKE "S" OR cur_title LIKE "S") AND
(cur_title LIKE "S" OR cur_title LIKE "S") AND (cur_title LIKE "S" OR
cur_title LIKE "S") AND (cur_title LIKE "S" OR cur_title LIKE "S") )
ORDER BY cur_title
I'm no database expert, but that doesn't look so sensible to me. :-)
I'm not sure if this will be fixed with the next bugfix, but I edited the
Main Page (twice) to remove the counter, but I still get the old cached page
("23.338 articles").
Browsing around on http://www.mysql.com/documentation/ and comparing
with the wikipedia.sql in cvs, I have the following suggestions:
1) To speed up searches, we should use a FULLTEXT index on title and
text, and then use the match operator. That should also yield more
relevant results.
(http://www.mysql.com/doc/F/u/Fulltext_Search.html)
2) In special_recentchanges.php, we select with "WHERE
cur_timestamp>$mindate", but cur_timestamp is not indexed. This
means that MySQL linearly scans the whole cur table every time
somebody views RecentChanges.
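The fix would be a one-line schema change (the index name is my own
invention, and the timestamp literal below is made up):

```sql
ALTER TABLE cur ADD INDEX idx_cur_timestamp (cur_timestamp);

-- sanity check: EXPLAIN should now report a "range" scan using the
-- index rather than a full table scan:
EXPLAIN SELECT * FROM cur WHERE cur_timestamp > '20011201000000';
```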
3) Assuming that PHP runs as an Apache module, we should use
persistent database connections. That way, we won't repeatedly send
over the username and password; one connection is reused by Apache
even after the PHP script finishes.
(http://www.phpbuilder.com/manual/features.persistent-connections.php
and http://www.phpbuilder.com/manual/function.mysql-pconnect.php)
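The code change is minimal; a sketch with made-up credential variables
($wikiSQLServer is from wikiTextEn.php, the other names are assumptions):

```php
<?php
# mysql_pconnect() looks for an already-open connection with the same
# host/user/password in this Apache child and reuses it; otherwise it
# opens a new one that stays alive after the script ends.
$conn = mysql_pconnect( $wikiSQLServer, $wikiUser, $wikiPassword )
    or die( "Could not connect to database" );
mysql_select_db( "wikipedia", $conn );
```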
Axel
Jan Hidders wrote:
>From: "Brion L. VIBBER" <brion@pobox.com>
>
>>Right. Hmm, can you use TO_DAYS(cur_timestamp) or some such? Or is that
>>just going to cause problems?
>>
>
>Unfortunately MySQL only allows column names in the GROUP BY clause.
>
D'oh! Well then, two columns it is.
-- brion vibber (brion @ pobox.com)