I think we should add the GPL license to the cvs tree and add a
copyright notice to the top of the scripts as soon as possible (just
take any GNU program as template). Right now, the number of
contributors is still small and we can still make this change. Later
it may get ugly.
Axel
> SELECT * FROM cur WHERE (cur_title NOT LIKE "S") AND (cur_text LIKE "S" )
> ORDER BY cur_title
>seems to be a big offender. That's the special_dosearch query, I think.
Yes. That searches linearly through the whole database to search for a
string. That's a killer. It's like our old UseMod search before Jimbo wrote
an indexed one. A FULLTEXT index, together with MATCH instead
of LIKE, should give a major improvement.
Axel
When I check everything out of the CVS, I do it into a directory that has
nothing to do with the real site.
Then, I copy the files over to the proper location.
It seems that wikiText.php AND wikiTextEn.php are always different, and I have
to edit them... so really, I shouldn't be copying them unless there's a good reason,
right?
Here's my exact question:
wikiTextEn.php warns me:
# ATTENTION:
# To fit your local settings, PLEASE edit wikiText.php ONLY!
# Change settings here ONLY if they're to become global in all wikipedias!
But that seems a bit "opposite" to me... doesn't wikiTextEn mean "wikiText English"?
If so, then changes here should ONLY affect the English wikipedia, not "global in
all wikipedias"?
Also, whichever way it is supposed to be, I'm sure I should only have to edit one file.
But I have to edit two.
First, $wikiCurrentServer returns http://wikipedia.com in the default configuration, but
we prefer http://www.wikipedia.com/ (see line 12 of wikiTextEn.php, I always edit to
hardcode this.)
And on the next line, $wikiSQLServer is different locallly: the database is named
"wiki" instead of "wikipedia".
So, should I just add those two things to wikiText.php? And that will override the
stuff in wikiTextEn.php?
O.k., I've enabled this and glanced through it a bit. It's pretty
interesting, even though I only have a small handful of values in it.
I think what I'll do is leave it on overnight, and then gzip it for
everyone to download tomorrow. (Or perhaps I'll just gzip the output
of mysqldumpslow, which summarizes it a bit. Or perhaps I'll slap together
a quick perl script to boil it down to the essentials for our context.)
SELECT * FROM cur WHERE (cur_title NOT LIKE "S") AND (cur_text LIKE "S" )
ORDER BY cur_title
seems to be a big offender. That's the special_dosearch query, I think.
Also, here's a fun looking query:
SELECT * FROM cur WHERE (cur_title NOT LIKE "S") AND ( (cur_title LIKE "S" OR
cur_title LIKE "S") AND (cur_title LIKE "S" OR cur_title LIKE "S") AND (cur_titl
e LIKE "S" OR cur_title LIKE "S") AND (cur_title LIKE "S" OR cur_title LIKE "S")
AND (cur_title LIKE "S" OR cur_title LIKE "S") ) ORDER BY cur_title
I'm no database expert, but that doesn't look so sensible to me. :-)
I'm not sure if this will be fixed with the next bugfix, but I edited the
Main Page (twice) to remove the counter, but I still get the old cached page
("23.338 articles").
Browsing around on http://www.mysql.com/documentation/ and comparing
with the wikipedia.sql in cvs, I have the following suggestions:
1) To speed up searches, we should use a FULLTEXT index on title and
text, and then use the match operator. That should also yield more
relevant results.
(http://www.mysql.com/doc/F/u/Fulltext_Search.html)
2) In special_recentchanges.php, we select with "WHERE
cur_timestamp>$mindate", but cur_timestamp is not indexed. This
means that mysql linearly searches through the whole cur database,
everytime somebody views RecentChanges.
3) Assuming that php runs as an apache module, we should use
persistent database connections. That way, we won't repeatedly send
over the username and password; one connection is reused by apache
even after the php script dies.
(http://www.phpbuilder.com/manual/features.persistent-connections.php
and http://www.phpbuilder.com/manual/function.mysql-pconnect.php)
Axel
Jan Hidders wrote:
>From: "Brion L. VIBBER" <brion(a)pobox.com>
>
>>Right. Hmm, can you use TO_DAYS(cur_timestamp) or some such? Or is that
>>just going to cause problems?
>>
>
>Unfortunately MySQL only allows column names in the GROUP BY clause.
>
D'oh! Well then, two columns it is.
-- brion vibber (brion @ pobox.com)
Jan's change to the database has my blessing, as he is a (the!) database
expert. Anyway, it isn't *my* baby anymore, it is *ours*;) Yeah, the
blessings of modern cloning technology...
As for the cache flushing mechanism Brion proposed in a private mail, the
easiest way would be to link it to the cur_counter field; if "cur_counter
MOD 20 == 0", the cache could be flushed, best by setting the local $cache
variable to "" at the end of the load routine.
20 is just a wild guess...
Finally, thanks to Jimbo for setting up this mailing list.
Gee, I never caused a mailing list before ;)
Magnus
(moving thread to wikitech-l)
Jan Hidders wrote:
>From: "Brion L. VIBBER" <brion(a)pobox.com>
>
>>Jan Hidders wrote:
>>
>>>Dear fellow programmers,
>>>
>>>I saw that the SQL code for the Recent Changes page is rather inefficient
>>>and causes a lot of database access, so I decided to improve this.
>>>
>However,
>
>>>I can only do this properly if the timestamp field in the tables is split
>>>
>in
>
>>>a day and a time field.
>>>
>>Can I ask how exactly that would help? I'm not much of a database guru,
>>so the answer isn't obvious to me and I'm a bit curious.
>>
>
>It allows me to do a GROUP BY on the day. That way I can take a left outer
>join between the cur table and the old table and group on the combination of
>cur_day and cur_title. This allows me to get all the information I need for
>the page in one SQL statement.
>
Right. Hmm, can you use TO_DAYS(cur_timestamp) or some such? Or is that
just going to cause problems?
If it's more efficient to use a split timestamp, I can't come up with
any objection. But, it's Magnus' baby. :)
-- brion vibber (brion @ pobox.com)