Now that my own proposal to re-write the entire code in Perl again was
met with great resistance, I propose to at least create an entirely new
database structure, and then adapt the current code to it. I have
studied the current database structure and see the following rather
severe problems with it:
* BLOBs that store article text are combined in the same table as
meta-data (e.g. date, username of a change, change summary, minor flag,
etc.). This is bad because variable-length fields like BLOBs negatively
affect the performance of reading the table. Pages like the watchlist
should not have to bother with variable-length data such as article text
and would run a lot faster if they could get their data entirely from
fixed-length rows.
* Currently, all user properties and preferences, as well as article
properties, are columns in a table. Although this is not a problem in
terms of DB-reading performance, introducing a new user preference or
other enhancements involve adding a new column, and adding a new column
becomes extremely database-intensive as the database grows. LiveJournal
uses a very clever system that will easily remedy this. One table,
'userproplist', stores the possible user properties (userprops), and
another table, 'userprop', stores what user has what userprop with what
value. This way all that is needed for adding a new userprop is adding a
single row to the 'userproplist' table. The same would analogously apply
to articleprops. Once we have that, the user table will hopefully remain
very small (in terms of number of columns), so looking up a username (to
name just an example) would be ridiculously efficient.
* BLOBs that fulfill the same function (article text) are scattered
across two tables (cur and old). This is bad because it means
variable-length text has to be moved across tables on every edit. Very
slow. Better to give every version/revision of an article (i.e. each
item of article text) a permanent ID and use those IDs in the 'cur' and
'old' tables instead. Then have one large table, 'articletext' perhaps,
mapping the IDs to their actual BLOBs. This eliminates the need to ever
delete a BLOB (except perhaps when actually deleting an article with all
its history, which is rare enough). Additionally, there isn't really a
need for separate 'cur' and 'old' tables, especially when MemCacheD can
take care of most recent versions.
* You are using a 'recentchanges' table which, I presume, gets also
updated with every edit. This, I assume is the idea behind it, allows
the 'Recent Changes' page to quickly grab the most recent changes
without having to find them elsewhere in the DB. Contrary to intuition,
this is a bad idea. It is always a better idea to optimise for less DB
writes even if it means a few more DB reads, because writes are so much
slower. (I am so sure of this because LiveJournal has made these
experiences with their "Friends Page": grabbing entries from all the
friends' journals all over the place in the DB is faster than updating a
"hint table" with every newly created entry.)
In addition to these existing problems, of course there are things the
database cannot currently handle, but were planned. While we're changing
the DB, we could also add the following functionality to it:
* Store translated website text, so translators don't have to dig
through PHP code and submit a file to the mailing list.
* A global table for bidirectional inter-wiki links. People should not
have to add the same link to so many articles. In fact, taking this a
step further, people should not even have to enter text like
'[[fr:démocratie]]' into the article text when it's not part of the
article text. There should be drop-downs underneath the big textbox
listing languages, and little text boxes next to them for the target
article name.
Are you all still convinced that adapting the current code to all these
radical changes is easier than rewriting it all from scratch? :-)
Anyway. I'm tired.
I'm going to bed.
Good night.
Timwi