[Wikitech-l] Database design

12 Jul 2003


      Now that my own proposal to re-write the entire code in Perl again was 
met with great resistance, I propose to at least create an entirely new 
database structure, and then adapt the current code to it. I have 
studied the current database structure and see the following rather 
severe problems with it:
* BLOBs that store article text are combined in the same table as 
meta-data (e.g. date, username of a change, change summary, minor flag, 
etc.). This is bad because variable-length fields like BLOBs negatively 
affect the performance of reading the table. Pages like the watchlist 
should not have to bother with variable-length data such as article text 
and would run a lot faster if they could get their data entirely from 
fixed-length rows.
* Currently, all user properties and preferences, as well as article 
properties, are columns in a table. Although this is not a problem in 
terms of DB-reading performance, introducing a new user preference or 
other enhancements involve adding a new column, and adding a new column 
becomes extremely database-intensive as the database grows. LiveJournal 
uses a very clever system that will easily remedy this. One table, 
'userproplist', stores the possible user properties (userprops), and 
another table, 'userprop', stores what user has what userprop with what 
value. This way all that is needed for adding a new userprop is adding a 
single row to the 'userproplist' table. The same would analogously apply 
to articleprops. Once we have that, the user table will hopefully remain 
very small (in terms of number of columns), so looking up a username (to 
name just an example) would be ridiculously efficient.
* BLOBs that fulfill the same function (article text) are scattered 
across two tables (cur and old). This is bad because it means 
variable-length text has to be moved across tables on every edit. Very 
slow. Better to give every version/revision of an article (i.e. each 
item of article text) a permanent ID and use those IDs in the 'cur' and 
'old' tables instead. Then have one large table, 'articletext' perhaps, 
mapping the IDs to their actual BLOBs. This eliminates the need to ever 
delete a BLOB (except perhaps when actually deleting an article with all 
its history, which is rare enough). Additionally, there isn't really a 
need for separate 'cur' and 'old' tables, especially when MemCacheD can 
take care of most recent versions.
* You are using a 'recentchanges' table which, I presume, gets also 
updated with every edit. This, I assume is the idea behind it, allows 
the 'Recent Changes' page to quickly grab the most recent changes 
without having to find them elsewhere in the DB. Contrary to intuition, 
this is a bad idea. It is always a better idea to optimise for less DB 
writes even if it means a few more DB reads, because writes are so much 
slower. (I am so sure of this because LiveJournal has made these 
experiences with their "Friends Page": grabbing entries from all the 
friends' journals all over the place in the DB is faster than updating a 
"hint table" with every newly created entry.)
In addition to these existing problems, of course there are things the 
database cannot currently handle, but were planned. While we're changing 
the DB, we could also add the following functionality to it:
* Store translated website text, so translators don't have to dig 
through PHP code and submit a file to the mailing list.
* A global table for bidirectional inter-wiki links. People should not 
have to add the same link to so many articles. In fact, taking this a 
step further, people should not even have to enter text like 
'[[fr:démocratie]]' into the article text when it's not part of the 
article text. There should be drop-downs underneath the big textbox 
listing languages, and little text boxes next to them for the target 
article name.
Are you all still convinced that adapting the current code to all these 
radical changes is easier than rewriting it all from scratch? :-)
Anyway. I'm tired.
I'm going to bed.
Good night.
Timwi

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

2003

2002

[Wikitech-l] Database design