Ævar Arnfjörð Bjarmason wrote:
> like Reiserfs would probably speed it up, and also speed up access
> to the rest of the metadata in the database, which would become much
> smaller.
>
> How about something like a version control system, Subversion for
> example? I don't know how it would do speed-wise for something like
> this, but with that you'd get:
>
> * Free version control (as in, we wouldn't have to write our own
>   custom system to keep track of versions)
> * Versions stored as diffs
>
> It does have some drawbacks, of course; for one, it would be a central
> repository, so it would suffer from the current problems of having a
> central database for wikitext. But there are distributed version
> control systems available (arch, monotone); something like this
> should at least be considered and tested too.
Using a VCS would actually be a step backwards: such systems are not
designed for enormously high rates of decentralized updates.
Traditional version control systems are all about centralization, and
have a model in which "files" "change" all the time, whilst attempting
to create a single global consistent state. This generates huge amounts
of admin and locking effort, and is very hard to scale. Wikipedia does
not need a single consistent global state; it only needs partial
consistency.
The fact that Wikipedia versions are immutable (and, as of 1.5, will all
have unique version IDs) completely eliminates the need for locking once
a version is written, and makes it easy to use a filesystem instead of a
database. ACID properties are then trivial except for the current "tip"
of the system state, for which metadata and locking can be handled by
the DB. For this reason, there's no need for a content management system
for the immutable content itself; the only things you have to track
versions of are the directories of pointers to versions, and the same
principle can be applied recursively if needed.
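The scheme above can be sketched briefly. This is a minimal illustration, not MediaWiki code: the store root, function names, and the write-to-temp-then-rename publish step are all assumptions made for the example. The point it demonstrates is that once a version file exists it is never modified, so reads need no locking at all.

```python
import os
import tempfile

# Hypothetical root for the immutable version store (illustration only).
STORE = tempfile.mkdtemp()

def write_version(version_id: str, text: str) -> str:
    """Write an immutable version to its own file.

    Once written, the file is never modified, so readers need no locks.
    """
    path = os.path.join(STORE, version_id)
    if os.path.exists(path):
        return path  # immutable: an existing version is never rewritten
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        f.write(text)
    os.rename(tmp, path)  # atomic publish: readers only ever see whole versions
    return path

def read_version(version_id: str) -> str:
    """Read an immutable version; no locking required."""
    with open(os.path.join(STORE, version_id), encoding="utf-8") as f:
        return f.read()
```

Only the mutable "tip" metadata (which version is current) would still live in the DB, exactly as described above.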
This is exactly the direction Linus Torvalds is moving in with git.
("Git" is a very fast and lightweight content-addressable version
management tool written by Linus Torvalds as an alternative to
conventional version control tools; the name is not an insult.)
In fact, the git approach of using SHA-1 hashes as "true" content IDs
has much to recommend it, as then there is no need even to use locking
to issue new version IDs, since they can be generated just by hashing
the source. This makes the git "filesystem" scalable over a huge number
of concurrent developers without any (realistic) chance of a clash; and
if two files do clash, you just add spaces to the end of the new file
until the clash goes away. Perhaps we should be looking at going in the
direction of git at the same time as we move the Wikitext out of the DB?
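A minimal sketch of the hash-as-ID idea, including the tongue-in-cheek padding workaround for a clash, might look like this (the function names and the `existing` lookup table are assumptions made for the example):

```python
import hashlib

def content_id(text: str) -> str:
    """Derive the version ID from the content itself: no central lock
    or ID counter is needed, so any number of writers can work at once."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def unique_id(text: str, existing: dict):
    """Resolve the (astronomically unlikely) case where the hash collides
    with *different* content, by padding with trailing spaces until the
    hash changes. `existing` maps known IDs to their stored text."""
    while True:
        cid = content_id(text)
        if cid not in existing or existing[cid] == text:
            return cid, text
        text += " "
```

Identical content always hashes to the same ID, which also gives free deduplication of identical revisions.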
We can still use the git approach while also doing the
block-compression trick, which eliminates the need for diffs while
maintaining robustness. Remember, diffs are cheap to manufacture
from decompressed text, whereas following diff-chains to recreate text
is expensive.
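To make the last point concrete, here is a small sketch of storing full compressed texts and manufacturing a diff only on demand. It compresses each version independently with zlib for brevity (the block-compression trick described above would group adjacent versions together; that detail is elided here), and the function names are assumptions made for the example:

```python
import difflib
import zlib

def store(text: str) -> bytes:
    """Store a version as compressed full text, not as a link in a diff chain."""
    return zlib.compress(text.encode("utf-8"))

def load(blob: bytes) -> str:
    """Reconstructing a version is a single decompress: no chain to follow."""
    return zlib.decompress(blob).decode("utf-8")

def diff(old_blob: bytes, new_blob: bytes) -> str:
    """Manufacture a diff on demand from the two full texts."""
    return "".join(difflib.unified_diff(
        load(old_blob).splitlines(keepends=True),
        load(new_blob).splitlines(keepends=True),
        fromfile="old", tofile="new"))
```

Reading any revision costs one decompression, regardless of how many edits came before it; the diff is only ever computed when someone actually asks to see one.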
-- Neil