Magnus Manske wrote:
AFAIK, it is still the same mechanism as in Phase II, but Brion is working on on-the-fly compression for the old texts, as the old table gets quite large.
Magnus
The big win is not in compressing each individual version (say, version 6 of the article "London"), but in compressing the entire sequence of versions for each article, since so much is common between versions 6, 7, and 8 of the same article.
This optimization is what RCS does by storing the current text in full and only the diffs needed to reproduce each next earlier version. Now, RCS has its roots in the early 1980s and does this (1) in a plain text file, and (2) in one long sequence of diffs, which makes it very slow to extract version 1 of a text if the current version is 2314. I think that some of the more modern version control systems (?? aegis, arch, bitkeeper, darcs, perforce, subversion, ??) play around with hierarchical schemes where every Nth version is stored in full.
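To make that concrete, here is a rough sketch in Python of such a keyframe scheme: every Nth revision is stored in full, the ones in between only as a diff against their predecessor, so restoring an old version replays at most N-1 diffs instead of walking the whole chain back to revision 1. The names (ArticleHistory, SNAPSHOT_EVERY, make_delta) are mine, invented for illustration, not taken from any of the systems above.

    import difflib

    SNAPSHOT_EVERY = 20  # assumed interval between full copies ("keyframes")

    def make_delta(old, new):
        """Encode `new` relative to `old`, keeping only the changed lines."""
        ops = []
        for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, old, new).get_opcodes():
            if tag == "equal":
                ops.append(("copy", i1, i2))       # reuse old[i1:i2] unchanged
            else:                                  # replace / insert / delete
                ops.append(("lines", new[j1:j2]))  # store only the new lines
        return ops

    def apply_delta(old, delta):
        """Rebuild the newer text from the older text plus its delta."""
        out = []
        for op, *args in delta:
            if op == "copy":
                out.extend(old[args[0]:args[1]])
            else:
                out.extend(args[0])
        return out

    class ArticleHistory:
        def __init__(self):
            self.entries = []  # per revision: ("full", lines) or ("delta", delta)

        def add_revision(self, lines):
            lines = list(lines)
            if len(self.entries) % SNAPSHOT_EVERY == 0:
                self.entries.append(("full", lines))
            else:
                prev = self.get_revision(len(self.entries) - 1)
                self.entries.append(("delta", make_delta(prev, lines)))

        def get_revision(self, rev):
            # walk back to the nearest full copy, then replay diffs forward;
            # this costs at most SNAPSHOT_EVERY - 1 applications, instead of
            # a chain reaching all the way back to the first revision
            base = rev
            while self.entries[base][0] != "full":
                base -= 1
            text = list(self.entries[base][1])
            for i in range(base + 1, rev + 1):
                text = apply_delta(text, self.entries[i][1])
            return text

RCS does the mirror image of this (latest version in full, diffs pointing backwards), but the cost argument is the same.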
Still, when vandalism is reverted, versions 6 and 8 might be identical, so storing the two diffs (back and forth) would be less than optimal. Further, when pieces of text are moved between two articles, the best compression would have to consider the whole table. Perhaps MySQL (or the underlying filesystem) should implement the compression.
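For the revert case there is a cheap trick: key every stored text by a content hash, so that an identical later revision just points at the text that is already in the table instead of adding another pair of diffs. Again only an illustrative sketch (the names are mine), not a claim about how MediaWiki or MySQL would do it:

    import hashlib

    stored_texts = {}  # content hash -> stored text (or a row id in a text table)
    history = []       # per revision: just the hash, in chronological order

    def text_key(lines):
        """Content hash of a revision; identical texts get identical keys."""
        return hashlib.sha1("\n".join(lines).encode("utf-8")).hexdigest()

    def add_revision(lines):
        key = text_key(lines)
        stored_texts.setdefault(key, list(lines))  # a revert re-uses the old copy
        history.append(key)

The cross-article case is harder, since a hash only catches texts that are literally identical as a whole, which is why compression across the whole table would have to live lower down, in MySQL or the filesystem.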
I don't know whether any existing version control system uses a relational database backend (MySQL, PostgreSQL, ...), but this would be an interesting combination independent of Wikipedia, so perhaps it should be developed as a generic component that can be used from Wikipedia as well as from other applications. In particular, the way MediaWiki stores the changelog in a searchable relational table is a great improvement over primitive file-based systems such as RCS and CVS.
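To show what I mean by a searchable changelog, something along these lines would already go a long way (the table and column names are invented for the example, not the actual MediaWiki schema):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("""
        CREATE TABLE revision (
            rev_id     INTEGER PRIMARY KEY,
            title      TEXT NOT NULL,
            editor     TEXT NOT NULL,
            timestamp  TEXT NOT NULL,
            comment    TEXT,
            full_text  TEXT NOT NULL   -- or a delta, as sketched above
        )
    """)

    db.executemany(
        "INSERT INTO revision (title, editor, timestamp, comment, full_text)"
        " VALUES (?, ?, ?, ?, ?)",
        [
            ("London", "Alice", "2003-07-01 10:00", "initial stub",
             "London is a city."),
            ("London", "Bob", "2003-07-02 09:30", "expand intro",
             "London is the capital of England."),
        ],
    )

    # The changelog is now an ordinary SQL query, e.g. the history of one article:
    for row in db.execute(
        "SELECT rev_id, editor, timestamp, comment FROM revision"
        " WHERE title = ? ORDER BY timestamp",
        ("London",),
    ):
        print(row)

With RCS or CVS, answering the same question means parsing rlog output file by file.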