On Tue, Sep 20, 2011 at 5:36 PM, Anthony <wikimail(a)inbox.org> wrote:
On Tue, Sep 20, 2011 at 3:37 PM, Happy Melon wrote:
It may or may not be an architecturally-better design to have it as a
separate table, although considering how rapidly MW's 'architecture'
changes, I'd say keeping things as simple as possible is probably a
virtue. But I don't think that is the basis on which we should be
deciding it.
It's an intentional denormalization of the database, done apparently
for performance reasons (although I still can't figure out exactly
*why* it's being done, as it still seems to be useful only for the dump
system, and therefore should be part of the dump system, not part of
MediaWiki proper). It doesn't even seem to apply to "normal", i.e.
non-Wikimedia, installations.
1) Those dumps are generated by MediaWiki from MediaWiki's database -- try
Special:Export on the web UI, some API methods, and the dumpBackup.php
maintenance script.
2) Checksums would be of fairly obvious benefit for verifying text storage
integrity within MediaWiki's own databases (though perhaps best sitting on,
or keyed to, the text table...?). Default installs tend to use simple
plain-text or gzipped storage, but big installs like Wikimedia's sites (and
not necessarily just us!) optimize storage space by batch-compressing
multiple text nodes into a local or remote blobs table.
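That integrity check could be sketched roughly as follows. This is a hypothetical illustration in Python, not MediaWiki's actual PHP code; it assumes the stored checksum is a SHA-1 of the UTF-8 revision text rendered in base-36 (my understanding of what the proposed rev_sha1 column holds), and the base-36 helper is written out by hand:

```python
import hashlib

def text_checksum(text):
    """SHA-1 of the UTF-8 revision text, rendered in base-36.

    Hypothetical sketch: MediaWiki itself would do this in PHP.
    """
    digest = int(hashlib.sha1(text.encode("utf-8")).hexdigest(), 16)
    alphabet = "0123456789abcdefghijklmnopqrstuvwxyz"
    out = ""
    while digest:
        digest, rem = divmod(digest, 36)
        out = alphabet[rem] + out
    return out or "0"

def verify_revision(stored_sha1, text):
    """Recompute the checksum of the text we actually have on disk
    (or in the blobs table) and compare it to the stored column."""
    return stored_sha1 == text_checksum(text)
```

With a checksum stored alongside each revision, a periodic job could walk the text/blobs storage and flag any revision where verify_revision() fails -- corruption introduced by recompression or migration would show up without needing an external reference copy.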
On Tue, Sep 20, 2011 at 4:45 PM, Happy Melon wrote:
This is a big project which still retains enthusiasm because we recognise
that it has equally big potential to provide interesting new features far
beyond the immediate use cases we can construct now (dump validation and
'something to do with reversions').
Can you explain how it's going to help with dump validation? It seems
to me that further denormalizing the database is only going to
*increase* these sorts of problems.
You'd be able to confirm that the text in an XML dump, or accessible through
the wiki directly, matches what the database thinks it contains -- and that
a given revision hasn't been corrupted by some funky series of accidents in
XML dump recycling or External Storage recompression.
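As a sketch of that dump-validation use: walk the revisions in a dump, recompute each text's hash, and compare against what the database claims. Everything here is a hypothetical stand-in -- the XML fragment is a toy, not the real MediaWiki export schema (which uses XML namespaces), and the dict plays the role of the revision table:

```python
import hashlib
import xml.etree.ElementTree as ET

# Toy dump fragment; a real dump follows the MediaWiki export schema.
DUMP = """<mediawiki>
  <page>
    <title>Example</title>
    <revision>
      <id>1</id>
      <text>Hello, world</text>
    </revision>
  </page>
</mediawiki>"""

def verify_dump(xml_text, stored_checksums):
    """Return the ids of revisions whose dump text does not hash to
    what the database claims.

    stored_checksums: {rev_id: hex sha1} -- stand-in for checksums
    read from the revision table.
    """
    bad = []
    root = ET.fromstring(xml_text)
    for rev in root.iter("revision"):
        rev_id = int(rev.findtext("id"))
        text = rev.findtext("text") or ""
        actual = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if stored_checksums.get(rev_id) != actual:
            bad.append(rev_id)
    return bad
```

The point is that the comparison works in both directions: a mismatch can mean the dump was corrupted in recycling, or that the stored text itself no longer matches what the database once hashed.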
IMO that's about the only thing it's really useful for; detecting
non-obviously-performed reversions seems like an edge case that's not worth
optimizing for, since it would fail to handle lots of cases like reverting
partial edits (say an "undo" of a section edit where there are other
intermediary edits -- since the other parts of the page text are not
identical, you won't get a match on the checksum).
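The failure mode described above is easy to demonstrate with a toy history (all revision texts hypothetical): a clean revert reproduces an earlier revision's text byte-for-byte, so its checksum matches, while an "undo" of one section with an intervening edit elsewhere yields text -- and hence a checksum -- matching no prior revision:

```python
import hashlib

def sha1_of(text):
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

# Hypothetical page history.
history = [
    "Intro.\nSection A: good.\nSection B: good.",            # r1
    "Intro.\nSection A: vandalized.\nSection B: good.",      # r2 vandalizes A
    "Intro.\nSection A: vandalized.\nSection B: improved.",  # r3 edits B
    "Intro.\nSection A: good.\nSection B: improved.",        # r4 undoes only r2
]
checksums = [sha1_of(t) for t in history]

# A straight revert of r2 (done before r3) would restore r1's exact text,
# so its checksum would equal r1's -- detectable by comparison.
assert sha1_of(history[0]) == checksums[0]

# But r4, a partial "undo" with r3 in between, matches no earlier
# revision's checksum, so checksum comparison cannot see it as a revert.
assert checksums[3] not in checksums[:3]
```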