On Tue, Sep 20, 2011 at 5:36 PM, Anthony <wikimail@inbox.org> wrote:
On Tue, Sep 20, 2011 at 3:37 PM, Happy Melon <happy-melon@live.com> wrote:
It may or may not be an architecturally-better design to have it as a separate table, although considering how rapidly MW's 'architecture' changes I'd say keeping things as simple as possible is probably a virtue. But that is the basis on which we should be deciding it.
It's an intentional denormalization of the database, done apparently for performance reasons (although I still can't figure out exactly *why* it's being done, as it still seems to be useful only for the dump system, and therefore should be part of the dump system, not part of MediaWiki proper). It doesn't even seem to apply to "normal", i.e. non-Wikimedia, installations.
1) Those dumps are generated by MediaWiki from MediaWiki's database -- try Special:Export in the web UI, some API methods, and the dumpBackup.php maintenance script family.
2) Checksums would be of fairly obvious benefit for verifying text storage integrity within MediaWiki's own databases (though perhaps best sitting on, or keyed to, the text table...?). Default installs tend to use simple plain-text or gzipped storage, but big installs like Wikimedia's sites (and not necessarily just us!) optimize storage space by batch-compressing multiple text nodes into a local or remote blobs table. (Rough sketch of the checksum itself below.)
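For concreteness, here's a minimal sketch (Python, just because it's compact) of the checksum I'd expect such a field to hold -- I'm assuming a base-36-encoded SHA-1 of the revision text, zero-padded to 31 digits; the helper name is mine, so check the actual schema before relying on it:

import hashlib

def sha1_base36(text):
    # SHA-1 of the UTF-8 text, rendered as lowercase base 36 and
    # zero-padded to 31 digits (31 = ceil(160 / log2(36))).
    n = int(hashlib.sha1(text.encode('utf-8')).hexdigest(), 16)
    digits = '0123456789abcdefghijklmnopqrstuvwxyz'
    out = ''
    while n:
        n, r = divmod(n, 36)
        out = digits[r] + out
    return out.rjust(31, '0')

print(sha1_base36('Hello, world!'))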
On Tue, Sep 20, 2011 at 4:45 PM, Happy Melon <happy.melon.wiki@gmail.com> wrote:
This is a big project which still retains enthusiasm because we recognise that it has equally big potential to provide interesting new features far beyond the immediate use cases we can construct now (dump validation and 'something to do with reversions').
Can you explain how it's going to help with dump validation? It seems to me that further denormalizing the database is only going to *increase* these sorts of problems.
You'd be able to confirm that the text in an XML dump, or accessible through the wiki directly, matches what the database thinks it contains -- and that a given revision hasn't been corrupted by some funky series of accidents in XML dump recycling or External Storage recompression.
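Once checksums are exposed in the dumps -- say as a per-revision <sha1> element holding the same base-36 SHA-1; the element name and placement are my assumption, so check the export schema -- the verification pass could look something like this rough sketch (Python 3.8+ for the {*} namespace wildcard):

import hashlib
import xml.etree.ElementTree as ET

def sha1_base36(text):
    n = int(hashlib.sha1(text.encode('utf-8')).hexdigest(), 16)
    digits = '0123456789abcdefghijklmnopqrstuvwxyz'
    out = ''
    while n:
        n, r = divmod(n, 36)
        out = digits[r] + out
    return out.rjust(31, '0')

def verify_dump(path):
    # Stream the dump so huge files don't have to fit in memory.
    for _event, rev in ET.iterparse(path):
        if not rev.tag.endswith('revision'):
            continue
        text = rev.findtext('{*}text') or ''
        stored = rev.findtext('{*}sha1')
        if stored and sha1_base36(text) != stored:
            print('checksum mismatch in revision', rev.findtext('{*}id'))
        rev.clear()  # drop the processed subtree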
IMO that's about the only thing it's really useful for. Detecting non-obviously-performed reversions seems like an edge case that's not worth optimizing for, since it fails on lots of cases, like reverting partial edits: say an "undo" of a section edit where there are other intermediary edits -- since the other parts of the page text are not identical, you won't get a match on the checksum.
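A toy illustration of that failure mode (hypothetical page text; plain hex SHA-1, since the encoding doesn't matter for the point):

import hashlib

def sha1(t):
    return hashlib.sha1(t.encode('utf-8')).hexdigest()

r1 = "== A ==\ngood intro\n== B ==\ngood body\n"
r2 = "== A ==\nVANDALISM\n== B ==\ngood body\n"       # bad edit to section A
r3 = "== A ==\nVANDALISM\n== B ==\nimproved body\n"   # unrelated edit to B
r4 = "== A ==\ngood intro\n== B ==\nimproved body\n"  # "undo" of the bad edit

# The undo restores section A, but section B has moved on, so the final
# text hashes to something none of the earlier revisions match:
print(sha1(r4) in {sha1(r1), sha1(r2), sha1(r3)})  # False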
-- brion