I have no idea about the schema changes, but choosing a digest for detecting identity reverts is pretty simple. The really difficult part is choosing a locality-sensitive hash (LSH) or fingerprint that works for very similar revisions with a lot of content.
I would propose that the digest be stored in the database and that an LSH or fingerprint be calculated on the fly by the API, unless someone can find a really good way to compute and store an LSH or fingerprint that has all the necessary properties.
For all the purposes I know (and care) about, the digest will be used for detecting identity reverts, while the LSH/fingerprint will be used for resynchronization after difficult partial reverts. In addition, it seems likely that fingerprints are necessary for more fine-grained analysis.
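A minimal sketch of digest-based identity revert detection, assuming SHA-1 over the raw revision text. The hash choice and the names digest and find_identity_reverts are only illustrative, not anything that exists in MediaWiki:

```python
import hashlib


def digest(text: str) -> str:
    """Hex SHA-1 of the revision text (SHA-1 is an assumption for this sketch)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def find_identity_reverts(revisions):
    """Return (reverting_index, restored_index) pairs for a page history.

    `revisions` is a list of revision texts in chronological order.
    A revision counts as an identity revert when its digest equals that of
    an earlier revision that is not its immediate predecessor.
    """
    last_seen = {}  # digest -> index of the most recent revision with that digest
    reverts = []
    for i, text in enumerate(revisions):
        d = digest(text)
        if d in last_seen and last_seen[d] < i - 1:
            reverts.append((i, last_seen[d]))
        last_seen[d] = i
    return reverts
```

With the digest stored per revision, the lookup would only compare stored values and never touch the text itself.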
It seems that the necessary properties of the LSH and the fingerprint scale with increasing content, which makes it difficult to precompute a value.
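To illustrate why the fingerprint grows with the content, here is a rough sketch that uses word shingling and Jaccard similarity. This is one common technique, swapped in for illustration only; the names shingles, fingerprint and similarity and the shingle size k=3 are assumptions:

```python
import hashlib


def shingles(text, k=3):
    """Return the set of overlapping k-word shingles in the text."""
    words = text.split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def fingerprint(text, k=3):
    """Set of hashed shingles; its size grows with the length of the text."""
    return {hashlib.sha1(s.encode("utf-8")).hexdigest() for s in shingles(text, k)}


def similarity(a, b, k=3):
    """Jaccard similarity of the two fingerprints, between 0.0 and 1.0."""
    fa, fb = fingerprint(a, k), fingerprint(b, k)
    if not fa and not fb:
        return 1.0  # two empty texts are treated as identical
    return len(fa & fb) / len(fa | fb)
```

A fingerprint like this has roughly one entry per word of content, so storing it for every revision would be costly, whereas computing it on request for the few revisions being compared is cheap.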
John
On Mon, Nov 28, 2011 at 2:28 AM, Tim Starling <tstarling@wikimedia.org> wrote:
> On 28/11/11 08:29, Brion Vibber wrote:
>> So... this seems to have snuck back in a month ago: https://www.mediawiki.org/wiki/Special:Code/MediaWiki/101021
>
> I don't think it really "snuck", Rob has been talking about it for a while, see e.g. comment 27.
>
>> Have we resolved the deployment questions on how to actually do the change? Just want to make sure ops has plenty of warning before 1.19 comes down the pipe. (Especially if we have to revert anything back to 1.18 during/after!)
>
> It can be deployed like any column addition to a large table: on the slaves first, then switch masters, then on the old masters. For 1.17 we changed categorylinks (60M rows on enwiki), and that caused no problems. In 1.18 the schema changes were done by ops (Asher), and included flaggedrevs which is 30M rows on dewiki.
>
> The revision table is 320M rows on enwiki, but it doesn't pose any special challenges, as long as there's enough disk space. The snapshot host db26 is the only host which may possibly be in danger of running out of space, but if its snapshots are deleted and the space reallocated to /a then it won't have any trouble.
>
> Like the previous schema changes, this schema change will be done in advance of the software version change. The old version will work with the new schema, and the default value is harmless, so reverting back to 1.18 or restarting the populate script won't be a problem.
>
> -- Tim Starling
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l