I have no idea about the schema changes, but choosing a digest for
detecting identity reverts is pretty simple. The really difficult
part is choosing a locality-sensitive hash (LSH) or fingerprint that
works for very similar revisions with a lot of content.
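
To illustrate the simple half, here is a rough sketch of identity-revert
detection from stored digests. SHA-1 is only an assumption on my part
(any stable digest would do), and the function names are made up for
the example; nothing here is actual MediaWiki code.

```python
import hashlib


def digest(text: str) -> str:
    """Hex SHA-1 of a revision's text (assumed digest; any stable one works)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def find_identity_reverts(revision_texts):
    """Return (revert_index, restored_index) pairs for a page history.

    revision_texts is a list of revision texts in chronological order.
    A revision counts as an identity revert when its digest equals that
    of an earlier, non-adjacent revision. Adjacent duplicates are
    treated as null edits and skipped.
    """
    last_seen = {}  # digest -> index of the most recent revision with it
    reverts = []
    for i, text in enumerate(revision_texts):
        d = digest(text)
        j = last_seen.get(d)
        if j is not None and j < i - 1:
            reverts.append((i, j))
        last_seen[d] = i
    return reverts


history = ["a", "a b", "a vandal", "a b", "a b c"]
print(find_identity_reverts(history))  # [(3, 1)]
```

With the digest stored per revision, the loop would compare stored
values only and never needs to load the revision text.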
I would propose that the digest be stored in the database, and that an
LSH or fingerprint be calculated on the fly by the API, unless someone
can find a really good way to compute and store an LSH or fingerprint
that has all the necessary properties.
For all the purposes I know of (and care about), the digest will be
used for detecting identity reverts, while the LSH/fingerprint will be
used for resynchronization after difficult partial reverts. In
addition, it seems likely that fingerprints are necessary for more
fine-grained analysis.
It seems that the properties required of the LSH and the fingerprint
scale with the amount of content, which makes it difficult to
precompute a value.
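
As a rough illustration of what "calculated on the fly" could look
like, here is a minimal SimHash-style sketch. SimHash is just one
possible locality-sensitive scheme that I picked for the example, not
a proposal for the final algorithm; the 64-bit width, the whitespace
tokenization and the use of MD5 as the per-token hash are all
assumptions.

```python
import hashlib

BITS = 64  # assumed fingerprint width


def simhash(text: str) -> int:
    """64-bit SimHash over whitespace-separated tokens."""
    weights = [0] * BITS
    for token in text.split():
        # Take the first 8 bytes of MD5 as a 64-bit per-token hash.
        raw = hashlib.md5(token.encode("utf-8")).digest()[:BITS // 8]
        h = int.from_bytes(raw, "big")
        for b in range(BITS):
            weights[b] += 1 if (h >> b) & 1 else -1
    # A bit is set in the fingerprint when most tokens voted for it.
    return sum(1 << b for b in range(BITS) if weights[b] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


old = "the quick brown fox jumps over the lazy dog"
new = "the quick brown fox leaps over the lazy dog"
print(hamming(simhash(old), simhash(new)))
```

A small Hamming distance suggests similar revisions. A fixed 64-bit
value like this loses resolution as revisions get longer, which is the
scaling problem I mean above.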
John
On Mon, Nov 28, 2011 at 2:28 AM, Tim Starling <tstarling(a)wikimedia.org> wrote:
On 28/11/11 08:29, Brion Vibber wrote:
I don't think it really "snuck", Rob has been talking about it for a
while, see e.g. comment 27.
Have we resolved the deployment questions on how
to actually do the change?
Just want to make sure ops has plenty of warning before 1.19 comes down the
pipe. (Especially if we have to revert anything back to 1.18 during/after!)
It can be deployed like any column addition to a large table: on the
slaves first, then switch masters, then on the old masters. For 1.17
we changed categorylinks (60M rows on enwiki), and that caused no
problems. In 1.18 the schema changes were done by ops (Asher), and
included flaggedrevs which is 30M rows on dewiki.
The revision table is 320M rows on enwiki, but it doesn't pose any
special challenges, as long as there's enough disk space. The snapshot
host db26 is the only host which may possibly be in danger of running
out of space, but if its snapshots are deleted and the space
reallocated to /a then it won't have any trouble.
Like the previous schema changes, this schema change will be done in
advance of the software version change. The old version will work with
the new schema, and the default value is harmless, so reverting back
to 1.18 or restarting the populate script won't be a problem.
-- Tim Starling
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l