On Mon, Sep 19, 2011 at 12:53 PM, Asher Feldman <afeldman [at] wikimedia> wrote:
Since the primary use case here seems to be offline analysis and it may not be of much interest to mediawiki users outside of wmf, can we store the checksums in new tables (i.e. revision_sha1) instead of running large alters, and implement the code to generate checksums on new edits via an extension?
Checksums for most old revs can be generated offline and populated before the extension goes live. Since nothing will be using the new table yet, there'd be no issues with things like gap lock contention on the revision table from mass populating it.
That's probably the simplest solution; adding a new empty table will be very quick. It may make it slower to use the field though, depending on what all uses/exposes it.
During stub dump generation for instance this would need to add a left outer join on the other table, and add things to the dump output (and also needs an update to the XML schema for the dump format). This would then need to be preserved through subsequent dump passes as well.
-- brion
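To make the trade-off above concrete, here is a minimal sketch of the separate-table approach: checksums are backfilled offline into revision_sha1, and a dump-style query then needs a left outer join to pick them up. It uses an in-memory SQLite database purely for illustration. The column names `rs_rev_id` and `rs_sha1`, the helper functions, and the in-row `rev_text` column are assumptions made for this example; they are not MediaWiki's actual schema or dump code.

```python
import hashlib
import sqlite3


def sha1_hex(text: str) -> str:
    """Return the hex SHA-1 digest of the UTF-8 encoded revision text."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def build_demo_db() -> sqlite3.Connection:
    """Create a toy revision table plus an empty side table for checksums."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Simplified stand-in for the core revision table (hypothetical layout).
        CREATE TABLE revision (
            rev_id   INTEGER PRIMARY KEY,
            rev_page INTEGER NOT NULL,
            rev_text TEXT NOT NULL
        );
        -- New, initially empty side table; column names are assumptions.
        CREATE TABLE revision_sha1 (
            rs_rev_id INTEGER PRIMARY KEY,
            rs_sha1   TEXT NOT NULL
        );
        """
    )
    conn.executemany(
        "INSERT INTO revision VALUES (?, ?, ?)",
        [(1, 10, "first"), (2, 10, "second"), (3, 10, "third")],
    )
    return conn


def backfill(conn: sqlite3.Connection, up_to_rev_id: int) -> int:
    """Populate checksums for old revisions without altering revision itself."""
    rows = conn.execute(
        "SELECT rev_id, rev_text FROM revision WHERE rev_id <= ?",
        (up_to_rev_id,),
    ).fetchall()
    conn.executemany(
        "INSERT OR REPLACE INTO revision_sha1 VALUES (?, ?)",
        [(rev_id, sha1_hex(text)) for rev_id, text in rows],
    )
    conn.commit()
    return len(rows)


def stub_rows(conn: sqlite3.Connection) -> list:
    """Dump-style read: the checksum is NULL for revisions not yet backfilled."""
    return conn.execute(
        """
        SELECT r.rev_id, s.rs_sha1
        FROM revision AS r
        LEFT OUTER JOIN revision_sha1 AS s ON s.rs_rev_id = r.rev_id
        ORDER BY r.rev_id
        """
    ).fetchall()
```

Every reader of the checksum pays for the extra join and has to handle a missing row, which is the cost brion points to; in exchange, the backfill never touches the revision table.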
Can we resist the temptation to implement schema changes as new tables purely to make life easier for Wikimedia? Core schema changes are certainly enough of a hurdle to warrant serious discussion, but they are not the totally-intractable mess that they used to be. 1.19 already includes index changes to the user and logging tables; it will already require the full game of musical chairs with the db slaves. Implementing this as a new column does not actually make things any more complicated, it would just mean that an operation that would take three hours before might now take five.
It may or may not be an architecturally-better design to have it as a separate table, although considering how rapidly MW's 'architecture' changes I'd say keeping things as simple as possible is probably a virtue. But that is the basis on which we should be deciding it. This is a big project which still retains its enthusiasm because we recognise that it has equally big potential to provide interesting new features far beyond the immediate use cases we can construct now (dump validation and 'something to do with reversions'). Let's not hamstring it at birth based on the operational pressures of the one MediaWiki end user who is best placed to overcome said issues.
--HM
On Tue, Sep 20, 2011 at 12:37 PM, Happy Melon <happy-melon@live.com> wrote:
Can we resist the temptation to implement schema changes as new tables purely to make life easier for Wikimedia? Core schema changes are certainly enough of a hurdle to warrant serious discussion, but they are not the totally-intractable mess that they used to be. 1.19 already includes index changes to the user and logging tables; it will already require the full game of musical chairs with the db slaves.
One or 20 additional core schema changes batched together in 1.19 isn't an issue for us. It's more that if this is a change that is primarily of use for offline analysis or cases that generally wouldn't interest users outside of WMF, should it be in core? If MW can already detect reverts on edit save without sha1 hashes, it seems that may be the case. User stories should be better defined before deciding on implementation.
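As one possible user story, the 'something to do with reversions' case could look like the sketch below: with a stored checksum per revision, finding out whether a new edit restores an earlier revision becomes a string comparison rather than a comparison of full texts. The function names and the list-based page history are hypothetical, chosen only for illustration; this is not how MediaWiki detects reverts.

```python
import hashlib
from typing import List, Optional


def sha1_hex(text: str) -> str:
    """Return the hex SHA-1 digest of the UTF-8 encoded revision text."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def find_reverted_to(history_sha1s: List[str], new_text: str) -> Optional[int]:
    """Return the index of the most recent earlier revision whose stored
    checksum matches new_text, or None if the edit restores nothing.

    history_sha1s holds one checksum per earlier revision, oldest first.
    """
    target = sha1_hex(new_text)
    # Walk backwards so the most recent matching revision wins.
    for idx in range(len(history_sha1s) - 1, -1, -1):
        if history_sha1s[idx] == target:
            return idx
    return None
```

For example, given stored checksums for the texts "a", "b", and "vandalism", an edit that saves "b" again matches the revision at index 1, while an edit saving unseen text matches nothing.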
wikitech-l@lists.wikimedia.org