Hi,
I've suggested generating bulk checksums as well, but both Brion and Ariel see the primary purpose of this field as checking the validity of the dump-generation process, so they want to generate the checksums straight from external storage.
In a general sense, there are two use cases for this new field:
1) Checking the validity of the XML dump files
2) Identifying reverts
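For the second use case, the checksum makes revert detection a simple equality check: a revision whose checksum matches an earlier revision of the same page has (barring hash collisions) restored that earlier text. A minimal sketch, assuming a hex-encoded SHA-1 (the exact hash and encoding are not settled in this thread):

import hashlib

def rev_checksum(text):
    # SHA-1 of the revision text, hex-encoded.
    # Assumption for illustration only; the real column format isn't fixed yet.
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def find_reverts(revisions):
    # revisions: one page's history in chronological order as (rev_id, text) pairs.
    # Returns (revert_rev_id, reverted_to_rev_id) pairs, i.e. revisions whose
    # checksum matches an earlier revision of the same page (use case 2).
    seen = {}
    reverts = []
    for rev_id, text in revisions:
        digest = rev_checksum(text)
        if digest in seen:
            reverts.append((rev_id, seen[digest]))
        else:
            seen[digest] = rev_id
    return reverts

# Example: revision 4 restores the text of revision 2.
history = [(1, "a"), (2, "ab"), (3, "ab c"), (4, "ab")]
print(find_reverts(history))  # [(4, 2)]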
I have started working on a proposal for deployment; while it is still incomplete, it might be a good starting point for further planning. I have been trying to come up with some back-of-the-envelope calculations of how much time and space this would take, but I don't yet have all the information required for reasonable estimates.
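To give a sense of what the storage half of such an estimate would look like (all numbers below are hypothetical placeholders, not measured values):

# Back-of-the-envelope storage cost of the new column.
# Every input here is a made-up placeholder, for illustration only.
revisions = 400_000_000      # assumed total rows in the revision table
checksum_bytes = 40          # e.g. a hex-encoded SHA-1 stored as varbinary(40)
index_multiplier = 2.0       # rough extra cost if the column is also indexed

column_gib = revisions * checksum_bytes / 1024**3
print(f"column only:    ~{column_gib:.0f} GiB")
print(f"column + index: ~{column_gib * index_multiplier:.0f} GiB")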
You can find the proposal here: http://strategy.wikimedia.org/wiki/Proposal:Implement_and_deploy_checksum_re...
I want to thank Brion and Asher for giving feedback on prior drafts. Please feel free to improve this proposal.
Best, Diederik
PS: not sure if this proposal should be on strategy or mediawiki...
On 2011-09-03, at 7:16 AM, Daniel Friesen wrote:
On 11-09-02 09:33 PM, Rob Lanphier wrote:
On Fri, Sep 2, 2011 at 5:47 PM, Daniel Friesen lists@nadir-seen-fire.com wrote:
On 11-09-02 05:20 PM, Asher Feldman wrote:
When using this for analysis, will we wish the new columns had partial indexes (on the first 6 characters)?
Bug 2939 is one relevant bug here; it could probably use an index. [1] https://bugzilla.wikimedia.org/show_bug.cgi?id=2939
My understanding is that having a normal index on a table the size of our revision table will be far too expensive for db writes. ... Rob
We've got 5 normal indexes on revision:
- A unique int+int
- A binary(14)
- An int+binary(14)
- Another int+binary(14)
- And a varchar(255)+binary(14)
For that bug, a (rev_page, rev_sha1) or (rev_page, rev_timestamp, rev_sha1) index may do.
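As a rough illustration of why Asher's 6-character prefix (or a composite index like those above) should be selective enough: how many distinct values a 6-character prefix can take depends on the encoding, which isn't fixed in this thread, but even hex already gives about 16.8 million, so within a single page the prefix is essentially as discriminating as the full hash.

# Possible values of a 6-character checksum prefix, by encoding.
# The encodings are assumptions for illustration; the column format isn't fixed here.
for name, alphabet_size in [("hex", 16), ("base-36", 36)]:
    print(f"{name}: {alphabet_size**6:,} possible 6-char prefixes")
# hex: 16,777,216 possible 6-char prefixes
# base-36: 2,176,782,336 possible 6-char prefixes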
-- ~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]