I've suggested generating bulk checksums as well, but both Brion and Ariel see the
primary purpose of this field as checking the validity of the dump-generation process,
and so they want to generate the checksums straight from the external storage.
In a general sense, there are two use cases for this new field:
1) Checking the validity of the XML dump files
2) Identifying reverts
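For the revert-detection use case, the idea is simply to compare hashes: two revisions with the same checksum almost certainly contain the same text, so a revision whose hash matches an earlier one in the page's history is a revert. A minimal sketch of a base-36 SHA-1 along the lines of the proposed field (the 31-character zero-padding is my assumption about the storage format):

```python
import hashlib

def rev_sha1(text: str) -> str:
    """Base-36 SHA-1 of revision text, zero-padded to 31 chars (assumed width)."""
    n = int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest(), "big")
    digits = "0123456789abcdefghijklmnopqrstuvwxyz"
    out = ""
    while n:
        n, r = divmod(n, 36)
        out = digits[r] + out
    return out.rjust(31, "0")

# Revert detection: the third revision restores the first one's text,
# so its hash matches an earlier hash in the page history.
history = ["foo", "bar", "foo"]
hashes = [rev_sha1(t) for t in history]
is_revert = hashes[2] in hashes[:2]
```

A hash equality check like this is cheap compared with diffing full revision text, which is what makes the field attractive for revert analysis.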
I have started to work on a proposal for deployment and, while it is still incomplete, it
might be a good starting point for further deployment planning. I have been trying to come
up with some back-of-the-envelope calculations for how much time and space it would take,
but I don't yet have all the information required for reasonable estimates.
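To illustrate the kind of back-of-the-envelope calculation I mean, here is a sketch where every number is a placeholder assumption rather than a measured figure:

```python
# All figures below are assumed placeholders for illustration only.
revisions = 400_000_000    # assumed total revision count
total_text_tb = 10         # assumed uncompressed revision text volume
sha1_mb_per_s = 300        # assumed single-core SHA-1 throughput

# Time to hash all revision text once, on a single core.
hours = total_text_tb * 1024 * 1024 / sha1_mb_per_s / 3600

# Storage for the new column: one 31-byte base-36 hash per revision row,
# ignoring index and row overhead.
storage_gb = revisions * 31 / 1024**3

print(f"~{hours:.0f} core-hours of hashing, ~{storage_gb:.0f} GB of column data")
```

The real estimates hinge on the actual text volume, the external-storage read throughput, and index overhead, which is exactly the missing information mentioned above.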
You can find the proposal here:
I want to thank Brion and Asher for their feedback on earlier drafts. Please feel free to
improve this proposal.
PS: I am not sure whether this proposal belongs on strategy or mediawiki...
On 2011-09-03, at 7:16 AM, Daniel Friesen wrote:
On 11-09-02 09:33 PM, Rob Lanphier wrote:
On Fri, Sep 2, 2011 at 5:47 PM, Daniel Friesen
On 11-09-02 05:20 PM, Asher Feldman wrote:
When using it for analysis, will we wish the new
column had a partial index
(first 6 characters?)
Bug 2939 is one relevant bug here; it could probably use
My understanding is that
having a normal index on a table the size of
our revision table will be far too expensive for db writes.
We've got 5 normal indexes on revision:
- A unique int+int
- A binary(14)
- An int+binary(14)
- Another int+binary(14)
- And a varchar(255)+binary(14)
For that bug, it would be a (rev_page,rev_sha1) or (rev_page,rev_timestamp,rev_sha1) index.
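On the partial-index question above, a 6-character prefix of a base-36 hash is already close to fully selective. A quick sketch (the revision count is an assumed order of magnitude):

```python
# Rough selectivity of a 6-character prefix index on a base-36 SHA-1 column.
# The revision count is an assumption, not a real figure.
prefix_space = 36 ** 6           # distinct 6-char base-36 prefixes
revisions = 400_000_000          # assumed size of the revision table
rows_per_prefix = revisions / prefix_space

print(prefix_space, rows_per_prefix)
```

With well under one row per prefix value on average, a 6-byte prefix index would behave almost like an index on the full column while keeping the index (and the write cost per insert) much smaller.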
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]
Wikitech-l mailing list