On Sat, Sep 3, 2011 at 12:33 AM, Rob Lanphier robla@wikimedia.org wrote:
I generally suspect that a standard index is going to be a waste for the most urgent uses of this. It will rarely be interesting to search for common hashes between articles. The far more common case will be to search for duplicate hashes within the history of a single article. My understanding is that having a normal index on a table the size of our revision table will be far too expensive for db writes.
If it's useful for Wikimedia and the index is expensive to maintain, the index can be kept on only one slave per DB cluster to minimize cost, the way we've always done for RC. If it's not useful for Wikimedia, only third-party researchers, then we could leave it unindexed on the cluster and have an index only on the toolserver or whatnot.
It's not so much that an extra index would be untenably expensive in principle, more that it's not worth the cost if it's not used for anything important. If it's just for statistical analysis and not for the live site, there's no reason to have it on all servers, except maybe administrative simplicity. The toolserver already has a bunch of indexes that Wikimedia doesn't, for exactly this sort of reason.
This is the first I've heard of partial indexes (/me researches) I don't know if a partial index is going to be cheap enough that we can use it, and useful enough that we'd want to. Would this be a faster query in a world with a partial index on the first six characters? SELECT rev_id FROM revision WHERE rev_page=12345 AND rev_sha1='4cdbd80be15fcfff139fb8a95f2ca359520939ee'
I'd think so, yes. MySQL should be smart enough to use partial indexes at least that far -- otherwise there'd be no point in supporting them. I'd think an index on the first several bytes should be almost as effective as one on the whole value, if you just want to do filtering or joining (not sorting). It might be more effective, since more would fit in RAM. However, I don't know for sure offhand.
...and would either of these queries be considered too expensive to run without a partial index? How about with a partial index?
Without a suitable index, running either of these queries would scan the entire revision history of the article in question. That would certainly not be acceptable on the cluster if the query is to be run with any frequency on arbitrary articles. It would be okay if it were just an occasional thing run on a limited number of articles, or only on articles with few revisions in their histories.
All of the above should be interpreted with the understanding that I stand a decent chance of knowing what I'm talking about, but have no say at this point and shouldn't be taken too seriously. :)