On Sat, Sep 3, 2011 at 12:33 AM, Rob Lanphier <robla(a)wikimedia.org> wrote:
I generally suspect that a standard index is going to
be a waste for
the most urgent uses of this. It will rarely be interesting to search
for common hashes between articles. The far more common case will be
to search for duplicate hashes within the history of a single article.
My understanding is that having a normal index on a table the size of
our revision table will be far too expensive for db writes.
If it's useful for Wikimedia and the index is expensive to maintain,
the index can be kept on only one slave per DB cluster to minimize
cost, the way we've always done for RC. If it's not useful for
Wikimedia, only third-party researchers, then we could leave it
unindexed on the cluster and have an index only on the toolserver or
It's not so much that an extra index would be untenably expensive in
principle, more that it's not worth the cost if it's not used for
anything important. If it's just for statistical analysis and not for
the live site, there's no reason to have it on all servers, except
maybe administrative simplicity. The toolserver already has a bunch
of indexes that Wikimedia doesn't, for exactly this sort of reason.
This is the first I've heard of partial indexes
(/me researches) I
don't know if a partial index is going to be cheap enough that we can
use it, and useful enough that we'd want to. Would this be a faster
query in a world with a partial index on the first six characters?
SELECT rev_id FROM revision WHERE rev_page=12345 AND
I'd think so, yes. MySQL should be smart enough to use partial
indexes at least that far -- otherwise there'd be no point in
supporting them. I'd think an index on the first several bytes should
be almost as effective as one on the whole value, if you just want to
do filtering or joining (not sorting). It might be more effective,
since more would fit in RAM. However, I don't know for sure offhand.
...and would either of these queries be considered too
run without a partial index? How about with a partial index?
Without a suitable index, running either of these queries would scan
the entire revision history of the article in question. That would
certainly not be acceptable on the cluster if the query is to be run
with any frequency on arbitrary articles. It would be okay if it were
just an occasional thing run on a limited number of articles, or only
on articles with few revisions in their histories.
All of the above should be interpreted with the understanding that I
stand a decent chance of knowing what I'm talking about, but have no
say at this point and shouldn't be taken too seriously. :)