On 9/12/06, Aerik Sylvan aerik@thesylvans.com wrote:
So, *if* throwing sql at the existing table is too slow (and it looks like it probably is), then the next alternative is to create records for the intersections themselves, so the retrieval query is a simple one.
Common queries would need to be stored and maintained separately, yes. This is fine and dandy as long as no one tries exotic intersections, it wouldn't be too much more load than the current system, but that's an empirical question whose answer is unknown and can't be assumed.
So you can see that the number of combinations would get to be quite a lot. I would propose writing them into the categorylinks table
Wait, write an *exponential* number of combinations into the categorylinks table? Even a thousand categories (and Wikipedia has far more) would work out to a *googol cubed* different combinations. Not only is there not enough memory and disk space on the planet to store that, there aren't that many *particles in the universe*.
Am I misunderstanding you? Maybe only non-empty category intersections? That requires slightly more thought as to scalability. If you have a million pages, and each has two random categories, there can be a maximum of a million intersections (if no two pages have the same two categories). If you bump that up to three categories, however, the maximum possible number is three million; four is six million, etc. It's O(m*2^n), or linear-exponential time. That's also far too high to be acceptable, unless you can guarantee that n remains very small (which you can't).
Basically, the most practical solution in SQL would probably be to pick the ten thousand or hundred thousand or million most-visited intersections and maintain them as virtual categories, so they can be retrieved in the same time as normal categories. The only question is what percentage of queries this will deal with, which (as I said) is unknown at this point. Domas seemed rather skeptical of the efficiency of even this scheme, and given your benchmarks in a much smaller wiki his skepticism seems quite warranted.
Brion's idea of a full-text search seems to be the way to go at this point, at least to me. After all, searching for "apple pie" does nothing but give you the intersection of articles containing the words "apple" and "pie", and we already do that, so it can't be overwhelmingly expensive, I suppose. (But a disclaimer: I am not, in fact, particularly knowledgeable in *any* of this stuff.)
On 9/12/06, David Gerard dgerard@gmail.com wrote:
Well, yes ... How does other software with arbitrary tags do it?
And that's a complementary way to go.