Simetrical <Simetrical+wikitech@...> writes:
On 9/12/06, Aerik Sylvan <aerik@...> wrote:
So, *if* throwing SQL at the existing table is too slow (and it looks like it probably is), then the next alternative is to create records for the intersections themselves, so the retrieval query is a simple one.
Common queries would need to be stored and maintained separately, yes. This is fine and dandy as long as no one tries exotic intersections; in that case it wouldn't be too much more load than the current system, but that's an empirical question whose answer is unknown and can't be assumed.
Well, I'm actually proposing to write all the non-empty intersections (up to some limit) into the database *IF* throwing SQL at the existing structure is too inefficient. These would be written when a page is saved (perhaps only if the categories on the page have changed?).
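For reference, the "throw SQL at the existing structure" query I've been benchmarking is just a self-join on categorylinks - the category names below are made up, and I'm going from memory on the column names (cl_from / cl_to), so treat it as a sketch:

  SELECT c1.cl_from
  FROM categorylinks AS c1
    JOIN categorylinks AS c2 ON c2.cl_from = c1.cl_from
  WHERE c1.cl_to = 'Apples'
    AND c2.cl_to = 'Pies';

Every additional category in the intersection means one more join against the same big table, which is exactly where I'd expect the cost to climb.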
So you can see that the number of combinations would get to be quite a lot. I would propose writing them into the categorylinks table.
Wait, write an *exponential* number of combinations into the categorylinks table? Even a thousand categories (and Wikipedia has far more) would work out to a *googol cubed* different combinations. Not only is there not enough memory and disk space on the planet to store that, there aren't that many *particles in the universe*.
Am I misunderstanding you? Maybe you mean only non-empty category intersections? That requires slightly more thought as to scalability. If you have a million pages, and each has two random categories, there can be at most a million intersections (if no two pages have the same pair of categories). Bump that up to three categories per page, however, and the maximum number of pairwise intersections is three million; with four it's six million, and so on. Counting intersections of every size, it's O(m*2^n) rows: linear in the number of pages, exponential in the categories per page. That's also far too high to be acceptable, unless you can guarantee that n remains very small (which you can't).
It's not actually exponential (because we're only counting distinct *unordered* combinations). After a bunch of research I found out that the number of intersections is what math-speak calls "n choose k" (see http://mathworld.wolfram.com/BinomialCoefficient.html), and the way to calculate it (for each k, where k is the number of intersecting categories) is n!/((n-k)!k!). For our purposes, I did a quick sampling of 10 English Wikipedia pages and got an average of 3.8 categories per page; let's round up and say 4. If we want to store intersections (with a limit of 4 intersecting categories), the count is the sum of that little formula for k=2, k=3, and k=4 (with n=4), which is 6 + 4 + 1 = 11 intersections, in addition to the single categories we're already storing.
If a page has 5 categories, the numbers go up: instead of 11 we'd have to store 10 + 10 + 5 = 25 intersections (and, correspondingly, we'd be doing 11 or 25 extra INSERT statements at save time...). So, if we had a histogram of categories per page for Wikipedia, we could predict quite accurately how many non-empty intersections there would be; I'd guess somewhere between 11 and 25 million. What's the retrieval time for a simple SELECT query against 25 million rows?
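To make those INSERTs concrete, here's the kind of row I'm picturing - the combined-key encoding (sorted names joined with '|') is just a placeholder, not a worked-out scheme, and I'm ignoring cl_sortkey and friends:

  -- page 12345 saved with the three categories Apples, Desserts, Pies:
  INSERT INTO categorylinks (cl_from, cl_to) VALUES
    (12345, 'Apples|Desserts'),
    (12345, 'Apples|Pies'),
    (12345, 'Desserts|Pies'),
    (12345, 'Apples|Desserts|Pies');

With three categories that's 4 extra rows; with four it's the 11 above, with five the 25. Retrieval then becomes the same single-key lookup we already do for an ordinary category, just against the combined key.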
Basically, the most practical solution in SQL would probably be to pick the ten thousand or hundred thousand or million most-visited intersections and maintain them as virtual categories, so they can be retrieved in the same time as normal categories. The only question is what percentage of queries this will deal with, which (as I said) is unknown at this point. Domas seemed rather skeptical of the efficiency of even this scheme, and given your benchmarks in a much smaller wiki his skepticism seems quite warranted.
I suppose we could come up with some scheme to do this - cache intersections for a period of time, then keep the ones that are popular... The only thing is that, by definition, this happens at query time, and each uncached intersection would take something around (probably greater than) 2 tenths of a second until it's cached. Perhaps that's the way to do it - launch it as a beta feature, fill up the cache, then release it fully.
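If we did go that route, I'd picture a separate cache table rather than overloading categorylinks. Every name below is made up, purely to illustrate the shape of the thing:

  CREATE TABLE intersection_cache (
    ic_key     VARCHAR(255) NOT NULL,  -- sorted category names joined into one key
    ic_page    INT UNSIGNED NOT NULL,  -- a page belonging to that intersection
    ic_touched TIMESTAMP    NOT NULL,  -- last time this key was requested
    PRIMARY KEY (ic_key, ic_page),
    KEY ic_touched (ic_touched)
  );

The first request for a key pays the 0.2-second-or-worse query and fills in the rows; later requests are an ordinary indexed lookup, and a maintenance script can expire keys whose ic_touched is stale - which is more or less the beta-period cache-warming idea above.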
Brion's idea of a full-text search seems to be the way to go at this point, at least to me. After all, searching for "apple pie" does nothing but give you the intersection of articles containing the words "apple" and "pie", and we already do that, so it can't be overwhelmingly expensive, I suppose. (But a disclaimer: I am not, in fact, particularly knowledgeable in *any* of this stuff.)
On 9/12/06, David Gerard <dgerard@...> wrote:
Well, yes ... How does other software with arbitrary tags do it?
And that's a complementary way to go.
Yeah, but that's the exact same problem - how does a full-text search / tagging engine do it? It's facing the same issues we're talking about. The only thing I can think of is that they're using a completely different data structure (and what would that be?) and/or a faster (compiled?) language.
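That said, the MySQL full-text machinery we already run does effectively do an index intersection for us. Assuming the searchindex table still looks roughly the way I remember it (si_page / si_text with a FULLTEXT index), the "apple pie" example is just a boolean-mode match:

  SELECT si_page
  FROM searchindex
  WHERE MATCH(si_text) AGAINST('+apple +pie' IN BOOLEAN MODE);

If category names were folded into that index as extra tokens (one synthetic token per category, say), the same machinery would presumably give category intersections without any of the combinatorial storage above - though I have no idea how it performs at Wikipedia's scale, so take this as a guess at what Brion has in mind rather than a description of it.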
Best Regards, Aerik
P.S. Sorry for the long post, but I wanted to leave everything in context.