On 9/13/06, Aerik <aerik@thesylvans.com> wrote:
It's not exponential, actually (because we're only taking distinct *unordered* combinations). After a bunch of research, I found out that the number of intersections is what's called in math-speak "n choose k" (see http://mathworld.wolfram.com/BinomialCoefficient.html), and the way to calculate it (for each value of k, which is the number of intersecting categories) is n!/((n-k)!k!).
Erk, you're right. I was too hasty. It's O(n choose k), but for any fixed k that's the same as O(n^k), since n!/((n-k)!k!) is at most n^k/k! and grows proportionally to n^k once n is large compared to k. That's of course very slow for even smallish values of k (although not ridiculously so, the way an exponential would be). On the other hand, if *n* remains small, say ten or whatever, then it's not such a major issue, I suppose.
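For concreteness, here's a quick Python sketch comparing C(n, k) to the n^k/k! bound (the particular n and k values are arbitrary, just to show the growth):

    from math import comb, factorial

    # Compare C(n, k) with the n^k / k! bound for a few illustrative values.
    for n in (10, 100, 1000):
        for k in (2, 3, 4):
            exact = comb(n, k)             # n! / ((n-k)! k!)
            bound = n ** k / factorial(k)  # C(n, k) <= n^k / k!
            print(f"n={n:5d} k={k}: C(n,k)={exact:>14,} ~ n^k/k!={bound:>16,.0f}")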
So, if we had a histogram of category sizes for Wikipedia, we could predict quite accurately how many non-empty intersections there would be. I'd guess between 11 and 25 million. What's the retrieval time for a simple SELECT query against 25 million records?
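To make the prediction idea concrete, here's a rough sketch of how you might estimate the number of non-empty *pairwise* intersections from such a histogram. It assumes pages are assigned to categories independently at random (which real categories certainly aren't), and the toy histogram numbers are made up:

    from itertools import combinations
    from math import exp

    def expected_nonempty_pairs(sizes, total_pages):
        # Under random assignment, two categories of sizes a and b share at
        # least one page with probability roughly 1 - exp(-a*b / total_pages).
        return sum(1 - exp(-a * b / total_pages)
                   for a, b in combinations(sizes, 2))

    # Made-up histogram: a few huge categories, many small ones.
    sizes = [100_000] * 5 + [10_000] * 50 + [500] * 1_000
    print(f"{expected_nonempty_pairs(sizes, 2_000_000):,.0f} non-empty pairs")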
Looking at it that way, I'll grant that you're probably about on target with your estimate, give or take some. Of course, that would increase the database size by what, tenfold? We only have a couple million pages.
Yeah, but that's the exact same problem - how does a full-text search / tagging engine do it? It's facing the same issues we're talking about. The only thing I can think of is that they're using a completely different data structure (and what would that be?) and/or a faster (compiled?) language.
They're definitely using a totally different data structure. They just index every word, I assume in a hash table or something. I'm pretty sure they don't store all the intersections, but however they handle it, it's presumably fast enough to be workable.
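For what it's worth, here's a minimal Python sketch of what I mean by indexing every word - an inverted index whose posting lists are intersected on the fly. It's just an illustration of the idea, not how any real engine (or MediaWiki's search) is actually implemented:

    from collections import defaultdict

    # Inverted index: each key (word or category) maps to the list of page
    # IDs that contain it.
    index = defaultdict(list)

    def add_page(page_id, keys):
        for key in keys:
            index[key].append(page_id)  # pages added in increasing ID order

    def intersect(*keys):
        # Intersect on the fly, starting from the shortest posting list;
        # nothing is precomputed, so there is no n-choose-k blowup.
        lists = sorted((index[k] for k in keys), key=len)
        result = set(lists[0])
        for lst in lists[1:]:
            result.intersection_update(lst)
        return sorted(result)

    add_page(1, ["Physics", "Biographies"])
    add_page(2, ["Physics", "Stubs"])
    add_page(3, ["Biographies", "Stubs"])
    print(intersect("Physics", "Biographies"))  # [1]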
One-second delays are fairly typical for Wikipedia searches, and I would expect the category list to be somewhat faster, since it's much smaller. (Even the largest categories are substantially smaller than the number of pages containing common indexed words like "all" [545,636 pages] or "not" [755,149 pages], and there are more such words than there are categories that big.)
A faster language isn't really relevant, since the slowdown is on the database side, not the PHP side. You can bet that MySQL is written in a compiled language.