On 9/13/06, Aerik <aerik@thesylvans.com> wrote:
It's not exponential, actually (because we're only taking distinct *unordered* combinations). After a bunch of research, I found out that the number of intersections is what's called in math-speak "n choose k" (see http://mathworld.wolfram.com/BinomialCoefficient.html), and the way to calculate it (for each value of k, which is the number of intersecting categories) is n!/((n-k)!k!).
Erk, you're right. I was too hasty. It's O(n choose k), but for any fixed k that's the same as O(n^k), since n!/((n-k)!k!) is at most n^k/k! and grows proportionally to n^k once n is large compared to k. That's of course very slow for even smallish values of k (although not ridiculously so, the way an exponential would be). On the other hand, if *n* remains small, say ten or whatever, then it's not such a major issue, I suppose.
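For concreteness, here's a quick Python sketch comparing C(n, k) to the n^k/k! bound (the particular n and k values are arbitrary, just to show the growth):

    from math import comb, factorial

    # Compare C(n, k) with the n^k / k! bound for a few illustrative values.
    for n in (10, 100, 1000):
        for k in (2, 3, 4):
            exact = comb(n, k)             # n! / ((n-k)! k!)
            bound = n ** k / factorial(k)  # C(n, k) <= n^k / k!
            print(f"n={n:5d} k={k}: C(n,k)={exact:>14,} ~ n^k/k!={bound:>16,.0f}")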
So, if we had a histogram of category sizes for Wikipedia, we could predict quite accurately how many non-empty intersections there would be. I'd guess between 11 and 25 million. What's the retrieval time for a simple SELECT query against 25 million records?
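To make the prediction idea concrete, here's a rough sketch of how you might estimate the number of non-empty *pairwise* intersections from such a histogram. It assumes pages are assigned to categories independently at random (which real categories certainly aren't), and the toy histogram numbers are made up:

    from itertools import combinations
    from math import exp

    def expected_nonempty_pairs(sizes, total_pages):
        # Under random assignment, two categories of sizes a and b share at
        # least one page with probability roughly 1 - exp(-a*b / total_pages).
        return sum(1 - exp(-a * b / total_pages)
                   for a, b in combinations(sizes, 2))

    # Made-up histogram: a few huge categories, many small ones.
    sizes = [100_000] * 5 + [10_000] * 50 + [500] * 1_000
    print(f"{expected_nonempty_pairs(sizes, 2_000_000):,.0f} non-empty pairs")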
Looking at it that way, I'll grant that you're probably about on target with your estimate, give or take some. Of course, that would increase the database size by what, tenfold? We only have a couple million pages.
Yeah, but that's the exact same problem - how does a full-text search / tagging engine do it? It's facing the same issues we're talking about. The only thing I can think of is that they're using a completely different data structure (and what would that be?) and/or a faster (compiled?) language.
They're definitely using a totally different data structure. They just index every word, I assume in a hash table or something. I'm pretty sure they don't store all the intersections, but however they handle it, it's presumably fast enough to be workable.
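For what it's worth, here's a minimal Python sketch of what I mean by indexing every word - an inverted index whose posting lists are intersected on the fly. It's just an illustration of the idea, not how any real engine (or MediaWiki's search) is actually implemented:

    from collections import defaultdict

    # Inverted index: each key (word or category) maps to the list of page
    # IDs that contain it.
    index = defaultdict(list)

    def add_page(page_id, keys):
        for key in keys:
            index[key].append(page_id)  # pages added in increasing ID order

    def intersect(*keys):
        # Intersect on the fly, starting from the shortest posting list;
        # nothing is precomputed, so there is no n-choose-k blowup.
        lists = sorted((index[k] for k in keys), key=len)
        result = set(lists[0])
        for lst in lists[1:]:
            result.intersection_update(lst)
        return sorted(result)

    add_page(1, ["Physics", "Biographies"])
    add_page(2, ["Physics", "Stubs"])
    add_page(3, ["Biographies", "Stubs"])
    print(intersect("Physics", "Biographies"))  # [1]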
One-second delays are fairly typical for Wikipedia searches, and I would expect the category list to be somewhat faster, since it's much smaller. (Even the largest categories are substantially smaller than the number of pages containing common indexed words like "all" [545,636 pages] or "not" [755,149 pages], and there are more such words than there are categories that big.)
A faster language isn't really relevant, since the slowdown is on the database side, not the PHP side. You can bet that MySQL is written in a compiled language.