Simetrical <Simetrical+wikitech@...> writes:
On 9/12/06, Aerik Sylvan <aerik@...> wrote:
So, *if* throwing SQL at the existing table is too slow (and it looks like it probably is), then the next alternative is to create records for the intersections themselves, so the retrieval query is a simple one.
Common queries would need to be stored and maintained separately, yes. This is fine and dandy as long as no one tries exotic intersections; in that case it wouldn't be too much more load than the current system, but that's an empirical question whose answer is unknown and can't be assumed.
Well, I'm actually proposing to write all the non-empty intersections (up to some limit) into the database *IF* throwing SQL at the existing structure is too inefficient. These would be written when a page is saved (perhaps only if the categories on the page have changed?).
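For reference, the "throw SQL at the existing structure" query I've been benchmarking is just a self-join on categorylinks - the category names below are made up, and I'm going from memory on the column names (cl_from / cl_to), so treat it as a sketch:

  SELECT c1.cl_from
  FROM categorylinks AS c1
    JOIN categorylinks AS c2 ON c2.cl_from = c1.cl_from
  WHERE c1.cl_to = 'Apples'
    AND c2.cl_to = 'Pies';

Every additional category in the intersection means one more join against the same big table, which is exactly where I'd expect the cost to climb.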
So you can see that the number of combinations would get to be quite a lot. I would propose writing them into the categorylinks table.
Wait, write an *exponential* number of combinations into the categorylinks table? Even a thousand categories (and Wikipedia has far more) would work out to a *googol cubed* different combinations. Not only is there not enough memory and disk space on the planet to store that, there aren't that many *particles in the universe*.
Am I misunderstanding you? Maybe you mean only non-empty category intersections? That requires slightly more thought as to scalability. If you have a million pages, and each has two random categories, there can be at most a million intersections (if no two pages have the same pair of categories). Bump that up to three categories per page, however, and the maximum number of pairwise intersections is three million; with four it's six million, and so on. Counting intersections of every size, it's O(m*2^n) rows: linear in the number of pages, exponential in the categories per page. That's also far too high to be acceptable, unless you can guarantee that n remains very small (which you can't).
It's not actually exponential (because we're only counting distinct *unordered* combinations). After a bunch of research I found out that the number of intersections is what math-speak calls "n choose k" (see http://mathworld.wolfram.com/BinomialCoefficient.html), and the way to calculate it (for each k, where k is the number of intersecting categories) is n!/((n-k)!k!). For our purposes, I did a quick sampling of 10 English Wikipedia pages and got an average of 3.8 categories per page; let's round up and say 4. If we want to store intersections (with a limit of 4 intersecting categories), the count is the sum of that little formula for k=2, k=3, and k=4 (with n=4), which is 6 + 4 + 1 = 11 intersections, in addition to the single categories we're already storing.
If a page has 5 categories, the numbers go up: instead of 11 we'd have to store 10 + 10 + 5 = 25 intersections (and, correspondingly, we'd be doing 11 or 25 extra INSERT statements at save time...). So, if we had a histogram of categories per page for Wikipedia, we could predict quite accurately how many non-empty intersections there would be; I'd guess somewhere between 11 and 25 million. What's the retrieval time for a simple SELECT query against 25 million rows?
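To make those INSERTs concrete, here's the kind of row I'm picturing - the combined-key encoding (sorted names joined with '|') is just a placeholder, not a worked-out scheme, and I'm ignoring cl_sortkey and friends:

  -- page 12345 saved with the three categories Apples, Desserts, Pies:
  INSERT INTO categorylinks (cl_from, cl_to) VALUES
    (12345, 'Apples|Desserts'),
    (12345, 'Apples|Pies'),
    (12345, 'Desserts|Pies'),
    (12345, 'Apples|Desserts|Pies');

With three categories that's 4 extra rows; with four it's the 11 above, with five the 25. Retrieval then becomes the same single-key lookup we already do for an ordinary category, just against the combined key.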
Basically, the most practical solution in SQL would probably be to pick the ten thousand or hundred thousand or million most-visited intersections and maintain them as virtual categories, so they can be retrieved in the same time as normal categories. The only question is what percentage of queries this will deal with, which (as I said) is unknown at this point. Domas seemed rather skeptical of the efficiency of even this scheme, and given your benchmarks in a much smaller wiki his skepticism seems quite warranted.
I suppose we could come up with some scheme to do this - cache intersections for a period of time, then keep the ones that are popular... The only thing is that, by definition, this happens at query time, and each uncached intersection would take something around (probably greater than) 2 tenths of a second until it's cached. Perhaps that's the way to do it - launch it as a beta feature, fill up the cache, then release it fully.
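If we did go that route, I'd picture a separate cache table rather than overloading categorylinks. Every name below is made up, purely to illustrate the shape of the thing:

  CREATE TABLE intersection_cache (
    ic_key     VARCHAR(255) NOT NULL,  -- sorted category names joined into one key
    ic_page    INT UNSIGNED NOT NULL,  -- a page belonging to that intersection
    ic_touched TIMESTAMP    NOT NULL,  -- last time this key was requested
    PRIMARY KEY (ic_key, ic_page),
    KEY ic_touched (ic_touched)
  );

The first request for a key pays the 0.2-second-or-worse query and fills in the rows; later requests are an ordinary indexed lookup, and a maintenance script can expire keys whose ic_touched is stale - which is more or less the beta-period cache-warming idea above.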
Brion's idea of a full-text search seems to be the way to go at this point, at least to me. After all, searching for "apple pie" does nothing but give you the intersection of articles containing the words "apple" and "pie", and we already do that, so it can't be overwhelmingly expensive, I suppose. (But a disclaimer: I am not, in fact, particularly knowledgeable in *any* of this stuff.)
On 9/12/06, David Gerard <dgerard@...> wrote:
Well, yes ... How does other software with arbitrary tags do it?
And that's a complementary way to go.
Yeah, but that's the exact same problem - how does a full-text search / tagging engine do it? It's facing the same issues we're talking about. The only thing I can think of is that they're using a completely different data structure (and what would that be?) and/or a faster (compiled?) language.
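That said, the MySQL full-text machinery we already run does effectively do an index intersection for us. Assuming the searchindex table still looks roughly the way I remember it (si_page / si_text with a FULLTEXT index), the "apple pie" example is just a boolean-mode match:

  SELECT si_page
  FROM searchindex
  WHERE MATCH(si_text) AGAINST('+apple +pie' IN BOOLEAN MODE);

If category names were folded into that index as extra tokens (one synthetic token per category, say), the same machinery would presumably give category intersections without any of the combinatorial storage above - though I have no idea how it performs at Wikipedia's scale, so take this as a guess at what Brion has in mind rather than a description of it.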
Best Regards, Aerik
P.S. Sorry for the long post, but I wanted to leave everything in context.