Re: [Wikitech-l] Category intersection

12 Sep 2006


      On 9/12/06, Aerik Sylvan aerik@thesylvans.com wrote:
...
So, *if* throwing sql at the existing table is too slow (and it looks like
it probably is), then the next alternative is to create records for the
intersections themselves, so the retrieval query is a simple one.
Common queries would need to be stored and maintained separately, yes.
 This is fine and dandy as long as no one tries exotic intersections,
it wouldn't be too much more load than the current system, but that's
an empirical question whose answer is unknown and can't be assumed.
...
So you can see that the number of combinations would get to be quite a lot.
I would propose writing them into the categorylinks table
Wait, write an *exponential* number of combinations into the
categorylinks table?  Even a thousand categories (and Wikipedia has
far more) would work out to a *googol cubed* different combinations.
Not only is there not enough memory and disk space on the planet to
store that, there aren't that many *particles in the universe*.
Am I misunderstanding you?  Maybe only non-empty category
intersections?  That requires slightly more thought as to scalability.
 If you have a million pages, and each has two random categories,
there can be a maximum of a million intersections (if no two pages
have the same two categories).  If you bump that up to three
categories, however, the maximum possible number is three million;
four is six million, etc.  It's O(m*2^n), or linear-exponential time.
That's also far too high to be acceptable, unless you can guarantee
that n remains very small (which you can't).
Basically, the most practical solution in SQL would probably be to
pick the ten thousand or hundred thousand or million most-visited
intersections and maintain them as virtual categories, so they can be
retrieved in the same time as normal categories.  The only question is
what percentage of queries this will deal with, which (as I said) is
unknown at this point.  Domas seemed rather skeptical of the
efficiency of even this scheme, and given your benchmarks in a much
smaller wiki his skepticism seems quite warranted.
Brion's idea of a full-text search seems to be the way to go at this
point, at least to me.  After all, searching for "apple pie" does
nothing but give you the intersection of articles containing the words
"apple" and "pie", and we already do that, so it can't be
overwhelmingly expensive, I suppose.  (But a disclaimer: I am not, in
fact, particularly knowledgeable in *any* of this stuff.)
On 9/12/06, David Gerard dgerard@gmail.com wrote:
...
Well, yes ... How does other software with arbitrary tags do it?
And that's a complementary way to go.

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

2003

2002

Re: [Wikitech-l] Category intersection