Simetrical wrote:
See http://bugs.wikimedia.org/show_bug.cgi?id=5244 and the various things duped to it. I'm pretty sure performance would be a major issue here. For instance, finding the first 200 pages in a single category requires iterating over at most 200 members of that category, and the same is true of all other operations currently supported by categories (as well as unions). Finding the first 200 pages in the intersection of two categories, however, has no upper bound on the number of iterations required: if the two categories share fewer than 200 pages and neither is a subset of the other, you have to go through every page in each category.
Has anyone written code that can handle this efficiently? Is such code even possible?
I (and I'm sure many others) have been following this topic on and off for a long time. It seems pretty clear that the majority of the community (as represented by the people who have voiced an opinion on it) wants this functionality, albeit with a few strong dissenters. The remaining issues are 1) how to implement it and 2) whether it can be implemented efficiently.
After reading Brion's and others' comments, it sounds to me like the developer community is open to the possibility that it can be implemented efficiently. I have written a version myself using SQL on the existing schema, but it was rejected as too inefficient. I think the next step is for the development community to come up with a few acceptable candidate implementations and then toss them back to the Wikipedia community (the main "customer" for this functionality).
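To make the discussion concrete, here is a minimal sketch of the kind of join-based intersection query I mean, against the existing categorylinks table. It is not the exact query that was rejected; the category names and the 200-row limit are placeholders mirroring Simetrical's example:

  -- Sketch only: self-join categorylinks to find pages that appear in
  -- both categories, then fetch their titles. 'Living_people' and
  -- 'American_novelists' are placeholder category names.
  SELECT page_title
  FROM page
  JOIN categorylinks AS c1 ON c1.cl_from = page_id
  JOIN categorylinks AS c2 ON c2.cl_from = page_id
  WHERE c1.cl_to = 'Living_people'
    AND c2.cl_to = 'American_novelists'
  ORDER BY c1.cl_sortkey
  LIMIT 200;

The worry is the one Simetrical describes: MySQL walks one category's index and probes the other for each candidate row, so if the two categories share few pages it may scan an entire category before the 200-row limit is filled.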
For the purposes of evaluating possible solutions, I think one key question recently raised here has been under-discussed: how often will this be used? If it will be used very frequently, the solution will have to be more streamlined and efficient than if it gets only occasional use. There have been objections to using various SQL methods (including mine) on the existing structure, but those discussions need to happen in the context of expected usage. We should determine whether an SQL-based solution is possible (specifically in MySQL; we really need a MySQL expert to comment on the performance of the JOIN, EXISTS, and GROUP BY/COUNT approaches, since we are throwing around a lot of conjecture about its inner workings), or whether something else (like Brion's Lucene suggestion) will be necessary.
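For comparison, here is a rough sketch of the GROUP BY and COUNT variant that keeps coming up, again with placeholder category names; whether MySQL can execute this without scanning both categories in full is exactly the question for the experts:

  -- Sketch only: a page belongs to the intersection when it has one
  -- categorylinks row for each requested category. The HAVING count
  -- must equal the number of names in the IN list.
  SELECT page_title
  FROM page
  JOIN categorylinks ON cl_from = page_id
  WHERE cl_to IN ('Living_people', 'American_novelists')
  GROUP BY page_id, page_title
  HAVING COUNT(*) = 2
  LIMIT 200;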
Regards, Aerik