On Sat, Mar 1, 2008 at 7:44 PM, Steve Bennett <stevagewp(a)gmail.com> wrote:
> What kind of use cases do you imagine for category intersection? I
> suspect that as soon as you open the door to intersections ("hey
> everyone you can find all the articles that are both 1895 deaths and
> poets!") then people will want arbitrary numbers of intersections
> ("hmm, '1895 deaths, poets and unreferenced' didn't work"). Maybe it
> would be possible, along the lines you suggest, to allow greater
> numbers of intersections in some time-restricted way ("You requested a
> 4-level intersection less than a minute ago. Please wait and try
> again.")

Throttling is a fairly ugly solution, and there's no good reason it
should be necessary. With this mechanism, we could allow up to five-
or six-way intersections reasonably enough. It would just be a range
scan, picking 10 or 15 or whatever rows and then merging them
PHP-side. This is a primary-key lookup and should be very fast. In
fact, we don't even need a reverse index, or at least I can't think
why we would: reverse lookups (pages to intersections) could just use
categorylinks. I don't think a primary-key lookup on InnoDB of under
20 rows per query is anything to worry much about.
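
To make the "10 or 15 rows" concrete, here is a rough Python sketch
(the real merge would be PHP-side). It assumes the mechanism stores one
precomputed entry per pair of categories, which is my reading of the
numbers above: five categories give 10 pairs, six give 15. The dict
`pair_table` and the function names are hypothetical stand-ins, not
anything that exists in MediaWiki.

```python
from itertools import combinations


def pair_keys(categories):
    """Return the two-category keys needed for an n-way intersection.

    Names are sorted so each pair has one canonical key.
    """
    return list(combinations(sorted(categories), 2))


def intersect(pair_table, categories):
    """Merge the pairwise page sets application-side.

    pair_table is a hypothetical stand-in for the precomputed table:
    it maps a sorted (category, category) key to a set of page titles.
    """
    page_sets = [pair_table.get(key, set()) for key in pair_keys(categories)]
    return set.intersection(*page_sets) if page_sets else set()


# Invented sample data, for illustration only.
pair_table = {
    ("1895 deaths", "Poets"): {"Page A", "Page B"},
    ("1895 deaths", "Unreferenced"): {"Page B", "Page C"},
    ("Poets", "Unreferenced"): {"Page B", "Page D"},
}
result = intersect(pair_table, ["Poets", "1895 deaths", "Unreferenced"])
```

The merge works because intersecting every pairwise intersection gives
the same pages as intersecting the categories themselves.
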

The problem with allowing a large number of intersections with this
method is that too many rows might be returned if one of the component
intersections is too large. You could easily get hundreds or thousands
of rows back if you retrieved all Americans + politicians, or
whatever. If you wanted all American politician/artists, you'd get all
American politicians, all American artists, and all
politician/artists. So that's actually a problem even for three-way
intersection. And retrieving in sorted order will be a problem.
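
A small Python sketch of that blow-up, with entirely invented category
sizes: it counts how many rows the three pairwise lookups would pull
back versus how many pages survive the merge. The category contents and
helper names are assumptions for illustration only.

```python
from itertools import combinations


def build_pair_table(categories):
    """Precompute every pairwise intersection, keyed by sorted name pair."""
    return {
        (a, b): categories[a] & categories[b]
        for a, b in combinations(sorted(categories), 2)
    }


def fetched_vs_returned(pair_table, names):
    """Return (rows fetched across all pair lookups, pages in the result)."""
    keys = list(combinations(sorted(names), 2))
    fetched = sum(len(pair_table[key]) for key in keys)
    returned = len(set.intersection(*(pair_table[key] for key in keys)))
    return fetched, returned


# Invented page-id ranges; real category sizes will differ.
categories = {
    "Americans": set(range(0, 5000)),
    "Politicians": set(range(0, 1000)),
    "Artists": set(range(990, 2500)),
}
table = build_pair_table(categories)
fetched, returned = fetched_vs_returned(
    table, ["Americans", "Politicians", "Artists"]
)
```

With these made-up numbers the lookups fetch 2520 rows (1000 American
politicians, 1510 American artists, 10 politician/artists) to return 10
pages.
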
So I guess I've answered my own question from before, about drawbacks.
Fulltext of some kind is probably a better solution than this.

> Is it not feasible, for example, to perform all intersections on a
> different machine or something? I seem to recall someone was operating
> a 3rd party category intersection site...?

Third parties or users of the toolserver may implement
poor-performance intersections. Users of those services may be
willing to wait a long time for results, because they recognize that
they're not using an official service. And those services don't have
the traffic that an integrated feature would have, probably not to
within an order of magnitude. So an inefficient implementation is not
really acceptable for running on the main servers, even if it might be
acceptable for a third party.