On Sat, Mar 1, 2008 at 7:44 PM, Steve Bennett stevagewp@gmail.com wrote:
> What kind of use cases do you imagine for category intersection? I suspect that as soon as you open the door to intersections ("hey everyone you can find all the articles that are both 1895 deaths and poets!") then people will want arbitrary numbers of intersections ("hmm, '1895 deaths, poets and unreferenced' didn't work"). Maybe it would be possible, along the lines you suggest, to allow greater numbers of intersections in some time-restricted way ("You requested a 4-level intersection less than a minute ago. Please wait and try again.").
Throttling is a fairly ugly solution, and I don't see a good reason it would be necessary. With this mechanism, we could reasonably allow up to five- or six-way intersections. It would just be a range scan, picking 10 or 15 or whatever rows and then merging them PHP-side. This is a primary-key lookup and should be very fast. In fact, we don't even need a reverse index, or at least I can't think why we would: reverse lookups (pages to intersections) could just use categorylinks. I don't think a primary-key lookup on InnoDB of under 20 rows per query is anything to worry much about.
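To make the "10 or 15" arithmetic concrete, here is a rough Python sketch. It assumes the scheme stores one precomputed intersection per pair of categories, so an n-way request touches n(n-1)/2 pair keys; the function name pair_keys is made up for illustration.

```python
from itertools import combinations

def pair_keys(categories):
    """Return the sorted category pairs whose stored intersections would be scanned.

    Assumption: intersections are precomputed pairwise and keyed by the
    alphabetically ordered pair of category names.
    """
    return list(combinations(sorted(categories), 2))

print(len(pair_keys(["A", "B", "C", "D", "E"])))       # 10 pairs for a five-way request
print(len(pair_keys(["A", "B", "C", "D", "E", "F"])))  # 15 pairs for a six-way request
```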
The problem with allowing a large number of intersections with this method is that too many rows may be returned if one of the component intersections is large. You could easily get hundreds or thousands of rows back if you retrieved all Americans + politicians, or whatever. If you wanted all American politician/artists, you'd get all American politicians, all American artists, and all politician/artists. So that's actually a problem even for three-way intersections. And retrieving in sorted order will be a problem.
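A toy version of that merge step, in Python rather than PHP, shows the cost. Everything here is hypothetical: pair_table stands in for the stored pairwise intersections, intersect_via_pairs is an invented name, and the page ids are made up.

```python
from itertools import combinations

def intersect_via_pairs(pair_table, categories):
    """Merge stored pairwise intersections into an n-way intersection.

    pair_table maps a sorted (cat_a, cat_b) tuple to the set of page ids
    that are in both categories. Returns (pages, rows_fetched).
    """
    rows_fetched = 0
    result = None
    for key in combinations(sorted(categories), 2):
        pages = pair_table.get(key, set())
        rows_fetched += len(pages)  # every row of every pair is read
        result = set(pages) if result is None else result & pages
    return (result or set()), rows_fetched

# Made-up data; real pair sets could hold thousands of pages.
pair_table = {
    ("American", "artist"): {1, 2, 3, 4},
    ("American", "politician"): {3, 4, 5, 6, 7},
    ("artist", "politician"): {4, 8},
}
pages, rows = intersect_via_pairs(pair_table, ["American", "politician", "artist"])
print(pages, rows)  # {4} 11
```

Eleven rows are read to find a single page, and the count is driven by the largest pair rather than by the size of the answer.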
So I guess I've answered my own question from before, about drawbacks. Fulltext of some kind is probably a better solution than this.
> Is it not feasible, for example, to perform all intersections on a different machine or something? I seem to recall someone was operating a 3rd party category intersection site...?
Third parties or users of the toolserver may implement poor-performance intersections. Users of those services may be willing to wait a long time for results, because they recognize that they're not using an official service. Those services also don't have the traffic that an integrated feature would have, probably not even to within an order of magnitude. So an inefficient implementation is not really acceptable on the main servers, even if it might be acceptable when run by a third party.