Simetrical wrote:
See http://bugs.wikimedia.org/show_bug.cgi?id=5244 and the various things duped to it. I'm pretty sure performance would be a major issue here. For instance, finding the first 200 pages in a single category requires iterating over at most 200 members of that category, and the same is true of all other operations currently supported by categories (as well as unions). Finding the first 200 pages in the intersection of two categories, however, has no upper bound on the number of iterations required: if the two categories share fewer than 200 pages and neither is a subset of the other, you have to go through every page in each category.
Has anyone written code that can handle this efficiently? Is such code even possible?
I (and I'm sure many others) have been following this topic on and off for a long time. It seems pretty clear that the majority of the community (as represented by the people who have voiced an opinion on it) wants this functionality, albeit with a few strong dissenters. The remaining issues are 1) how to implement it and 2) whether it can be implemented efficiently.
After reading Brion's and others' comments, it sounds to me like the developer community is open to the possibility that it can be implemented efficiently. I have written a version myself using SQL on the existing schema, but it was rejected as too inefficient. I think the next step is for the development community to come up with a few acceptable candidate implementations and then toss them back to the Wikipedia community (the main "customer" for this functionality).
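To make the discussion concrete, here is a minimal sketch of the kind of join-based intersection query I mean, against the existing categorylinks table. It is not the exact query that was rejected; the category names and the 200-row limit are placeholders mirroring Simetrical's example:

  -- Sketch only: self-join categorylinks to find pages that appear in
  -- both categories, then fetch their titles. 'Living_people' and
  -- 'American_novelists' are placeholder category names.
  SELECT page_title
  FROM page
  JOIN categorylinks AS c1 ON c1.cl_from = page_id
  JOIN categorylinks AS c2 ON c2.cl_from = page_id
  WHERE c1.cl_to = 'Living_people'
    AND c2.cl_to = 'American_novelists'
  ORDER BY c1.cl_sortkey
  LIMIT 200;

The worry is the one Simetrical describes: MySQL walks one category's index and probes the other for each candidate row, so if the two categories share few pages it may scan an entire category before the 200-row limit is filled.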
For the purposes of evaluating possible solutions, I think one key question recently raised here has been under-discussed: how often will this be used? If it will be used very frequently, the solution will have to be more streamlined and efficient than if it gets only occasional use. There have been objections to using various SQL methods (including mine) on the existing structure, but those discussions need to happen in the context of expected usage. We should determine whether an SQL-based solution is possible (specifically in MySQL; we really need a MySQL expert to comment on the performance of the JOIN, EXISTS, and GROUP BY/COUNT approaches, since we are throwing around a lot of conjecture about its inner workings), or whether something else (like Brion's Lucene suggestion) will be necessary.
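For comparison, here is a rough sketch of the GROUP BY and COUNT variant that keeps coming up, again with placeholder category names; whether MySQL can execute this without scanning both categories in full is exactly the question for the experts:

  -- Sketch only: a page belongs to the intersection when it has one
  -- categorylinks row for each requested category. The HAVING count
  -- must equal the number of names in the IN list.
  SELECT page_title
  FROM page
  JOIN categorylinks ON cl_from = page_id
  WHERE cl_to IN ('Living_people', 'American_novelists')
  GROUP BY page_id, page_title
  HAVING COUNT(*) = 2
  LIMIT 200;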
Regards, Aerik