Simetrical wrote:
See http://bugs.wikimedia.org/show_bug.cgi?id=5244 and the various things duped to it. I'm pretty sure performance would be a major issue here; for instance, finding the first 200 pages in a category is limited to iterating over 200 members of the category, and likewise for all other operations currently supported by categories (as well as unions), but finding the first 200 pages in the intersection of two categories has no upper bound on the number of iterations required: you have to go through every page in each category in the event that they have fewer than 200 shared pages and neither is a subset of the other.
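To make the cost argument above concrete, here is a minimal sketch, not MediaWiki code. It assumes each category is available as a list of page IDs already sorted ascending (as an index scan might return them); the function names and the step counter are hypothetical and exist only for illustration.

```python
from typing import List, Tuple


def first_n_in_category(members: List[int], n: int) -> Tuple[List[int], int]:
    """Return the first n members of one category and the rows examined.

    The work is bounded by n no matter how large the category is.
    """
    result = members[:n]
    return result, len(result)


def first_n_in_intersection(
    a: List[int], b: List[int], n: int
) -> Tuple[List[int], int]:
    """Merge-walk two sorted member lists until n shared pages are found.

    Returns the shared pages and the number of loop iterations. The walk
    only stops early once n matches exist, so with few shared pages it
    runs until one list is exhausted.
    """
    i = j = steps = 0
    shared: List[int] = []
    while i < len(a) and j < len(b) and len(shared) < n:
        steps += 1
        if a[i] == b[j]:
            shared.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot be in b: everything left in b is larger
        else:
            j += 1  # b[j] cannot be in a: everything left in a is larger
    return shared, steps


# Hypothetical data: two large categories with no pages in common.
evens = list(range(0, 200_000, 2))
odds = list(range(1, 200_000, 2))

single_pages, single_steps = first_n_in_category(evens, 200)
both_pages, both_steps = first_n_in_intersection(evens, odds, 200)
print(single_steps, both_steps)
```

With disjoint categories the single-category lookup stops after 200 rows, while the intersection walk keeps going until one list runs out, which is the unbounded case described above.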
Has anyone written code that can handle this efficiently? Is such code even possible?
I (and I'm sure many others) have been following this topic on and off for a long time. It seems pretty clear that the majority of the community (as represented by people who have voiced an opinion about it) wants this functionality (albeit with a few strong dissenters), but the remaining issues are 1) how to implement it and 2) whether it can be implemented efficiently.
After reading Brion's and others' comments, it sounds to me like the developer community is allowing for the possibility that it can be implemented efficiently. I myself have written a version using SQL on the existing schema, but this was rejected as too inefficient. I think the next steps are for the development community to come up with different acceptable implementations, and then toss them back to the Wikipedia community (the main "customer" for this functionality).
For the purposes of evaluating possible solutions, I think one key question recently brought up here has been under-discussed: how often will this be used? If this will be used very frequently, then the solution will have to be more streamlined and efficient than if it's going to get less usage. There have been objections to using various SQL methods (including mine) on the existing structure, but I think these discussions must happen in the context of usage. We should determine whether an SQL-based solution is possible, specifically in MySQL, or whether something else (like Brion's Lucene suggestion) will be necessary. We really need a MySQL expert to comment on the performance issues with the join/if exists/group by and count solutions, as we are throwing around a lot of conjecture about its inner workings.
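For readers unfamiliar with the three SQL approaches named above, the following is a rough sketch of what each might look like. It is not the code that was proposed or rejected. It assumes a cut-down `categorylinks` table with only the `cl_from` (page ID) and `cl_to` (category name) columns, uses made-up data, and runs on an in-memory SQLite database through Python's `sqlite3` module as a stand-in, so it says nothing about how MySQL would plan or execute these queries.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory table shaped like a reduced categorylinks."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE categorylinks ("
        " cl_from INTEGER NOT NULL,"  # page ID
        " cl_to TEXT NOT NULL,"       # category name
        " PRIMARY KEY (cl_from, cl_to))"
    )
    conn.execute("CREATE INDEX cl_to_from ON categorylinks (cl_to, cl_from)")
    rows = (
        [(p, "American_people") for p in (1, 2, 3, 4)]
        + [(p, "Physicists") for p in (2, 4, 5)]
        + [(p, "Living_people") for p in (1, 2)]
    )  # hypothetical sample data
    conn.executemany("INSERT INTO categorylinks VALUES (?, ?)", rows)
    return conn


# 1) Self-join: one copy of the table per category.
JOIN_SQL = """
SELECT a.cl_from FROM categorylinks AS a
JOIN categorylinks AS b ON b.cl_from = a.cl_from
WHERE a.cl_to = ? AND b.cl_to = ?
ORDER BY a.cl_from LIMIT ?
"""

# 2) EXISTS: scan one category, probe the other for each row.
EXISTS_SQL = """
SELECT a.cl_from FROM categorylinks AS a
WHERE a.cl_to = ?
  AND EXISTS (SELECT 1 FROM categorylinks AS b
              WHERE b.cl_from = a.cl_from AND b.cl_to = ?)
ORDER BY a.cl_from LIMIT ?
"""

# 3) GROUP BY and COUNT: keep pages that appear in both categories.
#    Relies on (cl_from, cl_to) being unique and the two names differing.
GROUP_SQL = """
SELECT cl_from FROM categorylinks
WHERE cl_to IN (?, ?)
GROUP BY cl_from
HAVING COUNT(*) = 2
ORDER BY cl_from LIMIT ?
"""


def intersect(conn: sqlite3.Connection, sql: str, cat_a: str, cat_b: str,
              limit: int = 200) -> list:
    """Run one of the queries above and return the matching page IDs."""
    return [row[0] for row in conn.execute(sql, (cat_a, cat_b, limit))]


conn = build_demo_db()
for name, sql in (("join", JOIN_SQL), ("exists", EXISTS_SQL), ("group", GROUP_SQL)):
    print(name, intersect(conn, sql, "American_people", "Physicists"))
```

All three forms return the same pages; where they would differ is in how much of each category the database has to read before it can stop, which is the part that needs an expert's input.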
Regards, Aerik
On 05/09/06, Aerik Sylvan aerik@thesylvans.com wrote:
For the purposes of evaluating possible solutions, I think one key question recently brought up here has been under-discussed: how often will this be used? If this will be used very frequently, then the solution will have to be more streamlined and efficient than if it's going to get less usage.
I think if you give people this sort of combinable tagging, not mere categories as we have them, they'll go wild with it. So assume it will be popular.
- d.
On 9/5/06, Aerik Sylvan aerik@thesylvans.com wrote:
For the purposes of evaluating possible solutions, I think one key question recently brought up here has been under-discussed: how often will this be used? If this will be used very frequently, then the solution will have to be more streamlined and efficient than if it's going to get less usage.
I'm guessing very high usage. A category such as "American people" is fairly worthless by itself, after all.
we really need a MySQL expert to comment on the performance issues with the join/if exists/group by and count solutions
Well, Domas certainly qualifies as a MySQL expert.
On 9/6/06, Simetrical Simetrical+wikitech@gmail.com wrote:
I'm guessing very high usage. A category such as "American people" is fairly worthless by itself, after all.
As a "category" by *itself*, yes. As a tag, combined with other attributes for searching, it's very valuable.
(not that we have those yet)
Steve
On 9/11/06, Steve Bennett stevage@gmail.com wrote:
On 9/6/06, Simetrical Simetrical+wikitech@gmail.com wrote:
I'm guessing very high usage. A category such as "American people" is fairly worthless by itself, after all.
As a "category" by *itself*, yes. As a tag, combined with other attributes for searching, it's very valuable.
Yes... but combining it with other attributes for searching means category intersection, or some analogue, which is why there would be high usage.
wikitech-l@lists.wikimedia.org