On Fri, Sep 19, 2008 at 8:34 AM, Daniel Schwen <lists(a)schwen.de> wrote:
Unfortunately you are comparing Apples to Apple Orchards.
And maybe I wasn't making myself clear enough. Finding common super-categories
is actually not the biggest problem. That direction branches less than going
_down_ in the tree.
[snip]
I just wanted to step up and point out that Daniel Schwen is correct.
I created an example tool some time back (perhaps some of you remember
it?) which provided instant (i.e. like Google search, <10ms
computation times) results for arbitrary Commons category
intersections, differences, and unions.
The system worked by taking a dump of the Commons categories and
treating them like tags and indexing them with a special inverted
index which is very good at those kinds of set operations. Because
it was using the existing categories the results were not
super-useful.
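
To illustrate the idea, here is a minimal sketch of an inverted index
that treats categories as flat tags. It is not the original tool's
code: the class name, method names, and file names are all made up for
the example, and a real index would use far more compact posting lists
than Python sets.

```python
from collections import defaultdict


class TagIndex:
    """Toy inverted index: maps each tag to the set of files carrying it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, file_name, tags):
        # Index the file under every tag it carries.
        for tag in tags:
            self.postings[tag].add(file_name)

    def intersection(self, *tags):
        # Start from the smallest posting set to keep intermediate results small.
        sets = sorted((self.postings.get(t, set()) for t in tags), key=len)
        if not sets:
            return set()
        result = set(sets[0])
        for s in sets[1:]:
            result &= s
        return result

    def union(self, *tags):
        result = set()
        for t in tags:
            result |= self.postings.get(t, set())
        return result

    def difference(self, include, exclude):
        # Files tagged `include` that are not tagged `exclude`.
        return self.postings.get(include, set()) - self.postings.get(exclude, set())


# Hypothetical sample data, for illustration only.
index = TagIndex()
index.add("Eiffel_Tower_night.jpg", ["Paris", "Towers", "Night"])
index.add("Louvre_pyramid.jpg", ["Paris", "Museums"])
index.add("Tokyo_Tower.jpg", ["Tokyo", "Towers", "Night"])

print(index.intersection("Paris", "Towers"))
print(index.difference("Towers", "Paris"))
```

Each query only touches the posting sets of the tags named in it, which
is why this kind of lookup stays fast as long as the tags are flat.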
I had hoped that it would spur interest in adopting or changing to
some tag system but instead I mostly got people complaining that it
didn't pick up subcategories. I pointed out that (1) picking up
subcategories is technically infeasible in this kind of fast indexing
scheme (imagine you move a huge second level category into another
category, now the database must make millions of expensive updates)
and that (2) our current system produces utter nonsense if you flatten
the categories (a subject I've posted on separately several times).
Point (2) mostly got answers suggesting various heuristics like "only
traverse N deep", which also fail, just in more subtle ways: they punish
deep categorization by 'losing' that content, and they still produce
nonsense results, only less often and less offensively. Most
importantly, they do not solve (1).
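
Both problems can be sketched with a toy category tree. Everything
below is hypothetical (the tree, the file names, and the helper
functions), and the update count is a simplified model that assumes a
flattened index stores one entry per (ancestor category, file) pair.

```python
from collections import deque

# Hypothetical category tree: category -> direct subcategories.
children = {
    "Animals": ["Mammals"],
    "Mammals": ["Cats"],
    "Cats": ["Kittens"],
    "Kittens": [],
}

# Hypothetical files placed directly in each category.
files = {
    "Animals": {"zoo.jpg"},
    "Mammals": {"cow.jpg"},
    "Cats": {"tabby.jpg"},
    "Kittens": {"kitten.jpg"},
}


def expand(category, max_depth=None):
    """Collect files in `category` and its subcategories (breadth-first).

    With max_depth=N, subcategories more than N levels down are ignored.
    """
    seen = {category}
    result = set()
    queue = deque([(category, 0)])
    while queue:
        cat, depth = queue.popleft()
        result |= files.get(cat, set())
        if max_depth is not None and depth >= max_depth:
            continue
        for child in children.get(cat, []):
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return result


def updates_for_move(category, new_ancestors):
    """Entries a flattened index must add when `category` is moved.

    `new_ancestors` is the new parent plus all of its ancestors; every
    file in the moved subtree has to be added under each of them.
    """
    return len(expand(category)) * len(new_ancestors)


print(sorted(expand("Animals")))               # full expansion
print(sorted(expand("Animals", max_depth=2)))  # kitten.jpg is lost
print(updates_for_move("Mammals", ["Biology", "Science", "Topics"]))
```

In this toy tree the depth limit silently drops kitten.jpg, and moving
a three-file subtree under three new ancestors already costs nine index
writes. The write count is the product of subtree size and ancestor
depth, which is the shape of the problem in (1).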
I think we've reached a point where many technical people here have
thought about this problem for a long time (years in some cases) and
over and over again have concluded that we cannot have fast
lookup/union/intersection/differencing tools for categories *AND* have
the tree auto-expanded in the results. If Commons were 1/20th its
size and not growing at a good rate, then yes, we could. But since the
fully expanded category tree will not fit in RAM even on my 32 GB
system, it just cannot work. The usefulness of categories as a lookup
tool for users is highly limited for purely technical reasons, if
nothing else.
If there is any technical person here who thinks otherwise, please
feel free to engage me in a sidebar conversation and I'll either
convince you that you're wrong, or I'll gladly eat crow. Otherwise,
we should consider further significant technical improvements to the
existing category system to be technically non-viable.
Regardless of the non-technical arguments against categories, the above
should be sufficient reason for us to adopt a tagging system, if only so
that we can build really good search tools. It could run in parallel with
the category system, as the category system isn't completely worthless
and it will take a long time to get tags applied to everything.