Re: [Commons-l] Our categories are broken, very broken

20 Sep 2008


      ...
I had hoped that it would spur interest in adopting or changing to
some tag system but instead I mostly got people complaining that it
didn't pick up subcategories.  I pointed out that (1) picking up
subcategories is technically infeasible in this kind of fast indexing
scheme (imagine you move a huge second level category into another
category, now the database must make millions of expensive updates)
and that
Just to complete the list of tools that do this, let me point to
http://toolserver.org/~daniel/WikiSense/CatScan.php.
...
(2) our current system produces utter nonsense if you flatten
the categories (a subject I've posted on separately several times).
Mostly (2) got answers suggesting various heuristics like "only
traverse N deep", which also fails but in more subtle ways (punishing
deep categorization by 'losing' that content, and still producing
nonsense results too just less often and less offensively) but most
importantly does not solve (1).
Indeed, as is evident there. I think the only good way to get around this would
be to use a faceted categorization scheme.
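To make the "only traverse N deep" heuristic concrete, here is a minimal Python sketch. The category names and the `flatten` helper are hypothetical, not taken from any existing tool. It shows both failure modes mentioned above: deeply categorized content is lost, and off-topic content still slips in at shallow depth.

```python
from collections import deque

def flatten(subcats, root, max_depth=None):
    """Collect categories reachable from root, at most max_depth levels down.

    subcats maps a category name to a list of its direct subcategories.
    max_depth=None means "expand the whole tree".
    """
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        cat, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue  # cut-off reached: do not expand this category further
        for child in subcats.get(cat, ()):
            if child not in seen:  # also guards against cycles
                seen.add(child)
                queue.append((child, depth + 1))
    return seen

# Hypothetical toy hierarchy, for illustration only.
subcats = {
    "Animals": ["Mammals"],
    "Mammals": ["Dogs", "Mammals in art"],
    "Dogs": ["Dog breeds"],
    "Dog breeds": ["Poodles"],
}

shallow = flatten(subcats, "Animals", max_depth=2)
full = flatten(subcats, "Animals")

print("Poodles" in shallow)         # deep content is dropped by the cut-off
print("Mammals in art" in shallow)  # off-topic content still gets through
```

Raising the limit brings "Poodles" back but also pulls in more unrelated branches, so no single N works for every part of the tree.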
...
I think we've reached a point where many technical people here have
thought about this problem for a long time (years in some cases) and
over and over again have concluded that we can not have fast
lookup/union/intersection/differencing tools for categories *AND* have
the tree auto-expanded in the results.
True, and I have not come up with a good way either, but perhaps we should look
some more into what SMW does - it does support this kind of thing, right? How
does it scale? And wasn't Magnus working on some intersection thingy?
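For contrast, here is a rough sketch of why the flat case is cheap. If each category only indexes its direct members, then union, intersection and differencing are plain set operations. The file names and the `build_index` helper are made up for illustration; this is not how CatScan or SMW actually work.

```python
def build_index(page_cats):
    """Invert a page -> categories mapping into category -> set of pages."""
    index = {}
    for page, cats in page_cats.items():
        for cat in cats:
            index.setdefault(cat, set()).add(page)
    return index

# Hypothetical direct (non-expanded) category memberships.
page_cats = {
    "File:A.jpg": {"Dogs", "Paris"},
    "File:B.jpg": {"Dogs"},
    "File:C.jpg": {"Paris"},
}

index = build_index(page_cats)

both = index["Dogs"] & index["Paris"]       # intersection
either = index["Dogs"] | index["Paris"]     # union
only_dogs = index["Dogs"] - index["Paris"]  # difference
```

The expensive part is keeping such an index correct once each entry must also cover every subcategory. Moving one large category would then invalidate a large share of the entries.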
...
If commons were 1/20th its
size and not growing at a good rate, then yes, we could. But since the
full expanded category tree will not fit in ram even on my 32gbyte
system, it just cannot work.   Categories as a lookup tool for
users are highly limited for purely technical reasons if nothing else.
Actually, keeping in RAM the structure of, say, a million categories, where each
is itself in three categories, means three million pairs of IDs, each ID four
bytes wide: that's 24MB. Not so terrible. I have actually done that to run a
cycle detection on the category graph, and it's quite fast (with Java, anyway).
But it's not fast enough to handle deep intersection of categories for dozens if
not hundreds of requests per second, as would have to be expected for Wikipedia.
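The arithmetic and the cycle check can be sketched as follows. This is an illustrative Python version under the assumptions stated above (a million categories, three parents each, 4-byte IDs), not the Java code mentioned. The function names are made up for the example.

```python
def edge_list_bytes(num_categories, parents_per_category, id_bytes=4):
    """Size of a flat (child, parent) edge list: two IDs per pair."""
    return num_categories * parents_per_category * 2 * id_bytes

def has_cycle(num_nodes, edges):
    """Detect a cycle in the category graph with an iterative DFS.

    edges is a list of (child_id, parent_id) pairs; IDs are 0..num_nodes-1.
    """
    children = [[] for _ in range(num_nodes)]
    for child, parent in edges:
        children[parent].append(child)

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = [WHITE] * num_nodes
    for start in range(num_nodes):
        if color[start] != WHITE:
            continue
        color[start] = GREY
        stack = [(start, iter(children[start]))]
        while stack:
            node, it = stack[-1]
            for nxt in it:
                if color[nxt] == GREY:
                    return True  # edge back into the current path
                if color[nxt] == WHITE:
                    color[nxt] = GREY
                    stack.append((nxt, iter(children[nxt])))
                    break
            else:
                color[node] = BLACK  # all children explored
                stack.pop()
    return False

print(edge_list_bytes(1_000_000, 3))  # 24000000 bytes, i.e. 24MB
```

A single pass like this is linear in the number of edges, which fits with it being quick as a one-off job. It says nothing about serving many deep intersections per second.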
...
Otherwise,
we should consider further significant technical improvements to the
existing category system to be technically non-viable.
Hm... how fast is the growth of wikipedia in relation to the rate at which
computing power is increasing? My impression is that the latter is coming along
faster, so more complex operations become feasible with time.
...
Regardless of the non-technical arguments against categories the above
should be sufficient reason for us to adopt a tagging system. Just so
we can build really good search tools.  It could run in parallel with
the category system, as the category system isn't completely worthless
and it will take a long time to get tags applied to everything.
I think that would be terrible. We would have two messes instead of one. Two
systems that do kind of the same thing, but don't interact in a meaningful way.
Endless arguments, more of the gallery vs. category stuff. Ugh. No. Let's find
*one* good way.
-- daniel
