Re: [Commons-l] Our categories are broken, very broken

20 Sep 2008

...
  I had hoped that it would spur interest in adopting or changing to
 some tag system but instead I mostly got people complaining that it
 didn't pick up subcategories.  I pointed out that (1) picking up
 subcategories is technically infeasible in this kind of fast indexing
 scheme (imagine you move a huge second level category into another
 category, now the database must make millions of expensive updates)
 and that  
Just to complete the list of tools that do this, let me point to
<http://toolserver.org/~daniel/WikiSense/CatScan.php>.
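To make the cost in (1) concrete, here is a minimal sketch of what a flattened (pre-expanded) index has to do when a category is moved. It assumes a hypothetical index that stores one (ancestor, page) row for every page under every ancestor; the function names and the toy data are invented for illustration and are not how MediaWiki or CatScan store anything.

```python
from collections import defaultdict

def descendants(children, root):
    """Return root plus every category reachable below it."""
    seen, stack = {root}, [root]
    while stack:
        for child in children[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def rows_touched_by_move(children, members, moved, new_ancestors):
    """Upper bound on the (ancestor, page) rows a flattened index must add
    when `moved` is placed under a chain of new ancestors."""
    pages = set()
    for cat in descendants(children, moved):
        pages |= members[cat]
    return len(pages) * len(new_ancestors)

# Hypothetical toy tree: Animals > Birds > {Owls, Gulls}
children = defaultdict(set, {"Animals": {"Birds"}, "Birds": {"Owls", "Gulls"}})
members = defaultdict(set, {"Birds": {"p1"}, "Owls": {"p2", "p3"}, "Gulls": {"p4"}})

# Moving "Animals" under "Nature", which itself sits under "Topics":
print(rows_touched_by_move(children, members, "Animals", ["Nature", "Topics"]))  # 8
```

Four pages times two new ancestors gives eight rows here; with millions of files under a second-level category the same product is what makes the update expensive.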

...
  (2) our current system produces utter nonsense if you flatten
 the categories (a subject I've posted on separately several times).

 Mostly (2) got answers suggesting various heuristics like "only
 traverse N deep", which also fails but in more subtle ways (punishing
 deep categorization by 'losing' that content, and still producing
 nonsense results too just less often and less offensively) but most
 importantly does not solve (1). 
Indeed, as is evident there. I think the only good way to get around this would
be to use a faceted categorization scheme.
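The "only traverse N deep" heuristic mentioned in the quoted text can be sketched as a depth-limited breadth-first walk. This is only an illustration under assumed data: the category names, file names and the function `pages_within_depth` are made up, and no existing tool is claimed to work this way.

```python
from collections import deque

def pages_within_depth(children, members, root, max_depth):
    """Collect pages in root and its subcategories, at most max_depth levels down."""
    seen = {root}
    queue = deque([(root, 0)])
    pages = set()
    while queue:
        cat, depth = queue.popleft()
        pages |= members.get(cat, set())
        if depth == max_depth:
            continue  # do not descend any further
        for child in children.get(cat, ()):
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return pages

# Hypothetical chain: Birds > Owls > Barn owls > Barn owls in flight
children = {"Birds": ["Owls"], "Owls": ["Barn owls"], "Barn owls": ["Barn owls in flight"]}
members = {"Owls": {"owl.jpg"}, "Barn owls in flight": {"flight.jpg"}}

print(sorted(pages_within_depth(children, members, "Birds", 2)))  # ['owl.jpg']
print(sorted(pages_within_depth(children, members, "Birds", 3)))  # ['flight.jpg', 'owl.jpg']
```

With a limit of 2 the most carefully categorized file silently drops out of the result, which is the "punishing deep categorization" effect described above.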

...
  I think we've reached a point where many technical people here have
 thought about this problem for a long time (years in some cases) and
 over and over again have concluded that we can not have fast
 lookup/union/intersection/differencing tools for categories *AND* have
 the tree auto-expanded in the results.   
True, and I have not come up with a good way either, but perhaps we should look
some more into what SMW does - it does support this kind of thing, right? How
does it scale? And wasn't Magnus working on some intersection thingy?
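For comparison, this is roughly why the non-expanded case is cheap: with a flat index from label to set of pages, union, intersection and difference are single set operations with no tree to walk. A minimal sketch with invented labels and file names; it says nothing about how SMW or any intersection tool is actually implemented.

```python
def build_index(page_labels):
    """Invert {page: labels} into {label: set of pages}."""
    index = {}
    for page, labels in page_labels.items():
        for label in labels:
            index.setdefault(label, set()).add(page)
    return index

# Hypothetical flat labels (tags or directly assigned categories)
page_labels = {
    "owl1.jpg": {"owl", "night"},
    "owl2.jpg": {"owl", "day"},
    "moon.jpg": {"night"},
}
index = build_index(page_labels)

print(sorted(index["owl"] & index["night"]))  # intersection: ['owl1.jpg']
print(sorted(index["owl"] | index["night"]))  # union: ['moon.jpg', 'owl1.jpg', 'owl2.jpg']
print(sorted(index["owl"] - index["night"]))  # difference: ['owl2.jpg']
```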

...
  If Commons were 1/20th its
 size and not growing at a good rate, then yes, we could. But since the
 full expanded category tree will not fit in RAM even on my 32 GB
 system, it just can not work.   Categories as a useful lookup tool for
 users are highly limited for purely technical reasons if nothing else. 
Actually, keeping in RAM the structure of, say, a million categories, where each
is itself in three categories, means three million pairs of IDs, each ID four
bytes wide; that's 24 MB. Not so terrible. I have actually done that to run a
cycle detection on the category graphs, and it's quite fast (with Java, anyway).
But it's not fast enough to handle deep intersection of categories for dozens if
not hundreds of requests per second, as would have to be expected for Wikipedia.
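The arithmetic and the cycle check can be sketched as follows. This is a Python illustration under assumptions, not the Java code referred to above: the function names are made up, and the cycle check is an ordinary iterative three-colour depth-first search over (child, parent) ID pairs.

```python
from array import array

def edge_list_bytes(categories, parents_each, id_bytes=4):
    """Size of the raw (child, parent) pair list: two IDs per pair."""
    return categories * parents_each * 2 * id_bytes

print(edge_list_bytes(1_000_000, 3))  # 24000000 bytes, i.e. 24 MB

def has_cycle(num_nodes, edges):
    """Return True if the child -> parent graph contains a cycle."""
    adjacency = [[] for _ in range(num_nodes)]
    for child, parent in edges:
        adjacency[child].append(parent)
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour = array("b", [WHITE]) * num_nodes
    for start in range(num_nodes):
        if colour[start] != WHITE:
            continue
        colour[start] = GREY
        stack = [(start, iter(adjacency[start]))]
        while stack:
            node, parents = stack[-1]
            for nxt in parents:
                if colour[nxt] == GREY:
                    return True  # edge back into the current path
                if colour[nxt] == WHITE:
                    colour[nxt] = GREY
                    stack.append((nxt, iter(adjacency[nxt])))
                    break
            else:
                colour[node] = BLACK  # all parents explored
                stack.pop()
    return False

print(has_cycle(3, [(0, 1), (1, 2)]))          # False
print(has_cycle(3, [(0, 1), (1, 2), (2, 0)]))  # True
```

Note that the 24 MB figure covers only the raw pair list; the Python lists used here for clarity take considerably more memory than packed four-byte IDs would.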

...
  Otherwise,
 we should consider further significant technical improvements to the
 existing category system to be technically non-viable. 
Hm... how fast is the growth of Wikipedia in relation to the rate at which
computing power is increasing? My impression is that the latter is coming along
faster, so more complex operations become feasible with time.

...
  Regardless of the non-technical arguments against categories the above
 should be sufficient reason for us to adopt a tagging system. Just so
 we can build really good search tools.  It could run in parallel with
 the category system, as the category system isn't completely worthless
 and it will take a long time to get tags applied to everything. 
I think that would be terrible. We would have two messes instead of one. Two
systems that do kind of the same, but don't interact in a meaningful way. Endless
arguments, more of the gallery vs. category stuff. Ugh. No. Let's find *one*
good way.

-- daniel
