Re: [Commons-l] Our categories are broken, very broken

21 Sep 2008


      Gregory Maxwell schrieb:
...
People keep waving hands on this. It's time to "put up or shut up",
it's counterproductive that our less technical community thinks some
technical bullet is forth-coming.  At least suggest a data-structure
which can provide for reasonably fast worst case updates (i.e. nothing
worse than we have with changing templates today, O(N) on direct
users) and lightning fast intersections, if you can't or no one else
can, we need to just consider it impossible until someone does.
We can get roughly constant time lookup for intersections (well
really, linear with returned results, but thats fine) but I am not
aware of any way to accommodate this for tree data without making the
worst cast *update* hideously expensive like O(N*Avg_width^avg_depth)
random seeks.
I don't think there's a magic bullet, but non the less, when thinking about what
is possible and what isn't, we should look at the techniques that have been
tried so far, and get to know their strenths and limitations. And while I have
thought about the problem quite a bit, I have not yet looked at what the two
systems in question actually do.
...
...
Actually, keeping in RAM the structure of, say, a million categories, where each
is itself in three categories, means three million pairs of IDs, each four byte
wide, that's 24MB. Not so terrible. I have actually done that to run a cycle
detection on the category graphs, it's quite fast (with java, anyway). But it's
not fast enough to handle deep intersection of categories for dozents if not
hundreds of requests per second, as would have to be expected for wikipedia.
Thats cycle detection. Yes, you can do single point graph expansion
cheaply enough, but you can't do graph expansion AND intersection
cheaply because that expanding the entire tree, not just some subset.
Imagine fully expanding the entire hierarchy breadth first (stopping
at previously visited nodes to cut cycles), then storing it all in an
inverted index posting list.  The result is *enormous*.  It's fast
enough for lookups, but single update actions would potentially
require visiting every node many times, it might work purely in ram,
but it won't work on disk.
Well, that's what I meant by "it's not fast enough to handle deep intersection
of categories". you could just build two sets of pages recursively, and then
intersect them - don't use much space, but it's of course exponential, even
though relatively fast when don im ram. Pre-computed sets would indeed be huge.
...
...
Hm... how fast is the growth of wikipedia in relation to the rate at which
computing power is increasing? My impression is that the latter is coming along
faster, so more complex operations become feasible with time.
Doesn't matter if Wikipedia (thought we were talking about commons)
has small constant growth if tree expansion makes updates to the index
exponential.
I was thinking of mediawiki in general, as applied to wikimedia projects in
general. Wikipedia as the largest and fastest growing being the main concern to
performance. Anyway, all three -- wikipedia, the expanded tree, and computing
power -- grow exponentially, so it's a matter of factors. But you are probably
right in that full tree expansion will not be compensated by more cycles.
...
"One good way" is a common trap for simple minds. Sometimes there
simply isn't one way to rule them all.  ;)
True. On the other hand, a patchword of multiple half-working solutions
generally makes things worse. Different tools for different things. But not too
many (incompatible) ways to do the same thing.
...
The obvious way to wed the systems is to simply place the tags into
the appropriate categories.   The category system then becomes a
manual navigation tool to help people find related tags.   It wouldn't
stop sucking, but the suckage would be less impacting since it would
be abstracted a layer back, and people could work directly with tags.
This would be possible already if we simply treated leaf categories as
tags, but people obsessively convert an image with
[[Category:Bridges]], [[Category:Stone architecture]],
[[Category:Transportation in Scottland]], [[Category:Built in 1702]]
into [[Category:Stone bridges in scottland build in the 1700s and one
used by a transsexual bungee jumper on a Tuesday]] which no one would
every find or think to check. :)
This is something that could be mended at least to some extend by good policy
and education. It works ok on the German Wikipedia (not great, but better than
on commons). This goes hand in hand with user interface features. Currently,
huge categories are useless except for special tools, so they get split up,
often into corss-section-categories. If we had a nice UI for category
intersections (and a clean way to link to them, etc), big categories would be OK
and could be used like tags.
Changing and enhancing the way categories work or are used is what we need.
Adding another system to the mix would be fatal, I think.
-- daniel

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

Re: [Commons-l] Our categories are broken, very broken