On Sat, Sep 20, 2008 at 3:20 AM, Daniel Kinzler <daniel(a)brightbyte.de> wrote:
True, and I have not come up with a good way either,
but perhaps we should look
some more into what SMW does - it does support this kind of thing, right? How
does it scale? And wasn't Magnus working on some intersection thingy?
People keep waving hands on this. It's time to "put up or shut up",
it's counterproductive that our less technical community thinks some
technical bullet is forth-coming. At least suggest a data-structure
which can provide for reasonably fast worst case updates (i.e. nothing
worse than we have with changing templates today, O(N) on direct
users) and lightning fast intersections, if you can't or no one else
can, we need to just consider it impossible until someone does.
We can get roughly constant time lookup for intersections (well
really, linear with returned results, but thats fine) but I am not
aware of any way to accommodate this for tree data without making the
worst cast *update* hideously expensive like O(N*Avg_width^avg_depth)
random seeks.
Actually, keeping in RAM the structure of, say, a
million categories, where each
is itself in three categories, means three million pairs of IDs, each four byte
wide, that's 24MB. Not so terrible. I have actually done that to run a cycle
detection on the category graphs, it's quite fast (with java, anyway). But it's
not fast enough to handle deep intersection of categories for dozents if not
hundreds of requests per second, as would have to be expected for wikipedia.
Thats cycle detection. Yes, you can do single point graph expansion
cheaply enough, but you can't do graph expansion AND intersection
cheaply because that expanding the entire tree, not just some subset.
Imagine fully expanding the entire hierarchy breadth first (stopping
at previously visited nodes to cut cycles), then storing it all in an
inverted index posting list. The result is *enormous*. It's fast
enough for lookups, but single update actions would potentially
require visiting every node many times, it might work purely in ram,
but it won't work on disk.
[snip]
Hm... how fast is the growth of wikipedia in relation
to the rate at which
computing power is increasing? My impression is that the latter is coming along
faster, so more complex operations become feasible with time.
Doesn't matter if Wikipedia (thought we were talking about commons)
has small constant growth if tree expansion makes updates to the index
exponential.
Regardless of
the non-technical arguments against categories the above
should be sufficient reason for us to adopt a tagging system. Just so
we can build really good search tools. It could run in parallel with
the category system, as the category system isn't completely worthless
and it will take a long time to get tags applied to everything.
I think that would be terrible. We would have two messes in stead of one. Two
systems that do kind of the say, but don't interact in a meaningful way. Endless
arguments, more of the gallery vs. category stuff. Ugh. No. Let's fine *one*
good way.
"One good way" is a common trap for simple minds. Sometimes there
simply isn't one way to rule them all. ;)
The obvious way to wed the systems is to simply place the tags into
the appropriate categories. The category system then becomes a
manual navigation tool to help people find related tags. It wouldn't
stop sucking, but the suckage would be less impacting since it would
be abstracted a layer back, and people could work directly with tags.
This would be possible already if we simply treated leaf categories as
tags, but people obsessively convert an image with
[[Category:Bridges]], [[Category:Stone architecture]],
[[Category:Transportation in Scottland]], [[Category:Built in 1702]]
into [[Category:Stone bridges in scottland build in the 1700s and one
used by a transsexual bungee jumper on a Tuesday]] which no one would
every find or think to check. :)