On Sat, Sep 20, 2008 at 3:20 AM, Daniel Kinzler <daniel@brightbyte.de> wrote:
True, and I have not come up with a good way either, but perhaps we should look some more into what SMW does - it does support this kind of thing, right? How does it scale? And wasn't Magnus working on some intersection thingy?
People keep waving their hands on this. It's time to "put up or shut up"; it's counterproductive for our less technical community to believe some technical magic bullet is forthcoming. At least suggest a data structure which can provide reasonably fast worst-case updates (i.e. nothing worse than we have with changing templates today, O(N) on direct users) and lightning-fast intersections. If you can't, and no one else can, we need to just consider it impossible until someone does.
We can get roughly constant-time lookup for intersections (well, really linear in the number of returned results, but that's fine), but I am not aware of any way to accommodate this for tree data without making the worst-case *update* hideously expensive, something like O(N * avg_width^avg_depth) random seeks.
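To be concrete about the cheap part: with one sorted posting list per tag, an AND query is just a merge walk over the lists involved, so the cost is bounded by the lists touched and the hits returned. A rough sketch in Java, all names invented:

    import java.util.*;

    // Minimal sketch of flat tag intersection over sorted posting lists.
    // All names here are invented for illustration, not MediaWiki code.
    class TagIndex {
        // tag -> sorted array of page IDs carrying that tag
        private final Map<String, int[]> postings = new HashMap<>();

        void put(String tag, int[] sortedPageIds) {
            postings.put(tag, sortedPageIds);
        }

        // Two-pointer merge of two sorted lists: O(|a| + |b|).
        private static int[] intersect(int[] a, int[] b) {
            int[] out = new int[Math.min(a.length, b.length)];
            int i = 0, j = 0, n = 0;
            while (i < a.length && j < b.length) {
                if (a[i] < b[j]) i++;
                else if (a[i] > b[j]) j++;
                else { out[n++] = a[i]; i++; j++; }
            }
            return Arrays.copyOf(out, n);
        }

        // AND-query over any number of tags; an unknown tag yields no hits.
        int[] query(String... tags) {
            int[] result = null;
            for (String tag : tags) {
                int[] list = postings.getOrDefault(tag, new int[0]);
                result = (result == null) ? list : intersect(result, list);
                if (result.length == 0) break; // early out
            }
            return result == null ? new int[0] : result;
        }

        public static void main(String[] args) {
            TagIndex idx = new TagIndex();
            idx.put("bridges", new int[]{1, 4, 9, 12});
            idx.put("stone architecture", new int[]{4, 7, 12});
            System.out.println(Arrays.toString(idx.query("bridges", "stone architecture")));
            // prints [4, 12]
        }
    }

Adding or removing one tag on one page touches a single posting list; the trouble only starts when each "tag" has to stand in for a whole subtree of categories.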
Actually, keeping in RAM the structure of, say, a million categories, where each is itself in three categories, means three million pairs of IDs; at four bytes per ID that's 24MB. Not so terrible. I have actually done that, to run cycle detection on the category graph, and it's quite fast (with Java, anyway). But it's not fast enough to handle deep intersection of categories for dozens if not hundreds of requests per second, as would have to be expected for Wikipedia.
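For the record, the arithmetic: 1,000,000 categories x 3 parent links each = 3,000,000 (child, parent) pairs, and at 4 bytes per ID that's 3,000,000 x 8 bytes = 24MB. The check itself is just a colouring DFS over those pairs, roughly like this (a toy sketch with made-up names, not the actual tool I ran):

    import java.util.*;

    // Toy sketch: the category graph held in RAM as child -> parent links,
    // checked for cycles with a colouring DFS. Names and data are made up.
    class CategoryGraphCheck {
        public static void main(String[] args) {
            // child category ID -> parent category IDs (here 3 -> 1 closes a cycle)
            Map<Integer, List<Integer>> parents = new HashMap<>();
            parents.put(1, List.of(2, 3));
            parents.put(2, List.of(4));
            parents.put(3, List.of(1));
            System.out.println("cycle found: " + hasCycle(parents)); // true
        }

        static boolean hasCycle(Map<Integer, List<Integer>> parents) {
            Set<Integer> done = new HashSet<>();    // fully explored nodes
            Set<Integer> onPath = new HashSet<>();  // nodes on the current DFS path
            for (Integer start : parents.keySet()) {
                if (!done.contains(start) && dfs(start, parents, done, onPath)) return true;
            }
            return false;
        }

        static boolean dfs(int node, Map<Integer, List<Integer>> parents,
                           Set<Integer> done, Set<Integer> onPath) {
            onPath.add(node);
            for (int p : parents.getOrDefault(node, List.of())) {
                if (onPath.contains(p)) return true; // back-edge, i.e. a cycle
                if (!done.contains(p) && dfs(p, parents, done, onPath)) return true;
            }
            onPath.remove(node);
            done.add(node);
            return false;
        }
    }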
That's cycle detection. Yes, you can do single-point graph expansion cheaply enough, but you can't do graph expansion AND intersection cheaply, because that means expanding the entire tree, not just some subset. Imagine fully expanding the entire hierarchy breadth-first (stopping at previously visited nodes to cut cycles), then storing it all in inverted-index posting lists. The result is *enormous*. It's fast enough for lookups, but a single update could require visiting every node many times; it might work purely in RAM, but it won't work on disk.
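To make the blow-up concrete: in the denormalised index every page gets a posting under every ancestor category, so one categorylinks edit near the top of the tree means rewriting the postings of everything beneath it. Sketch only, names invented:

    import java.util.*;

    // Sketch of a fully denormalised "category closure" index (invented names).
    // Every page is posted under every ancestor category; that is what makes
    // intersection lookups cheap and tree updates so expensive.
    class ClosureIndex {
        // category -> parent categories (toy data)
        static Map<String, List<String>> catParents = Map.of(
            "Stone bridges", List.of("Bridges", "Stone architecture"),
            "Bridges", List.of("Transport"),
            "Stone architecture", List.of("Architecture"));

        // inverted index: category -> pages posted under it, directly or via ancestors
        static Map<String, Set<String>> index = new HashMap<>();

        public static void main(String[] args) {
            addPage("Old_Bridge.jpg", List.of("Stone bridges"));
            System.out.println(index);
            // the single page shows up under Stone bridges, Bridges,
            // Stone architecture, Transport and Architecture
        }

        // Post a page under its direct categories and all their ancestors,
        // breadth first, stopping at already-visited categories to cut cycles.
        static void addPage(String page, List<String> directCats) {
            Deque<String> queue = new ArrayDeque<>(directCats);
            Set<String> seen = new HashSet<>(directCats);
            while (!queue.isEmpty()) {
                String cat = queue.poll();
                index.computeIfAbsent(cat, c -> new HashSet<>()).add(page);
                for (String parent : catParents.getOrDefault(cat, List.of())) {
                    if (seen.add(parent)) queue.add(parent);
                }
            }
        }
        // The painful part: re-parenting one category near the top means
        // re-running addPage() for every page anywhere below it.
    }

One test page and a five-category toy tree already show the fan-out; on disk, with millions of files under a re-parented category, that single edit turns into millions of index writes.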
[snip]
Hm... how fast is the growth of Wikipedia relative to the rate at which computing power is increasing? My impression is that the latter is growing faster, so more complex operations become feasible over time.
It doesn't matter whether Wikipedia (though I thought we were talking about Commons) has small constant growth if tree expansion makes updates to the index exponential.
Regardless of the non-technical arguments against categories, the above should be sufficient reason for us to adopt a tagging system, just so we can build really good search tools. It could run in parallel with the category system, since the category system isn't completely worthless and it will take a long time to get tags applied to everything.
I think that would be terrible. We would have two messes instead of one: two systems that do kind of the same thing but don't interact in any meaningful way. Endless arguments, more of the gallery vs. category stuff. Ugh. No. Let's find *one* good way.
"One good way" is a common trap for simple minds. Sometimes there simply isn't one way to rule them all. ;)
The obvious way to wed the two systems is to simply place the tags into the appropriate categories. The category system then becomes a manual navigation tool to help people find related tags. It wouldn't stop sucking, but the suckage would have less impact, since it would be abstracted a layer back and people could work directly with tags.
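Something like this, as a sketch (all names invented): files carry flat tags, tags live in categories, browsing walks the category graph, and search only ever intersects flat tag sets.

    import java.util.*;

    // Sketch of that split, all names invented: categories organise *tags*,
    // files carry only flat tags, and search never walks the category graph.
    class TagCatalog {
        Map<String, Set<String>> tagsInCategory = new HashMap<>(); // category -> tags
        Map<String, Set<String>> filesWithTag = new HashMap<>();   // tag -> files

        void tagFile(String file, String tag) {
            filesWithTag.computeIfAbsent(tag, t -> new HashSet<>()).add(file);
        }

        void putTagInCategory(String tag, String category) {
            tagsInCategory.computeIfAbsent(category, c -> new HashSet<>()).add(tag);
        }

        // Browsing: a category page is just a human-curated list of related tags.
        Set<String> relatedTags(String category) {
            return tagsInCategory.getOrDefault(category, Set.of());
        }

        // Searching: intersect flat per-tag file sets, like the posting lists above.
        Set<String> search(String... tags) {
            Set<String> result = null;
            for (String tag : tags) {
                Set<String> files = filesWithTag.getOrDefault(tag, Set.of());
                if (result == null) result = new HashSet<>(files);
                else result.retainAll(files);
            }
            return result == null ? Set.of() : result;
        }

        public static void main(String[] args) {
            TagCatalog c = new TagCatalog();
            c.putTagInCategory("bridges", "Architecture");
            c.tagFile("Old_Bridge.jpg", "bridges");
            c.tagFile("Old_Bridge.jpg", "stone");
            System.out.println(c.search("bridges", "stone")); // [Old_Bridge.jpg]
        }
    }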
This would already be possible if we simply treated leaf categories as tags, but people obsessively convert an image with [[Category:Bridges]], [[Category:Stone architecture]], [[Category:Transportation in Scotland]], [[Category:Built in 1702]] into [[Category:Stone bridges in Scotland built in the 1700s and once used by a transsexual bungee jumper on a Tuesday]], which no one would ever find or think to check. :)