Gregory Maxwell schrieb:
People keep waving hands on this. It's time to "put up or shut up", it's counterproductive that our less technical community thinks some technical bullet is forth-coming. At least suggest a data-structure which can provide for reasonably fast worst case updates (i.e. nothing worse than we have with changing templates today, O(N) on direct users) and lightning fast intersections, if you can't or no one else can, we need to just consider it impossible until someone does.
We can get roughly constant time lookup for intersections (well really, linear with returned results, but thats fine) but I am not aware of any way to accommodate this for tree data without making the worst cast *update* hideously expensive like O(N*Avg_width^avg_depth) random seeks.
I don't think there's a magic bullet, but non the less, when thinking about what is possible and what isn't, we should look at the techniques that have been tried so far, and get to know their strenths and limitations. And while I have thought about the problem quite a bit, I have not yet looked at what the two systems in question actually do.
Actually, keeping in RAM the structure of, say, a million categories, where each is itself in three categories, means three million pairs of IDs, each four byte wide, that's 24MB. Not so terrible. I have actually done that to run a cycle detection on the category graphs, it's quite fast (with java, anyway). But it's not fast enough to handle deep intersection of categories for dozents if not hundreds of requests per second, as would have to be expected for wikipedia.
Thats cycle detection. Yes, you can do single point graph expansion cheaply enough, but you can't do graph expansion AND intersection cheaply because that expanding the entire tree, not just some subset. Imagine fully expanding the entire hierarchy breadth first (stopping at previously visited nodes to cut cycles), then storing it all in an inverted index posting list. The result is *enormous*. It's fast enough for lookups, but single update actions would potentially require visiting every node many times, it might work purely in ram, but it won't work on disk.
Well, that's what I meant by "it's not fast enough to handle deep intersection of categories". you could just build two sets of pages recursively, and then intersect them - don't use much space, but it's of course exponential, even though relatively fast when don im ram. Pre-computed sets would indeed be huge.
Hm... how fast is the growth of wikipedia in relation to the rate at which computing power is increasing? My impression is that the latter is coming along faster, so more complex operations become feasible with time.
Doesn't matter if Wikipedia (thought we were talking about commons) has small constant growth if tree expansion makes updates to the index exponential.
I was thinking of mediawiki in general, as applied to wikimedia projects in general. Wikipedia as the largest and fastest growing being the main concern to performance. Anyway, all three -- wikipedia, the expanded tree, and computing power -- grow exponentially, so it's a matter of factors. But you are probably right in that full tree expansion will not be compensated by more cycles.
"One good way" is a common trap for simple minds. Sometimes there simply isn't one way to rule them all. ;)
True. On the other hand, a patchword of multiple half-working solutions generally makes things worse. Different tools for different things. But not too many (incompatible) ways to do the same thing.
The obvious way to wed the systems is to simply place the tags into the appropriate categories. The category system then becomes a manual navigation tool to help people find related tags. It wouldn't stop sucking, but the suckage would be less impacting since it would be abstracted a layer back, and people could work directly with tags.
This would be possible already if we simply treated leaf categories as tags, but people obsessively convert an image with [[Category:Bridges]], [[Category:Stone architecture]], [[Category:Transportation in Scottland]], [[Category:Built in 1702]] into [[Category:Stone bridges in scottland build in the 1700s and one used by a transsexual bungee jumper on a Tuesday]] which no one would every find or think to check. :)
This is something that could be mended at least to some extend by good policy and education. It works ok on the German Wikipedia (not great, but better than on commons). This goes hand in hand with user interface features. Currently, huge categories are useless except for special tools, so they get split up, often into corss-section-categories. If we had a nice UI for category intersections (and a clean way to link to them, etc), big categories would be OK and could be used like tags.
Changing and enhancing the way categories work or are used is what we need. Adding another system to the mix would be fatal, I think.
-- daniel