Mashiah Davidson wrote:
> Yes, it takes lots of memory, because the MEMORY engine stores varchar data in an inefficient way and spends lots of memory on indexes,
Why do you store varchar data at all? It would be much more efficient to use id-to-id maps, no?
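To illustrate (a minimal sketch in Python, not Golem's actual schema; the names and data are made up): intern every title once, then store the link graph as pairs of small integers, so the edges and their indexes never repeat varchar data.

def intern(title, ids, names):
    """Map a title to a compact integer id, assigning a new one if needed."""
    if title not in ids:
        ids[title] = len(names)
        names.append(title)
    return ids[title]

ids, names = {}, []
raw_links = [("Foo", "Bar"), ("Bar", "Baz"), ("Foo", "Baz")]  # made-up data

# Edges become fixed-size int pairs; each title is stored exactly once.
edges = [(intern(a, ids, names), intern(b, ids, names)) for a, b in raw_links]
print(edges)               # [(0, 1), (1, 2), (0, 2)]
print(names[edges[0][1]])  # "Bar" -- reverse lookup only when producing output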
> but on the other hand the processing takes just 1-2 hours for a wiki as large as ru or de.
On a database server, all available memory is usually reserved for the InnoDB cache, where it greatly benefits query performance. If you want to use large chunks of memory for MEMORY tables, that memory cannot be reserved for InnoDB, so it is unavailable to "normal" database operation. Sure, others could use it for their own MEMORY tables while you are not using it, but I doubt that helps much. Basically, memory available for MEMORY tables is not available for InnoDB.
This is the essential conflict: basically, we would have to reserve 1/8 of all resources for your use (well, for use by MEMORY tables, but I doubt anyone besides you uses big MEMORY tables).
> The estimates I made at the initial stage for an offline implementation showed that the results would be much less up to date; that's why I chose SQL.
I can see that it would be much more effort to implement these things by hand, but I don't see why it would be less efficient.
> My data for dewiki is different. The number of links between articles (excluding disambigs), after redirects are resolved, is around 33 million. The source is here: http://toolserver.org/~mashiah/isolated/de.log. One can find lots of other interesting statistics there.
You are right, I was looking at the wrong numbers.
I have used the trivial edge store for analyzing the category structure before, and Neil Harris is currently working on a nice standalone implementation of this for Wikimedia Germany. This should allow recursive category lookup in microseconds.
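Roughly like this (a sketch under my own assumptions, in Python for brevity; Neil's implementation will surely differ in detail): parent links packed into per-node lists of integer ids, so a recursive lookup is just pointer chasing, with a seen-set to survive cycles.

from collections import defaultdict

def build_store(edges):
    """edges: iterable of (child_id, parent_id) category links."""
    parents = defaultdict(list)
    for child, parent in edges:
        parents[child].append(parent)
    return parents

def ancestors(store, node):
    """All categories reachable upward from node (cycle-safe)."""
    seen, stack = set(), [node]
    while stack:
        for p in store.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

store = build_store([(1, 2), (2, 3), (3, 2)])  # note the 2<->3 cycle
print(ancestors(store, 1))  # {2, 3}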
> I think the category tree analysis (which is also there) takes, worst case, minutes for a relatively large wiki (7 minutes for about 150 small Wikipedias). On output, the category graph is split into strongly connected components. With an offline application, just downloading the data from the database could take longer than Golem's entire processing run.
Speed vs. memory is the usual tradeoff. We have found that Golem uses too much memory, and of course the easy way to solve that problem is to use a slower (offline) approach. I don't see an easy solution for this.
Anyway, my point is not about the category graph as such. I'm just saying that fast and memory-efficient network analysis is possible with this kind of architecture.
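For example, the strongly connected components Mashiah mentions can be found in linear time over such a plain edge store. A sketch (Kosaraju's algorithm, written iteratively so deep category chains don't blow the call stack; the graph at the bottom is made up):

from collections import defaultdict

def sccs(nodes, edges):
    fwd, rev = defaultdict(list), defaultdict(list)
    for a, b in edges:
        fwd[a].append(b)
        rev[b].append(a)

    # Pass 1: record nodes in order of DFS completion on the forward graph.
    order, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(fwd[start]))]
        while stack:
            node, it = stack[-1]
            advanced = False
            for nxt in it:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(fwd[nxt])))
                    advanced = True
                    break
            if not advanced:
                order.append(node)
                stack.pop()

    # Pass 2: DFS on the reversed graph in reverse finish order;
    # each tree found is one strongly connected component.
    comps, assigned = [], set()
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        comp, stack = [], [start]
        while stack:
            node = stack.pop()
            comp.append(node)
            for nxt in rev[node]:
                if nxt not in assigned:
                    assigned.add(nxt)
                    stack.append(nxt)
        comps.append(comp)
    return comps

print(sccs([1, 2, 3, 4], [(1, 2), (2, 3), (3, 1), (3, 4)]))
# [[1, 3, 2], [4]] -- the 1-2-3 cycle collapses into one component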
-- daniel