Mashiah Davidson wrote:
> Yes, it takes lots of memory, because the MEMORY engine stores
> varchar data inefficiently and spends a lot of memory on indexes,
Why do you store varchar data at all? It would be much more efficient to use
id-to-id maps, no?
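To illustrate what I mean by id-to-id maps, here is a minimal sketch (the titles and link pairs are invented for illustration): each varchar title is interned once into an integer id, and the link graph itself is kept only as pairs of integers.

```python
# Hypothetical link data: (source_title, target_title) pairs, as they
# might come out of a varchar-based link table.
links = [("Alpha", "Beta"), ("Alpha", "Gamma"), ("Beta", "Gamma")]

# Intern each title exactly once; every further reference is an int.
title_to_id = {}

def intern(title):
    # setdefault assigns the next free id on first sight of a title.
    return title_to_id.setdefault(title, len(title_to_id))

# The edge list stores only integer pairs, not repeated varchar data.
edges = [(intern(src), intern(dst)) for src, dst in links]
```

With millions of links, storing two small integers per edge instead of two titles is a large constant-factor saving, and integer keys also index far more compactly.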
> but on the other hand the processing takes just 1-2 hours for a wiki
> the size of ru or de.
On a database server, all available memory is usually reserved for the InnoDB
cache, where it greatly benefits query performance. If you want to use large
chunks of memory for MEMORY tables, that memory cannot be reserved for InnoDB.
So, it is unavailable to "normal" database operation. Sure, others could use it
for their own MEMORY tables while you are not using it, but I doubt that this is
much help. Basically, memory available for MEMORY tables is not available for
InnoDB.
This is the essential conflict: basically, we would have to reserve 1/8 of all
resources for your use (well, for use by MEMORY tables - but I doubt anyone
besides you uses big MEMORY tables).
> My estimates for an offline implementation, made at the initial
> stage, suggested the processing results would be much less up to
> date; that's why I chose SQL.
I can see that it would be much more effort to implement these things by hand,
but I don't see why it would be less efficient.
> My data for dewiki is different. The number of links between
> articles (excluding disambigs), after resolving redirects, is around
> 33 million. The source is here:
> http://toolserver.org/~mashiah/isolated/de.log
> One may find lots of other interesting statistics there.
You are right, I was looking at the wrong numbers.
I have used
the trivial edge store for analyzing the category structure before,
and Neil Harris is currently working on a nice standalone implementation of
this for Wikimedia Germany. This should allow recursive category lookups in
microseconds.
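As a rough sketch of what such an edge store looks like (the ids here are invented for illustration, and this is not Neil's actual implementation): parent edges kept in plain in-memory dicts, with all reachable categories found by an iterative walk.

```python
from collections import defaultdict

# Hypothetical (child_id, parent_id) category edges, stored as plain
# integer pairs - the "trivial edge store".
category_edges = [(1, 10), (10, 20), (2, 10), (20, 30)]

# Adjacency list: child id -> list of direct parent category ids.
parents = defaultdict(list)
for child, parent in category_edges:
    parents[child].append(parent)

def ancestors(page):
    """All categories reachable from `page`, via iterative DFS."""
    seen, stack = set(), list(parents[page])
    while stack:
        cat = stack.pop()
        if cat not in seen:
            seen.add(cat)
            stack.extend(parents[cat])
    return seen
```

Each lookup touches only the edges actually on the ancestor paths, which is why recursive category queries on an in-memory structure like this can be so fast.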
> I think category tree analysis (which is also there) takes - worst
> case - minutes for a relatively large wiki (7 minutes for about 150
> small wikipedias). In the output, the category tree graph is split
> into strongly connected components. With an offline application,
> just downloading the data from the database could take more than
> Golem's processing time.
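For reference, a sketch of how strongly connected components can be computed on such a graph. This is Kosaraju's two-pass algorithm on a made-up toy graph, not Golem's actual code.

```python
from collections import defaultdict

# Hypothetical category-link graph: an edge (u, v) means u links to v.
# Nodes 1, 2, 3 form a cycle; node 4 is only reachable, not cyclic.
edge_list = [(1, 2), (2, 3), (3, 1), (3, 4)]

graph, rgraph, nodes = defaultdict(list), defaultdict(list), set()
for u, v in edge_list:
    graph[u].append(v)
    rgraph[v].append(u)   # reversed graph for the second pass
    nodes.update((u, v))

def kosaraju():
    # Pass 1: iterative DFS on the graph, recording finish order.
    order, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, it = stack[-1]
            for nxt in it:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                order.append(node)   # all children done: postorder
                stack.pop()
    # Pass 2: DFS on the reversed graph in reverse finish order;
    # each tree found is one strongly connected component.
    comps, assigned = [], set()
    for start in reversed(order):
        if start in assigned:
            continue
        comp, stack = set(), [start]
        assigned.add(start)
        while stack:
            node = stack.pop()
            comp.add(node)
            for nxt in rgraph[node]:
                if nxt not in assigned:
                    assigned.add(nxt)
                    stack.append(nxt)
        comps.append(comp)
    return comps
```

Both passes are linear in nodes plus edges, so even tens of millions of links are tractable when the edge lists fit in memory.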
Speed vs. memory is the usual tradeoff. We have found that Golem uses too much
memory, and of course, the easy way to solve the problem is by using a slower
(offline) approach. I don't see an easy solution for this.
Anyway, my point is not about the category graph as such. I'm just saying that
fast and memory-efficient network analysis is possible with this kind of
architecture.
-- daniel