Mashiah Davidson wrote:
> Yes, it takes lots of memory, because the MEMORY engine stores
> varchar data inefficiently and spends a lot of memory on indexes,
Why do you store varchar data at all? It would be much more efficient to use
id-to-id maps, no?
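To illustrate what I mean by id-to-id maps, here is a minimal sketch (the titles and link pairs are invented for illustration): each varchar title is interned once into an integer id, and the link graph itself is kept only as pairs of integers.

```python
# Hypothetical link data: (source_title, target_title) pairs, as they
# might come out of a varchar-based link table.
links = [("Alpha", "Beta"), ("Alpha", "Gamma"), ("Beta", "Gamma")]

# Intern each title exactly once; every further reference is an int.
title_to_id = {}

def intern(title):
    # setdefault assigns the next free id on first sight of a title.
    return title_to_id.setdefault(title, len(title_to_id))

# The edge list stores only integer pairs, not repeated varchar data.
edges = [(intern(src), intern(dst)) for src, dst in links]
```

With millions of links, storing two small integers per edge instead of two titles is a large constant-factor saving, and integer keys also index far more compactly.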
> but on the other hand the processing takes just 1-2 hours for a wiki
> the size of ru or de.
On a database server, all available memory is usually reserved for the InnoDB
cache, where it greatly benefits query performance. If you want to use large
chunks of memory for MEMORY tables, that memory cannot be reserved for InnoDB.
So, it is unavailable to "normal" database operation. Sure, others could use it
for their own MEMORY tables while you are not using it, but I doubt that this is
much help. Basically, memory available for MEMORY tables is not available for
InnoDB.
This is the essential conflict: basically, we would have to reserve 1/8 of all
resources for your use (well, for use by MEMORY tables - but I doubt anyone
besides you uses big MEMORY tables).
> My estimates for an offline implementation, made at the initial
> stage, suggested the processing results would be much less up to
> date; that's why I chose SQL.
I can see that it would be much more effort to implement these things by hand,
but I don't see why it would be less efficient.
> My data for dewiki is different. The number of links between
> articles (excluding disambigs), after resolving redirects, is around
> 33 million. The source is here:
> http://toolserver.org/~mashiah/isolated/de.log
> One may find lots of other interesting statistics there.
You are right, I was looking at the wrong numbers.
I have used
the trivial edge store for analyzing the category structure before,
and Neil Harris is currently working on a nice standalone implementation of
this for Wikimedia Germany. This should allow recursive category lookups in
microseconds.
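As a rough sketch of what such an edge store looks like (the ids here are invented for illustration, and this is not Neil's actual implementation): parent edges kept in plain in-memory dicts, with all reachable categories found by an iterative walk.

```python
from collections import defaultdict

# Hypothetical (child_id, parent_id) category edges, stored as plain
# integer pairs - the "trivial edge store".
category_edges = [(1, 10), (10, 20), (2, 10), (20, 30)]

# Adjacency list: child id -> list of direct parent category ids.
parents = defaultdict(list)
for child, parent in category_edges:
    parents[child].append(parent)

def ancestors(page):
    """All categories reachable from `page`, via iterative DFS."""
    seen, stack = set(), list(parents[page])
    while stack:
        cat = stack.pop()
        if cat not in seen:
            seen.add(cat)
            stack.extend(parents[cat])
    return seen
```

Each lookup touches only the edges actually on the ancestor paths, which is why recursive category queries on an in-memory structure like this can be so fast.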
> I think category tree analysis (which is also there) takes - worst
> case - minutes for a relatively large wiki (7 minutes for about 150
> small wikipedias). In the output, the category tree graph is split
> into strongly connected components. With an offline application,
> just downloading the data from the database could take more than
> Golem's processing time.
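For reference, a sketch of how strongly connected components can be computed on such a graph. This is Kosaraju's two-pass algorithm on a made-up toy graph, not Golem's actual code.

```python
from collections import defaultdict

# Hypothetical category-link graph: an edge (u, v) means u links to v.
# Nodes 1, 2, 3 form a cycle; node 4 is only reachable, not cyclic.
edge_list = [(1, 2), (2, 3), (3, 1), (3, 4)]

graph, rgraph, nodes = defaultdict(list), defaultdict(list), set()
for u, v in edge_list:
    graph[u].append(v)
    rgraph[v].append(u)   # reversed graph for the second pass
    nodes.update((u, v))

def kosaraju():
    # Pass 1: iterative DFS on the graph, recording finish order.
    order, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, it = stack[-1]
            for nxt in it:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                order.append(node)   # all children done: postorder
                stack.pop()
    # Pass 2: DFS on the reversed graph in reverse finish order;
    # each tree found is one strongly connected component.
    comps, assigned = [], set()
    for start in reversed(order):
        if start in assigned:
            continue
        comp, stack = set(), [start]
        assigned.add(start)
        while stack:
            node = stack.pop()
            comp.add(node)
            for nxt in rgraph[node]:
                if nxt not in assigned:
                    assigned.add(nxt)
                    stack.append(nxt)
        comps.append(comp)
    return comps
```

Both passes are linear in nodes plus edges, so even tens of millions of links are tractable when the edge lists fit in memory.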
Speed vs. memory is the usual tradeoff. We have found that Golem uses too much
memory, and of course, the easy way to solve the problem is by using a slower
(offline) approach. I don't see an easy solution for this.
Anyway, my point is not about the category graph as such. I'm just saying that
fast and memory-efficient network analysis is possible with this kind of
architecture.
-- daniel