[Toolserver-l] Server switch for s3/s4/s6, Monday morning UTC

Mashiah Davidson mashiah.davidson at gmail.com
Tue Mar 30 20:47:55 UTC 2010


> I suggest you
> contact someone who will attend the meeting, and discuss the issue with them.

Thank you, I think I've already found such a person.

> Anyway, if using MySQL's memory tables consumes too many resources, perhaps
> consider alternatives? Have you looked at network analysis frameworks like JUNG
> (Java) or SNAP (C++)? Relational databases are not good at managing linked
> structures like trees and graphs anyway.

My view of MySQL's capabilities was different. The first point is that
the task involves memory-intensive computation, i.e. it reads a lot of
data to produce a comparatively small amount of results. Memory
operations make up the bulk of the overall analysis cost, which is why
it is reasonable to use an engine specifically designed to handle data
efficiently. By efficiency here I mean mostly processing speed.
Indeed, the idea was to mark isolated articles with templates to make
authors aware of the issue. Practice has shown that the templates need
to be set from up-to-date data, which means it is not good if the bot
runs for many hours. The other lesson from practice is that the
templates need to be refreshed nearly daily, otherwise authors lose
interest in their creations.

Yes, it takes a lot of memory, because the MEMORY engine stores
varchar data inefficiently and spends a lot of memory on indexes, but
on the other hand the processing takes just 1-2 hours for a wiki the
size of ru or de. The estimates I made at the initial stage for an
offline implementation gave much worse figures for how up to date the
results would be, which is why I chose SQL.

> The memory requirements shouldn't be that huge anyway: two IDs per edge = 8
> byte. The German language Wikipedia for instance has about 13 million links in
> the main namespace, 8*|E| would need about 1GB even for a naive implementation.
> With a little more effort, it can be nearly halved to 4*|E|+4*|V|.

My data for dewiki is different. The number of links between articles
(excluding disambiguations), after redirects are resolved, is around
33 million. The source is here:
http://toolserver.org/~mashiah/isolated/de.log. One can find plenty of
other interesting statistics there.
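
At 8 bytes per edge, 33 million edges already come to roughly 260 MB
for the raw ID pairs alone, before any indexes. For a rough sense of
scale, the number of main-namespace links on the toolserver replica
can be approximated with a query along these lines (a sketch only: it
drops redirects on both ends but does not exclude disambiguations or
follow redirects through to their targets the way Golem does):

    SELECT COUNT(*)
    FROM pagelinks pl
    JOIN page src ON src.page_id = pl.pl_from
    JOIN page dst ON dst.page_namespace = pl.pl_namespace
                 AND dst.page_title = pl.pl_title
    WHERE src.page_namespace = 0 AND src.page_is_redirect = 0
      AND dst.page_namespace = 0 AND dst.page_is_redirect = 0;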

> I have used the trivial edge store for analyzing the category structure before,
> and Neil Harris is currently working on a nice standalone implementation of
> this for Wikimedia Germany. This should allow recursive category lookup in
> microseconds.

I think the category tree analysis (which is also part of the tool)
takes, in the worst case, minutes for a relatively large wiki (7
minutes for about 150 small Wikipedias). As output, the category-tree
graph is split into strongly connected components. With an offline
application, just downloading the data from the database could take
longer than Golem's whole processing time.
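
To give an idea of what that download amounts to, the raw
category-to-category edges can be pulled from the replica with
something along these lines (again only a sketch; it returns one row
per parent-child pair in the category namespace and says nothing about
the strongly-connected-components computation itself):

    SELECT cl.cl_from AS child_cat_id, parent.page_id AS parent_cat_id
    FROM categorylinks cl
    JOIN page child  ON child.page_id = cl.cl_from
                    AND child.page_namespace = 14
    JOIN page parent ON parent.page_namespace = 14
                    AND parent.page_title = cl.cl_to;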

mashiah


