I suggest you contact someone who will attend the meeting, and discuss the issue with them.
Thank you, I think I've already found such a person.
Anyway, if using MySQL's memory tables consumes too many resources, perhaps consider alternatives? Have you looked at network analysis frameworks like JUNG (Java) or SNAP (C++)? Relational databases are not good at managing linked structures like trees and graphs anyway.
My view of MySQL's capabilities was different. The first consideration is that the task involves memory-intensive computation, i.e. it consumes a lot of data to produce a comparatively small amount of results. Memory operations account for the major part of the overall analysis cost, so it is reasonable to use an engine specifically designed to handle data efficiently; by efficiency here I mean mostly processing speed. Indeed, the idea was to mark isolated articles with templates to make authors aware of the issue. Practice has shown that the templates need to be set based on current data, which means it is not good if the bot runs for many hours. The other lesson from practice is that the templates need to be set nearly daily, otherwise authors lose interest in their creations.
Yes, it takes a lot of memory because the MEMORY engine stores varchar data inefficiently and spends a lot of memory on indexes, but on the other hand the processing takes just 1-2 hours for a wiki the size of ru or de. The estimates I made at the initial stage for an offline implementation suggested the results would be far less up to date, which is why I chose SQL.
The memory requirements shouldn't be that huge anyway: two IDs per edge = 8 bytes. The German-language Wikipedia, for instance, has about 13 million links in the main namespace, so 8*|E| would need only about 100MB even for a naive implementation. With a little more effort, it can be nearly halved to 4*|E|+4*|V|.
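For concreteness, here is a minimal sketch of what a 4*|E|+4*|V| layout could look like: a compressed adjacency array with one 4-byte target per edge plus one 4-byte offset per vertex. The class and method names are purely illustrative, not taken from any existing toolserver code.

/** Minimal sketch of a 4*|E| + 4*|V| edge store (compressed adjacency array). */
public class EdgeStore {
    private final int[] offset;  // offset[v] .. offset[v+1]-1 index into target[], 4*|V| bytes
    private final int[] target;  // concatenated outgoing link targets, 4*|E| bytes

    /** Builds the store from an edge list (srcs[i] -> dsts[i]); page IDs are 0..numVertices-1. */
    public EdgeStore(int numVertices, int[] srcs, int[] dsts) {
        offset = new int[numVertices + 1];
        target = new int[srcs.length];
        for (int s : srcs) offset[s + 1]++;                              // out-degree per vertex
        for (int v = 0; v < numVertices; v++) offset[v + 1] += offset[v]; // prefix sums -> offsets
        int[] next = offset.clone();                                     // temporary write cursors
        for (int i = 0; i < srcs.length; i++) target[next[srcs[i]]++] = dsts[i];
    }

    /** Iterates the outgoing links of page v. */
    public void forEachLink(int v, java.util.function.IntConsumer action) {
        for (int i = offset[v]; i < offset[v + 1]; i++) action.accept(target[i]);
    }
}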
My data for dewiki is different. The number of links between articles (excluding disambiguations) after redirect resolution is around 33 million. The source is here: http://toolserver.org/~mashiah/isolated/de.log. One can find lots of other interesting statistics there.
I have used the trivial edge store for analyzing the category structure before, and Neil Harris is currently working on a nice standalone implementation of this for Wikimedia Germany. This should allow recursive category lookup in microseconds.
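To illustrate why lookups over such an edge store can be that fast (this is not Neil Harris's implementation, just a sketch on top of the hypothetical EdgeStore class above): recursive category lookup reduces to a breadth-first traversal of the category membership edges, where each step is a couple of array reads with no SQL round trips.

import java.util.ArrayDeque;
import java.util.BitSet;

/** Illustrative recursive category lookup: collects every vertex reachable
 *  from a root category by following category -> subcategory/page edges. */
public class CategoryLookup {
    public static BitSet reachableFrom(EdgeStore edges, int numVertices, int root) {
        BitSet seen = new BitSet(numVertices);
        ArrayDeque<Integer> queue = new ArrayDeque<>();
        seen.set(root);
        queue.add(root);
        while (!queue.isEmpty()) {
            int v = queue.poll();
            edges.forEachLink(v, w -> {       // each step: two array reads, no disk or SQL
                if (!seen.get(w)) {
                    seen.set(w);
                    queue.add(w);
                }
            });
        }
        return seen;
    }
}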
I think the category tree analysis (which is also there) takes, at worst, minutes for a relatively large wiki (7 minutes for about 150 small Wikipedias). On output, the category-tree graph is split into strongly connected components. With an offline application, just downloading the data from the database could take longer than Golem's whole processing time.
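Golem does this decomposition in SQL; purely for illustration, a strongly-connected-components pass over the same hypothetical EdgeStore could look like the sketch below (recursive Tarjan for brevity; a run over a full wiki category graph would need an iterative variant to avoid stack overflow).

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

/** Illustrative Tarjan decomposition of the category graph into strongly connected components. */
public class SccDecomposer {
    private final EdgeStore edges;
    private final int[] index, low;
    private final ArrayDeque<Integer> stack = new ArrayDeque<>();
    private final BitSet onStack;
    private final List<List<Integer>> components = new ArrayList<>();
    private int counter = 0;

    public SccDecomposer(EdgeStore edges, int numVertices) {
        this.edges = edges;
        this.index = new int[numVertices];
        this.low = new int[numVertices];
        this.onStack = new BitSet(numVertices);
        Arrays.fill(index, -1);
        for (int v = 0; v < numVertices; v++) {
            if (index[v] == -1) strongConnect(v);
        }
    }

    public List<List<Integer>> components() { return components; }

    private void strongConnect(int v) {
        index[v] = low[v] = counter++;
        stack.push(v);
        onStack.set(v);
        edges.forEachLink(v, w -> {
            if (index[w] == -1) {                      // tree edge: recurse
                strongConnect(w);
                low[v] = Math.min(low[v], low[w]);
            } else if (onStack.get(w)) {               // back edge inside the current component
                low[v] = Math.min(low[v], index[w]);
            }
        });
        if (low[v] == index[v]) {                      // v is the root of a component: pop it
            List<Integer> scc = new ArrayList<>();
            int w;
            do {
                w = stack.pop();
                onStack.clear(w);
                scc.add(w);
            } while (w != v);
            components.add(scc);
        }
    }
}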
mashiah