[Toolserver-l] Golem issues

Mashiah Davidson mashiah.davidson at gmail.com
Thu Apr 1 20:39:20 UTC 2010


> Okay, so for this part, my implementation can load all links from
> ruwiki, and analyse them into 528 disconnected subgraphs (most of which
> contain only a single isolated page) in 85 seconds.  In total it uses
> about 200MB RAM on the system it runs on, and no MySQL tables.

As far as I understand, there are many more than 528 disconnected
subgraphs in ruwiki.
The number can be estimated using this page:
http://ru.wikipedia.org/wiki/%D0%92%D0%B8%D0%BA%D0%B8%D0%BF%D0%B5%D0%B4%D0%B8%D1%8F:%D0%9F%D1%80%D0%BE%D0%B5%D0%BA%D1%82:%D0%A1%D0%B2%D1%8F%D0%B7%D0%BD%D0%BE%D1%81%D1%82%D1%8C/bytypes

Each orphan belongs to a distinct disconnected subgraph, and there are
18592 orphans there.
Let's forget for a while about longer chains like _1_1 (an orphan linking
to another article) and just look for a lower bound on the number of
subgraphs.

There are also 930 articles in isolated pairs (_2), which adds 930/2 = 465
subgraphs to our lower bound, etc. In total there should be no fewer
than 19 000.

Similar data for dewiki can be seen from
http://toolserver.org/~mashiah/isolated/de.log:

21567 orphans + 954/2 pairs + etc. gives us no fewer than 22 000
distinct subgraphs.
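
The lower-bound arithmetic can be sketched in a few lines of Python. The
function name is made up for illustration; the figures are the ones quoted
in this message, and only orphans and isolated pairs are counted, so the
real totals should be higher.

```python
def subgraph_lower_bound(orphans, articles_in_pairs):
    """Lower bound: one subgraph per orphan, one per isolated pair.

    articles_in_pairs counts articles, not pairs, so it is halved.
    Longer chains and larger clusters are ignored here.
    """
    return orphans + articles_in_pairs // 2


# Figures quoted above for ruwiki and dewiki.
ruwiki = subgraph_lower_bound(18592, 930)
dewiki = subgraph_lower_bound(21567, 954)
print(ruwiki, dewiki)  # prints: 19057 22044
```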

At the moment I doubt that the difference is explained by different
rules for which links are taken into account, because the gap between
the results looks too large.
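
To illustrate how link rules can shift the count at all, here is a toy
sketch. It assumes subgraphs are counted as undirected connected
components via union-find, which may not match either implementation, and
the link kinds "article" and "template" are hypothetical labels.

```python
def count_components(pages, links):
    """Count connected components, treating links as undirected."""
    parent = {p: p for p in pages}

    def find(p):
        # Walk up to the root, halving the path as we go.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for src, dst in links:
        if src in parent and dst in parent:
            parent[find(src)] = find(dst)

    return len({find(p) for p in pages})


# Toy data: four pages, two links of different (hypothetical) kinds.
pages = ["A", "B", "C", "D"]
all_links = [("A", "B", "article"), ("C", "D", "template")]

count_all = count_components(pages, [(s, d) for s, d, _ in all_links])
count_filtered = count_components(
    pages, [(s, d) for s, d, kind in all_links if kind == "article"]
)
print(count_all, count_filtered)  # prints: 2 3
```

Dropping one kind of link changes the count by one here; a jump from 528
to about 19 000 would need a very large share of links to be treated
differently.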

On the other hand, if the problem can indeed be resolved in such a
small amount of time, it seems great.

mashiah


