Mashiah Davidson:
- Build a graph of Wikipedia articles in the main namespace, with articles as vertices and wikilinks as edges. Since some pages are not reachable from other pages, this is actually N disconnected graphs.
- Remove all edges which refer to disambiguation pages, date pages, or lists
- Remove the graph which contains the main page
- Produce a list of all remaining graphs.
Is that roughly correct?
It is a roughly correct description of one of Golem's processing stages.
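(For concreteness, here is roughly what I mean by those stages, as a simplified Python sketch. It is illustrative only, not the actual judah or Golem code, and every name in it is made up.)

    from collections import defaultdict, deque

    def remaining_subgraphs(pages, links, is_excluded, main_page):
        # pages: titles of main-namespace pages
        # links: (source, target) wikilink pairs
        # is_excluded(page): True for disambiguation, date and list pages
        # main_page: title of the main page
        adj = defaultdict(set)
        for src, dst in links:
            if is_excluded(src) or is_excluded(dst):
                continue                      # drop edges touching excluded pages
            adj[src].add(dst)
            adj[dst].add(src)                 # treat links as undirected for connectivity

        seen, components = set(), []
        for start in pages:
            if start in seen or is_excluded(start):
                continue
            comp, queue = set(), deque([start])
            seen.add(start)
            while queue:                      # breadth-first walk of one component
                page = queue.popleft()
                comp.add(page)
                for nxt in adj[page]:
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append(nxt)
            components.append(comp)

        # drop the component containing the main page, keep the rest
        return [c for c in components if main_page not in c]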
Okay, so for this part, my implementation can load all links from ruwiki and partition them into 528 disconnected subgraphs (most of which contain only a single isolated page) in 85 seconds. In total it uses about 200MB of RAM on the system it runs on, and no MySQL tables.
Does this seem reasonable? The vast majority of the runtime is loading the data; the actual processing only takes about 10 seconds, so adding additional analysis should not increase the runtime significantly.
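To give an idea of why the processing step is so cheap once everything is in memory: identifying the subgraphs is essentially a union-find pass over the link table. The sketch below is again illustrative Python rather than the actual implementation, assuming pages have already been mapped to integer ids:

    from array import array

    def count_subgraphs(page_count, links):
        # page_count: number of valid pages, with ids 0 .. page_count-1
        # links: (src_id, dst_id) pairs for the already-filtered edges
        parent = array('i', range(page_count))   # one int per page

        def find(x):
            while parent[x] != x:                # path halving keeps the trees shallow
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        for src, dst in links:
            root_s, root_d = find(src), find(dst)
            if root_s != root_d:
                parent[root_s] = root_d          # merge the two components

        # every page which is its own root heads one distinct subgraph
        return sum(1 for i in range(page_count) if find(i) == i)

With only a few bytes per page, the structure stays small even for the ~1.17 million ruwiki pages, so the work is dominated by the single pass over the 24.5 million links; that matches the observation above that the analysis itself is only a small fraction of the runtime.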
- river.
% time ./judah -c defs/ruwiki
NOTE: Loading configuration from defs/ruwiki
NOTE: Using ruwiki_p on ruwiki-p.db.toolserver.org
NOTE: Connected to database.
Running...
NOTE: Estimated page count: 921950 (memory = 7.03MB)
Fetching pages: 1167075 (actual memory used = 8.90MB), skipped 0, list=0, year=0, date=0, disambig=0
Sorting pages...
NOTE: Estimated link count: 39461689 (memory = 301.07MB)
Fetching links: 24552488 (actual memory used = 187.32MB), skipped 2512244 links to invalid pages
Finding all subgraphs...
NOTE: Identified 529 distinct subgraphs
./judah -c defs/ruwiki  70.99s user 1.60s system 85% cpu 1:25.12 total