[Toolserver-l] Golem issues

River Tarnell river.tarnell at wikimedia.de
Thu Apr 1 00:26:07 UTC 2010


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Mashiah Davidson:
> > * Build a graph of Wikipedia articles in the main namespace, with
> >  wikilinks as edges.  Since some pages are not reachable from other
> >  pages, this is actually N disconnected graphs.
> > * Remove all edges which refer to disambiguation pages, date pages, or
> >  lists
> > * Remove the graph which contains the main page
> > * Produce a list of all remaining graphs.

> > Is that roughly correct?
 
> It is a roughly correct description of one of Golem's processing stages.

Okay, so for this part, my implementation can load all links from
ruwiki, and analyse them into 528 disconnected subgraphs (most of which
contain only a single isolated page) in 85 seconds.  In total it uses
about 200MB RAM on the system it runs on, and no MySQL tables.  
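
As a rough cross-check of the 200MB figure: the memory numbers in the run
output below all match 8 bytes per stored record, counting 1MB as 2^20
bytes.  The 8-byte record size is inferred from the log, not something the
tool states; the names below are illustrative only.

```python
# Assumption: each page or link record occupies 8 bytes (for example two
# 32-bit ids). This is inferred from the log figures, not documented.
BYTES_PER_RECORD = 8
MIB = 1024 * 1024


def mib(records, bytes_per_record=BYTES_PER_RECORD):
    """Return the memory needed for `records` fixed-size records, in MiB."""
    return records * bytes_per_record / MIB


# (record count, memory reported in the log) taken from the run output.
log_figures = {
    "estimated pages": (921950, 7.03),
    "fetched pages": (1167075, 8.90),
    "estimated links": (39461689, 301.07),
    "fetched links": (24552488, 187.32),
}

for label, (count, reported) in log_figures.items():
    print(f"{label}: {mib(count):.2f}MB (log reports {reported:.2f}MB)")

# Pages plus links actually fetched, which is where "about 200MB" comes from.
total = mib(1167075) + mib(24552488)
print(f"pages + links: {total:.2f}MB")
```

Under that assumption the fetched pages and links together come to a little
under 200MB, which agrees with the total quoted above.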

Does this seem reasonable?  The vast majority of the runtime is loading
the data; the actual processing only takes about 10 seconds, so adding
additional analysis should not increase the runtime significantly.
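
The steps quoted at the top can be sketched with a union-find structure.
This is a minimal illustration, not judah's actual code: the function name,
its parameters, and the choice to treat links as undirected are all
assumptions made for the example.

```python
from collections import defaultdict


def find_subgraphs(pages, links, excluded=frozenset(), main_page=None):
    """Group pages into connected subgraphs.

    pages     -- iterable of page ids (the vertices)
    links     -- iterable of (source, target) pairs (the edges)
    excluded  -- pages such as disambiguation, date or list pages;
                 edges pointing at them are dropped
    main_page -- if given, the subgraph containing it is removed

    Assumption: links are treated as undirected for connectivity.
    """
    pages = list(pages)
    parent = {p: p for p in pages}

    def find(x):
        # Walk to the root, halving the path on the way up.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for src, dst in links:
        # Skip links whose endpoints are not known pages.
        if src not in parent or dst not in parent:
            continue
        # Drop edges that refer to an excluded page.
        if dst in excluded:
            continue
        root_a, root_b = find(src), find(dst)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for p in pages:
        groups[find(p)].add(p)

    subgraphs = list(groups.values())
    if main_page is not None:
        subgraphs = [g for g in subgraphs if main_page not in g]
    return subgraphs


example = find_subgraphs(
    pages=["Main", "A", "B", "C", "D", "Dis"],
    links=[("Main", "A"), ("B", "C"), ("C", "Dis"), ("D", "Dis"), ("X", "A")],
    excluded={"Dis"},
    main_page="Main",
)
```

Each link is handled once in near-constant time, which fits the observation
that the analysis is cheap compared with loading the data.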

	- river.

% time ./judah -c defs/ruwiki
NOTE: Loading configuration from defs/ruwiki
NOTE: Using ruwiki_p on ruwiki-p.db.toolserver.org
NOTE: Connected to database.

Running...

NOTE: Estimated page count: 921950 (memory = 7.03MB)
Fetching pages: 1167075 (actual memory used = 8.90MB),
                skipped 0, list=0, year=0, date=0, disambig=0
Sorting pages...

NOTE: Estimated link count: 39461689 (memory = 301.07MB)
Fetching links: 24552488 (actual memory used = 187.32MB),
                skipped 2512244 links to invalid pages

Finding all subgraphs...
NOTE: Identified 529 distinct subgraphs
./judah -c defs/ruwiki  70.99s user 1.60s system 85% cpu 1:25.12 total
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (HP-UX)

iEYEARECAAYFAkuz6B8ACgkQIXd7fCuc5vLb8QCfZt5wRlSiVmaScHztlGPt+ez2
1WwAoKuxsQ11BMCCnLsSK5MNveYH77Xo
=fJUt
-----END PGP SIGNATURE-----


