> * Build a graph of Wikipedia articles in the main namespace, with
> articles as vertexes and wikilinks as edges. Since some pages are not
> reachable from other pages, this is actually N disconnected graphs.
> * Remove all edges which refer to disambiguation pages, date pages,
> year pages, or list pages
> * Remove the graph which contains the main page
> * Produce a list of all remaining graphs.
> Is that roughly correct?
That is a roughly correct description of one of Golem's processing
stages.
Okay, so for this part, my implementation can load all links from
ruwiki, and analyse them into 528 disconnected subgraphs (most of which
contain only a single isolated page) in 85 seconds. In total it uses
about 200MB RAM on the system it runs on, and no MySQL tables.
Does this seem reasonable? The vast majority of the runtime is loading
the data; the actual processing only takes about 10 seconds, so adding
additional analysis should not increase the runtime significantly.
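For reference, the subgraph-identification step amounts to counting
connected components over the page/link data. A minimal sketch in
Python, using union-find with path compression; this is purely
illustrative and not judah's actual implementation (the function and
variable names below are my own, not from the tool):

```python
def find(parent, x):
    # Find the root of x's component, compressing the path as we go.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def count_subgraphs(pages, links):
    """pages: iterable of page ids; links: iterable of (src, dst) pairs.

    Links to pages not in the page set are skipped, mirroring the
    "skipped N links to invalid pages" step in the log above.
    """
    parent = {p: p for p in pages}
    for src, dst in links:
        if src in parent and dst in parent:
            root_a, root_b = find(parent, src), find(parent, dst)
            if root_a != root_b:
                parent[root_a] = root_b  # merge the two components
    # Distinct roots = distinct disconnected subgraphs.
    return len({find(parent, p) for p in parent})

# Two linked pages plus one isolated page -> 2 subgraphs.
print(count_subgraphs([1, 2, 3], [(1, 2)]))  # prints 2
```

With links treated as undirected, this runs in near-linear time in the
number of edges, which is consistent with the component-finding pass
being a small fraction of the total runtime.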
% time ./judah -c defs/ruwiki
NOTE: Loading configuration from defs/ruwiki
NOTE: Using ruwiki_p on ruwiki-p.db.toolserver.org
NOTE: Connected to database.
NOTE: Estimated page count: 921950 (memory = 7.03MB)
Fetching pages: 1167075 (actual memory used = 8.90MB),
skipped 0, list=0, year=0, date=0, disambig=0
NOTE: Estimated link count: 39461689 (memory = 301.07MB)
Fetching links: 24552488 (actual memory used = 187.32MB),
skipped 2512244 links to invalid pages
Finding all subgraphs...
NOTE: Identified 529 distinct subgraphs
./judah -c defs/ruwiki 70.99s user 1.60s system 85% cpu 1:25.12 total