Mashiah Davidson:
- Build a graph of Wikipedia articles in the main namespace, with articles as vertices and wikilinks as edges. Since some pages are not reachable from other pages, this is actually N disconnected graphs.
- Remove all edges which refer to disambiguation pages, date pages, or lists
- Remove the graph which contains the main page
- Produce a list of all remaining graphs.
Is that roughly correct?
It is a roughly correct description of one of Golem's processing stages.
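(For concreteness, here is roughly what I mean by those stages, as a simplified Python sketch. It is illustrative only, not the actual judah or Golem code, and every name in it is made up.)

    from collections import defaultdict, deque

    def remaining_subgraphs(pages, links, is_excluded, main_page):
        # pages: titles of main-namespace pages
        # links: (source, target) wikilink pairs
        # is_excluded(page): True for disambiguation, date and list pages
        # main_page: title of the main page
        adj = defaultdict(set)
        for src, dst in links:
            if is_excluded(src) or is_excluded(dst):
                continue                      # drop edges touching excluded pages
            adj[src].add(dst)
            adj[dst].add(src)                 # treat links as undirected for connectivity

        seen, components = set(), []
        for start in pages:
            if start in seen or is_excluded(start):
                continue
            comp, queue = set(), deque([start])
            seen.add(start)
            while queue:                      # breadth-first walk of one component
                page = queue.popleft()
                comp.add(page)
                for nxt in adj[page]:
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append(nxt)
            components.append(comp)

        # drop the component containing the main page, keep the rest
        return [c for c in components if main_page not in c]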
Okay, so for this part, my implementation can load all links from ruwiki and partition them into 528 disconnected subgraphs (most of which contain only a single isolated page) in 85 seconds. In total it uses about 200MB of RAM on the system it runs on, and no MySQL tables.
Does this seem reasonable? The vast majority of the runtime is loading the data; the actual processing only takes about 10 seconds, so adding additional analysis should not increase the runtime significantly.
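To give an idea of why the processing step is so cheap once everything is in memory: identifying the subgraphs is essentially a union-find pass over the link table. The sketch below is again illustrative Python rather than the actual implementation, assuming pages have already been mapped to integer ids:

    from array import array

    def count_subgraphs(page_count, links):
        # page_count: number of valid pages, with ids 0 .. page_count-1
        # links: (src_id, dst_id) pairs for the already-filtered edges
        parent = array('i', range(page_count))   # one int per page

        def find(x):
            while parent[x] != x:                # path halving keeps the trees shallow
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        for src, dst in links:
            root_s, root_d = find(src), find(dst)
            if root_s != root_d:
                parent[root_s] = root_d          # merge the two components

        # every page which is its own root heads one distinct subgraph
        return sum(1 for i in range(page_count) if find(i) == i)

With only a few bytes per page, the structure stays small even for the ~1.17 million ruwiki pages, so the work is dominated by the single pass over the 24.5 million links; that matches the observation above that the analysis itself is only a small fraction of the runtime.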
- river.
% time ./judah -c defs/ruwiki
NOTE: Loading configuration from defs/ruwiki
NOTE: Using ruwiki_p on ruwiki-p.db.toolserver.org
NOTE: Connected to database.
Running...
NOTE: Estimated page count: 921950 (memory = 7.03MB)
Fetching pages: 1167075 (actual memory used = 8.90MB), skipped 0, list=0, year=0, date=0, disambig=0
Sorting pages...
NOTE: Estimated link count: 39461689 (memory = 301.07MB)
Fetching links: 24552488 (actual memory used = 187.32MB), skipped 2512244 links to invalid pages
Finding all subgraphs...
NOTE: Identified 529 distinct subgraphs
./judah -c defs/ruwiki  70.99s user 1.60s system 85% cpu 1:25.12 total