On Wed, Mar 31, 2010 at 9:27 PM, Mashiah Davidson
<mashiah.davidson(a)gmail.com> wrote:
* Build a graph of Wikipedia articles in the main namespace, with
articles as vertices and wikilinks as edges. Since some pages are not
reachable from other pages, this is actually N disconnected graphs.
* Remove all edges which refer to disambiguation pages, date pages, or
lists
* Remove the graph which contains the main page
* Produce a list of all remaining graphs.
Is that roughly correct?
That is a roughly correct description of one of Golem's processing stages.
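For illustration, the "N disconnected graphs" step above amounts to finding connected components over (from, to) page-id pairs. Here is a minimal sketch using union-find on made-up page ids; this is not Golem's actual implementation, just the idea:

```python
# Sketch: split a link graph into connected components with union-find
# (toy page ids, undirected view of the links; not Golem's real code).

def find(parent, x):
    # Path-compressing find: walk up to the root, shortening the chain.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def components(edges):
    parent = {}
    for a, b in edges:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[ra] = rb  # union the two components
    groups = {}
    for node in parent:
        groups.setdefault(find(parent, node), set()).add(node)
    return list(groups.values())

# Two disconnected clusters of pages: {1, 2, 3} and {10, 11}.
edges = [(1, 2), (2, 3), (10, 11)]
print(sorted(map(sorted, components(edges))))  # [[1, 2, 3], [10, 11]]
```

After removing the edges to disambiguation, date, and list pages, the component containing the main page would be dropped and the rest listed.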
Well, I just did the experiment for German Wikipedia, using page_id
pairs in a temporary, non-memory table. In my user database (on the
same server as dewiki_p):
mysql> create temporary table delinks ( pid1 INTEGER , pid2 INTEGER )
ENGINE=InnoDB ;
mysql> INSERT /* SLOW_OK */ INTO delinks ( pid1,pid2 ) select
p1.page_id AS pid1,p2.page_id AS pid2 from dewiki_p.page AS
p1,dewiki_p.page AS p2,dewiki_p.pagelinks WHERE pl_title=p2.page_title
and p2.page_namespace=0 and pl_namespace=0 and p1.page_id=pl_from and
p1.page_namespace=0 ;
Query OK, 34964160 rows affected (32 min 59.29 sec)
Records: 34964160 Duplicates: 0 Warnings: 0
So, 35 million link pairs between namespace-0 pages, created in 33
minutes (~1 million links per minute). That's not too bad for our #2
Wikipedia, and seems perfectly manageable.
Depending on your usage, now add indices and spices :-)
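For example, which indices to add depends on the lookup direction you need. A small self-contained sketch (SQLite in place of MySQL, and a toy stand-in for the delinks table, but the idea carries over):

```python
import sqlite3

# Toy stand-in for the delinks pair table; SQLite rather than MySQL,
# purely so the example runs anywhere.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE delinks (pid1 INTEGER, pid2 INTEGER)")
con.executemany("INSERT INTO delinks VALUES (?, ?)",
                [(1, 2), (1, 3), (2, 3)])

# "Add indices": one per lookup direction actually used, e.g. outgoing
# links (pid1 -> pid2) and incoming links (pid2 -> pid1).
con.execute("CREATE INDEX idx_pid1 ON delinks (pid1)")
con.execute("CREATE INDEX idx_pid2 ON delinks (pid2)")

# Outgoing links of page 1: now an index lookup instead of a full scan.
rows = con.execute("SELECT pid2 FROM delinks WHERE pid1 = 1").fetchall()
print(rows)  # [(2,), (3,)]
```

On a 35-million-row table, building each index is a one-off cost paid up front so the per-page traversal queries stay cheap.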
Magnus