[Toolserver-l] Golem issues
Magnus Manske
magnusmanske at googlemail.com
Wed Mar 31 22:46:49 UTC 2010
On Wed, Mar 31, 2010 at 9:27 PM, Mashiah Davidson
<mashiah.davidson at gmail.com> wrote:
>> * Build a graph of Wikipedia articles in the main namespace, with
>> wikilinks as edges. Since some pages are not reachable from other
>> pages, this is actually N disconnected graphs.
>> * Remove all edges which point to disambiguation pages, date pages,
>> or lists
>> * Remove the graph which contains the main page
>> * Produce a list of all remaining graphs.
>>
>> Is that roughly correct?
>
> It is a roughly correct description of one of Golem's processing stages.
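The stages quoted above (build the link graph, drop edges into excluded pages, then split into disconnected components and discard the one holding the main page) can be sketched with a union-find structure. The page names and the tiny example graph below are invented for illustration; this is not Golem's actual code.

```python
# Sketch of the component-splitting stage: union-find over article links,
# skipping links that touch excluded pages (disambiguations, dates, lists).

def find(parent, x):
    # Path-compressing find: walk to the root, shortening the chain as we go.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def components(pages, links, excluded):
    """Group pages into connected components, ignoring any link whose
    endpoint is in the excluded set."""
    parent = {p: p for p in pages}
    for a, b in links:
        if a in excluded or b in excluded:
            continue
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[ra] = rb  # union the two components
    groups = {}
    for p in pages:
        groups.setdefault(find(parent, p), set()).add(p)
    return list(groups.values())

# Toy graph: "Main" links to "A"; "B" links to "C"; "D" is isolated.
pages = {"Main", "A", "B", "C", "D"}
links = [("Main", "A"), ("B", "C")]
comps = components(pages, links, excluded=set())
# Discard the component containing the main page; report the rest.
orphan_clusters = [c for c in comps if "Main" not in c]
```

On a real wiki-sized graph the pairs would of course come from a table like the one built below, not from an in-memory list.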
Well, I just did the experiment for German Wikipedia, using page_id
pairs in a temporary, non-memory table. In my user database (on the
same server as dewiki_p):
mysql> CREATE TEMPORARY TABLE delinks ( pid1 INTEGER , pid2 INTEGER )
ENGINE=InnoDB ;
mysql> INSERT /* SLOW_OK */ INTO delinks ( pid1, pid2 )
SELECT p1.page_id AS pid1, p2.page_id AS pid2
FROM dewiki_p.page AS p1, dewiki_p.page AS p2, dewiki_p.pagelinks
WHERE pl_title=p2.page_title AND p2.page_namespace=0 AND pl_namespace=0
AND p1.page_id=pl_from AND p1.page_namespace=0 ;
Query OK, 34964160 rows affected (32 min 59.29 sec)
Records: 34964160 Duplicates: 0 Warnings: 0
So, 35 million link pairs between namespace-0 pages, created in 33
minutes (~1 million links per minute). That's not too bad for our #2
Wikipedia, and seems perfectly manageable.
Depending on your usage, now add indices and spices :-)
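For the obvious traversals (follow links out of a page, or find what links in), one index per direction would be the natural start; the index names below are illustrative, not anything Golem uses:

```sql
-- Illustrative indexes for walking the pair table in either direction.
-- Building these on ~35M InnoDB rows will itself take a while.
ALTER TABLE delinks ADD INDEX idx_pid1 (pid1);
ALTER TABLE delinks ADD INDEX idx_pid2 (pid2);
```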
Magnus