[Toolserver-l] Golem issues

Magnus Manske magnusmanske at googlemail.com
Wed Mar 31 22:46:49 UTC 2010


On Wed, Mar 31, 2010 at 9:27 PM, Mashiah Davidson
<mashiah.davidson at gmail.com> wrote:
>> * Build a graph of Wikipedia articles in the main namespace, with
>>  wikilinks as edges.  Since some pages are not reachable from other
>>  pages, this is actually N disconnected graphs.
>> * Remove all edges which refer to disambiguation pages, date pages, or
>>  lists
>> * Remove the graph which contains the main page
>> * Produce a list of all remaining graphs.
>>
>> Is that roughly correct?
>
> It is a roughly correct description of one of Golem's processing stages.
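
For the record, that stage can be sketched with a plain union-find over
the link pairs. Everything below (the file name, the excluded-page
helper, the main page id) is a placeholder, not Golem's actual code:

from collections import defaultdict

def find(parent, x):
    # Union-find "find" with path compression.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def excluded_page_ids():
    # Placeholder: page_ids of disambiguation pages, date pages, lists.
    return set()

MAIN_PAGE_ID = 1  # placeholder

excluded = excluded_page_ids()
parent = {}
for line in open("delinks.txt"):  # assumed export: one "pid1 pid2" per line
    a, b = map(int, line.split())
    if a in excluded or b in excluded:
        continue  # drop edges touching disambig/date/list pages
    parent.setdefault(a, a)
    parent.setdefault(b, b)
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[ra] = rb  # merge the two components

# Group pages by component root, then drop the main page's component.
components = defaultdict(set)
for page in parent:
    components[find(parent, page)].add(page)
if MAIN_PAGE_ID in parent:
    del components[find(parent, MAIN_PAGE_ID)]
# "components" now maps one root page_id to each remaining graph.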

Well, I just did the experiment for German Wikipedia, using page_id
pairs in a temporary, non-memory table. In my user database (on the
same server as dewiki_p):

mysql> CREATE TEMPORARY TABLE delinks ( pid1 INTEGER , pid2 INTEGER )
ENGINE=InnoDB ;

mysql> INSERT /* SLOW_OK */ INTO delinks ( pid1, pid2 )
SELECT p1.page_id AS pid1, p2.page_id AS pid2
FROM dewiki_p.page AS p1, dewiki_p.page AS p2, dewiki_p.pagelinks
WHERE pl_title = p2.page_title AND p2.page_namespace = 0
AND pl_namespace = 0 AND p1.page_id = pl_from
AND p1.page_namespace = 0 ;

Query OK, 34964160 rows affected (32 min 59.29 sec)
Records: 34964160  Duplicates: 0  Warnings: 0

So, 35 million link pairs between namespace-0 pages, created in 33
minutes (~1 million links per minute). That's not too bad for our #2
Wikipedia, and seems perfectly manageable.

Depending on your usage, now add indices and spices :-)
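
For example, if you mostly look up pairs by source page (the index
choice is a guess, of course; pick whatever your queries need):

mysql> ALTER TABLE delinks ADD INDEX ( pid1 ) , ADD INDEX ( pid2 ) ;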

Magnus


