On Wed, Mar 31, 2010 at 9:27 PM, Mashiah Davidson
<mashiah.davidson(a)gmail.com> wrote:
* Build a graph of Wikipedia articles in the main namespace, with
articles as vertices and wikilinks as edges. Since some pages are not
reachable from other pages, this is actually N disconnected graphs.
* Remove all edges which refer to disambiguation pages, date pages, or
lists
* Remove the graph which contains the main page
* Produce a list of all remaining graphs.
Is that roughly correct?
That is a roughly correct description of one of Golem's processing stages.
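For illustration, the "N disconnected graphs" step above amounts to finding connected components over (from, to) page-id pairs. Here is a minimal sketch using union-find on made-up page ids; this is not Golem's actual implementation, just the idea:

```python
# Sketch: split a link graph into connected components with union-find
# (toy page ids, undirected view of the links; not Golem's real code).

def find(parent, x):
    # Path-compressing find: walk up to the root, shortening the chain.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def components(edges):
    parent = {}
    for a, b in edges:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[ra] = rb  # union the two components
    groups = {}
    for node in parent:
        groups.setdefault(find(parent, node), set()).add(node)
    return list(groups.values())

# Two disconnected clusters of pages: {1, 2, 3} and {10, 11}.
edges = [(1, 2), (2, 3), (10, 11)]
print(sorted(map(sorted, components(edges))))  # [[1, 2, 3], [10, 11]]
```

After removing the edges to disambiguation, date, and list pages, the component containing the main page would be dropped and the rest listed.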
Well, I just did the experiment for German Wikipedia, using page_id
pairs in a temporary, non-memory table. In my user database (on the
same server as dewiki_p):
mysql> create temporary table delinks ( pid1 INTEGER , pid2 INTEGER )
ENGINE=InnoDB ;
mysql> INSERT /* SLOW_OK */ INTO delinks ( pid1,pid2 ) select
p1.page_id AS pid1,p2.page_id AS pid2 from dewiki_p.page AS
p1,dewiki_p.page AS p2,dewiki_p.pagelinks WHERE pl_title=p2.page_title
and p2.page_namespace=0 and pl_namespace=0 and p1.page_id=pl_from and
p1.page_namespace=0 ;
Query OK, 34964160 rows affected (32 min 59.29 sec)
Records: 34964160 Duplicates: 0 Warnings: 0
So, 35 million link pairs between namespace-0 pages, created in 33
minutes (~1 million links per minute). That's not too bad for our #2
Wikipedia, and seems perfectly manageable.
Depending on your usage, now add indices and spices :-)
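For example, which indices to add depends on the lookup direction you need. A small self-contained sketch (SQLite in place of MySQL, and a toy stand-in for the delinks table, but the idea carries over):

```python
import sqlite3

# Toy stand-in for the delinks pair table; SQLite rather than MySQL,
# purely so the example runs anywhere.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE delinks (pid1 INTEGER, pid2 INTEGER)")
con.executemany("INSERT INTO delinks VALUES (?, ?)",
                [(1, 2), (1, 3), (2, 3)])

# "Add indices": one per lookup direction actually used, e.g. outgoing
# links (pid1 -> pid2) and incoming links (pid2 -> pid1).
con.execute("CREATE INDEX idx_pid1 ON delinks (pid1)")
con.execute("CREATE INDEX idx_pid2 ON delinks (pid2)")

# Outgoing links of page 1: now an index lookup instead of a full scan.
rows = con.execute("SELECT pid2 FROM delinks WHERE pid1 = 1").fetchall()
print(rows)  # [(2,), (3,)]
```

On a 35-million-row table, building each index is a one-off cost paid up front so the per-page traversal queries stay cheap.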
Magnus