On 11/25/2011 10:20 PM, Lars Aronsson wrote:
This query compiles a list of the websites linked from the main namespace, with a link count for each (so http://www.example.com/page is counted under example.com):
select count(*),
       trim(leading 'www.' from substring_index(substring_index(el_to, '/', 3), '/', -1)) as site
from externallinks, page
where el_from = page_id and page_namespace = 0
group by 2
having count(*) > 10
order by 1;
In the last few days, I've learned many things, such as: 1) using the rrdb servers instead of the userdb servers, 2) using SGE to submit jobs that require a particular database cluster, 3) getting the list of databases in a cluster and looping over them, and 4) including the File, Author, Index, and Portal namespaces.
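In case it helps anyone else, here is roughly what my per-cluster loop looks like, as a Python sketch. The sql-s3-rr host name, the toolserver.wiki metadata table with its server column, and the _p suffix on the replica databases are written from memory, so please check them before relying on this:

    #!/usr/bin/env python
    # Sketch of the per-cluster loop: read the list of databases for one
    # cluster from the toolserver metadata, then run the grouping query
    # against each database in turn. Host name, metadata table and the
    # _p suffix are assumptions, not verified.
    import os
    import MySQLdb

    QUERY = """
        select count(*),
               trim(leading 'www.' from
                    substring_index(substring_index(el_to, '/', 3), '/', -1)) as site
        from externallinks, page
        where el_from = page_id and page_namespace = 0
        group by 2 having count(*) > 10 order by 1
    """

    conn = MySQLdb.connect(host='sql-s3-rr',
                           read_default_file=os.path.expanduser('~/.my.cnf'))
    cur = conn.cursor()

    # Assumed layout of the metadata: one row per wiki with its database
    # name and the number of the cluster it lives on.
    cur.execute("select dbname from toolserver.wiki where server = 3")
    dbnames = [row[0] for row in cur.fetchall()]

    for dbname in dbnames:
        conn.select_db(dbname + '_p')        # _p = public replica views
        cur.execute(QUERY)
        with open(dbname + '-sites.txt', 'w') as out:
            for count, site in cur.fetchall():
                out.write('%d\t%s\n' % (count, site))

The SGE part is then just submitting this script with qsub and the resource request for the cluster in question.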
This is going great. Looping over all 516 databases in the s3 cluster takes 17 minutes, an average of about 2 seconds per query, which is very reasonable. The 17 databases in s2 take 47 minutes, or 151 seconds per query, which is still acceptable.
But enwiki is too large to execute this query within the allowed 30 minutes. Even a simple "select count(*) from externallinks;" timed out on me. For Wikimedia Commons, the query times out when I try to include external links found in the File (6) namespace, but if I only count the article (0) namespace (where there are very few external links), it produces a fine result.
Does the toolserver give me any other options to get statistics on the enwiki database? I could load page.sql and externallinks.sql into my own MySQL instance, now that these dumps are produced on a monthly schedule. Is that my best option? Can I set up my own MySQL for this on the toolserver? Or should I move the project to my own server?
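If I go the local-MySQL route, the loading step would look roughly like the sketch below; the file names assume the usual "latest" symlinks on dumps.wikimedia.org, and the local database name and credentials are placeholders:

    #!/usr/bin/env python
    # Sketch of loading the monthly page and externallinks dumps into a
    # local MySQL instance. The dumps are plain CREATE TABLE + INSERT
    # statements, so piping them through the mysql client is enough.
    # The local database name ('enwiki') is a placeholder.
    import subprocess

    DUMPS = ['enwiki-latest-page.sql.gz',
             'enwiki-latest-externallinks.sql.gz']

    for dump in DUMPS:
        # Equivalent of: zcat dump | mysql enwiki
        zcat = subprocess.Popen(['zcat', dump], stdout=subprocess.PIPE)
        subprocess.check_call(['mysql', 'enwiki'], stdin=zcat.stdout)
        zcat.stdout.close()
        zcat.wait()

After that, the same group-by query could run locally without the 30-minute limit.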
So far, I output the results to text files. Later I might want to store them in a database, but there is no hurry.