[Toolserver-l] Tracking external link statistics over time

Jan Luca jan at jans-seite.de
Tue Nov 29 21:11:01 UTC 2011


Have you read the part in the wiki about "SLOW_OK" [1]?

[1]
https://wiki.toolserver.org/view/Databases#Slow_queries_and_the_query_killer
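For reference, the marker is an ordinary SQL comment placed inside the query text, so it changes nothing about the result; the query killer simply looks for it. A minimal sketch (using SQLite here only as a stand-in server, to show the comment is plain SQL):

```python
import sqlite3

# Tiny stand-in table; the real externallinks table lives on the Toolserver.
conn = sqlite3.connect(":memory:")
conn.execute("create table externallinks (el_to text)")
conn.execute("insert into externallinks values ('http://example.com/')")

# /* SLOW_OK */ is just a comment: the server ignores it, but the
# Toolserver's query killer spared queries whose text contained it.
(n,) = conn.execute(
    "select /* SLOW_OK */ count(*) from externallinks").fetchone()
print(n)  # 1
```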


Best regards,
Jan

-----Original Message-----
From: toolserver-l-bounces at lists.wikimedia.org
[mailto:toolserver-l-bounces at lists.wikimedia.org] On behalf of Lars
Aronsson
Sent: Tuesday, November 29, 2011 18:49
To: toolserver-l at lists.wikimedia.org
Subject: Re: [Toolserver-l] Tracking external link statistics over time

On 11/25/2011 10:20 PM, Lars Aronsson wrote:
> This query compiles a list of external links (from the main namespace)
> grouped by website:
>
> select count(*),
>        trim(leading 'www.' from
>             substring_index(substring_index(el_to, '/', 3), '/', -1)) as site
>   from externallinks, page
>   where el_from = page_id and page_namespace = 0
>   group by 2 having count(*) > 10 order by 1;

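To unpack the nested substring_index calls in the quoted query: the inner call keeps everything before the third '/', the outer call (with a negative count) keeps everything after the last remaining '/', and trim strips a leading 'www.'. The same logic, mirrored in Python (the function name is mine):

```python
def site_of(el_to):
    """Extract the website from an external-link URL, mirroring the SQL."""
    parts = el_to.split('/')
    head = '/'.join(parts[:3])        # substring_index(el_to, '/', 3)
    host = head.rsplit('/', 1)[-1]    # substring_index(head, '/', -1)
    if host.startswith('www.'):       # trim(leading 'www.' from host)
        host = host[4:]
    return host

print(site_of("http://www.example.com/page"))  # example.com
```

Note that URLs with no path ("http://www.foo.se") still work, because substring_index returns the whole string when there are fewer delimiters than requested, and the slice above does the same.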
In the last few days, I've learned many things, such as
1) using the rrdb instead of userdb servers,
2) using SGE to submit jobs that require a particular database cluster,
3) getting a list of databases in a cluster, looping over them,
4) including the File, Author, Index, and Portal namespaces.

This is going great. Looping over all of the 516 databases in the s3 cluster
takes 17 minutes, or an average of 2 seconds per query. This is very
reasonable. The 17 databases in s2 take 47 minutes, an average of about 166
seconds per query, which is okay.

But enwiki is too large to execute this query within the allowed 30 minutes.
Even a simple "select count(*) from externallinks;"
timed out on me. For Wikimedia Commons, the query times out when I try to
include external links found in the File (6) namespace, but if I only count
the article (0) namespace (where there are very few external links), it
produces a fine result.

Does the toolserver give me any other options to get statistics on the
enwiki database? I could load page.sql and externallinks.sql into my own
MySQL instance, now that these dumps are produced on a monthly schedule. Is
that my best option? Can I set up my own MySQL for this on the toolserver?
Or should I move the project to my own server?
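One workaround worth mentioning (a sketch of a general technique, not Toolserver-specific advice): split the scan into primary-key ranges, so that no single query runs long enough to be killed, and sum the partial results. Demonstrated here against an in-memory SQLite stand-in for the externallinks schema; on the real servers each range query would go to MySQL instead:

```python
import sqlite3

# In-memory stand-in for the real externallinks table (schema simplified).
conn = sqlite3.connect(":memory:")
conn.execute("create table externallinks (el_from integer, el_to text)")
conn.executemany("insert into externallinks values (?, ?)",
                 [(i, "http://www.example.com/%d" % i) for i in range(1, 101)])

def chunked_count(conn, step=25):
    """Count rows in el_from ranges so no single query scans everything."""
    (lo,) = conn.execute("select min(el_from) from externallinks").fetchone()
    (hi,) = conn.execute("select max(el_from) from externallinks").fetchone()
    total = 0
    while lo <= hi:
        (n,) = conn.execute(
            "select count(*) from externallinks"
            " where el_from >= ? and el_from < ?", (lo, lo + step)).fetchone()
        total += n
        lo += step
    return total

print(chunked_count(conn))  # 100
```

Each range query is cheap if el_from is indexed, and the step size can be tuned so every piece finishes well under the 30-minute limit.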

So far, I output the results to text files. Later I might want to store them
in a database, but there is no hurry.


-- 
   Lars Aronsson (lars at aronsson.se)
   Aronsson Datateknik - http://aronsson.se



_______________________________________________
Toolserver-l mailing list (Toolserver-l at lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/toolserver-l
Posting guidelines for this list:
https://wiki.toolserver.org/view/Mailing_list_etiquette



