I want to find out the length of a bunch of articles. I have done this before for the Swedish Wikipedia by importing the page.sql dump into a local MySQL instance, which works just fine.
But now that I try it for the English Wikipedia, the database import (of 10 million rows, averaging 94 bytes each) appears to take somewhere between 24 and 48 hours (with keys disabled, I'm importing some 4500 rows per minute). This seems a bit unnecessary for just finding out the length of some 1000 articles, especially if I want to do it again when the next dump becomes available. Is there some API on the toolserver that I can use instead? Or should I consider retrieving action=raw from the live server and just counting the bytes? Where do I start?
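Something like this rough, untested sketch is what I have in mind for the action=raw route (it assumes a file of titles, one per line, and it makes one request per article, so it's only reasonable for a small batch):

    #!/usr/bin/perl
    # Untested sketch: fetch the raw wikitext of each title from the live
    # server and count the bytes.  One HTTP request per article, so keep
    # the batch small and pause between requests.
    use strict;
    use warnings;
    use LWP::Simple qw(get);
    use URI::Escape qw(uri_escape);

    while (my $title = <>) {
        chomp $title;
        my $url = 'http://en.wikipedia.org/w/index.php?action=raw&title='
                . uri_escape($title);
        my $text = get($url);
        printf "%s\t%d\n", $title, defined $text ? length($text) : -1;
        sleep 1;    # be nice to the servers
    }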
I could even write a Perl script that parses the INSERT statements in page.sql and extracts the information I need, all in one pass. But that is not really what a MySQL dump is meant for.
Currently, there is a way to get many raw text articles at once through the [[mw:API]]. The revision's size is available in the database, and I plan to expose it shortly through that same API. You can already get the size of a revision if it was recently changed (see the list=recentchanges or list=watchlist queries).
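For example, a request along these lines should fetch the wikitext of several pages in one go (parameter names from memory, so check the auto-generated help at api.php):

    http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&titles=Foo|Bar|Baz&format=xml

Counting the bytes of what comes back will do as a stand-in for the length until the size field itself is exposed.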
Simetrical wrote:
You could ask someone with toolserver access to run the query for you.
My import crashed after 30 hours with a "duplicate key" error. I wrote a Perl script instead, and it parses the entire page.sql in 3 minutes. It's a little tricky with the backslash-escaped quotes, but it works almost every time. Still, this is a return to the 1980s. Isn't it possible to distribute MySQL tables as binary files? Or does PostgreSQL have that feature?
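The core of it looks roughly like this (a simplified sketch rather than the exact script; the column positions are an assumption read off the CREATE TABLE statement at the top of the dump, with page_namespace and page_title near the front and page_len last, so verify them against your copy):

    #!/usr/bin/perl
    # Sketch: pull page_namespace, page_title and page_len straight out of
    # page.sql without importing it into MySQL.
    use strict;
    use warnings;

    while (my $line = <>) {
        next unless $line =~ /^INSERT INTO/;

        # Each row is one parenthesised tuple; quoted strings may contain
        # backslash-escaped characters such as \' and \\ .
        while ($line =~ /\(((?:[^)']|'(?:[^'\\]|\\.)*')*)\)/g) {
            my $tuple = $1;

            # Split the tuple into fields: either a quoted string or a
            # bare numeric value.
            my @fields = $tuple =~ /('(?:[^'\\]|\\.)*'|[^,]+)/g;

            my ($ns, $title, $len) = ($fields[1], $fields[2], $fields[-1]);
            $title =~ s/^'|'$//g;      # strip the surrounding quotes
            $title =~ s/\\(.)/$1/g;    # undo the backslash escaping
            print "$ns\t$title\t$len\n";
        }
    }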
On 8/8/07, Lars Aronsson <lars@aronsson.se> wrote:
Isn't it possible to distribute MySQL tables as binary files?
Probably not, for InnoDB, or at any rate it would be a serious pain in the neck. "Table space" is shared among all tables and all databases on the server, in the same files. You can have per-table table spaces, but not completely: necessary metadata is still stored in the main table space file. Table space kind of sucks in other ways, too, like it generally doesn't shrink even if you drop tables/databases (without doing some extra trickery).
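For what it's worth, the per-table option is innodb_file_per_table in my.cnf; as said, the shared ibdata file still keeps the data dictionary and doesn't shrink on its own:

    [mysqld]
    # one .ibd file per InnoDB table; the shared ibdata file still holds
    # the data dictionary, so this only partly avoids the shared-tablespace mess
    innodb_file_per_table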
And of course the binary files are worthless to you if you're importing into MyISAM, PostgreSQL, Oracle (experimental though support may be), a different version of InnoDB, or any other database you might want to write support for, or for that matter a significantly different version of MediaWiki. That's why XML files are preferred, and failing that SQL files, which will at least work for any supported version of MySQL and any storage engine in it.
So with all that said, it shouldn't be surprising that grabbing a bit of data directly from the files would be a lot faster than actually importing them and then grabbing the data. Not that this is much of an API question anymore.
Simetrical wrote:
And of course the binary files are worthless to you if you're importing into MyISAM, PostgreSQL, Oracle (experimental though support may be), a different version of InnoDB, or any other database you might want to write support for, or for that matter a significantly different version of MediaWiki. That's why XML files are preferred,
Both the current SQL and XML dumps are similar to unpacking a tar (or zip) archive into whatever filesystem your own computer happens to use. What I wonder is whether there is no equivalent to downloading an ISO 9660 CDROM image and mounting that filesystem (perhaps read-only) as it is. Modern Unix dialects can "mount" disk images from files without burning an actual disk. Mounting an existing filesystem is instantaneous, and as you "cd" and "ls" down into its file tree, more and more of its inodes are buffered in the RAM managed by the running kernel.

What I would find useful is "mounting" (rather than "importing") a frozen copy of a database, then being able to "use" this new database, "show tables", "describe page", and "select count(*) from page". Apparently MySQL doesn't support this. Does PostgreSQL or Oracle or any other RDBMS?
One way around this could perhaps be to import the full XML dump into a local MySQL, then shut down MySQL and put the frozen files from /var/lib/mysql/ onto a Ubuntu "live" CDROM or DVD. Just boot up Ubuntu Linux from this disk, and you can immediately search through the full Wikipedia database (as long as it fits on the disk...). I have not tried it, and don't know if it's possible to run MySQL from a live disk with pre-loaded tables. Someone with extra time on their hands can find a new hobby here. If one person does this, everybody else can download and burn the CDROM image and get started much faster than waiting for a 30-hour database import.
Not that this is much of an API question anymore.
Correct. Sorry for going off topic.
On 8/8/07, Lars Aronsson <lars@aronsson.se> wrote:
One way around this could perhaps be to import the full XML dump into a local MySQL, then shut down MySQL and put the frozen files from /var/lib/mysql/ onto a Ubuntu "live" CDROM or DVD. Just boot up Ubuntu Linux from this disk, and you can immediately search through the full Wikipedia database (as long as it fits on the disk...). I have not tried it, and don't know if it's possible to run MySQL from a live disk with pre-loaded tables. Someone with extra time on their hands can find a new hobby here. If one person does this, everybody else can download and burn the CDROM image and get started much faster than waiting for a 30-hour database import.
An interesting idea, but not very useful for people who intend to actually reuse the data, who would then have to dump *and* import it. It would possibly be handy for people who just want to do a statistical analysis or something, but not if you're trying to mirror.
And no, the full Wikipedia database is way, way too large to fit on a DVD. Even the pages-articles one (articles/templates/image descriptions/primary meta-pages) is quite a lot too large, it looks like, at 2.7 GB bz2'd (and that only grows once uncompressed and imported). And that's leaving out images.
On Wed, 2007-08-08 at 19:18 -0700, Simetrical wrote:
Even the pages-articles one (articles/templates/image descriptions/primary meta-pages) is quite a lot too large, it looks like, at 2.7 GB bz2'd (and that only grows once uncompressed and imported). And that's leaving out images.
Speaking of images... I have the circa 2005 image.tar here for the English Wikipedia (about 70gb of images).
I've heard that if it were possible to tar up the latest images, it would be > 300gb. Is this accurate?
Is there a better way to retrieve them without slamming the servers or putting more pressure on brion and friends to tar them up? A torrent file perhaps?
I've got several terabytes of storage and plenty of bandwidth to host them on our public trackers, if such a thing were to even exist.
David A. Desrosiers wrote:
Speaking of images... I have the circa 2005 image.tar here for the English Wikipedia (about 70gb of images).
I've heard that if it were possible to tar up the latest images, it would be > 300gb. Is this accurate?
Yes.
Is there a better way to retrieve them without slamming the servers or putting more pressure on brion and friends to tar them up?
Not really.
A torrent file perhaps?
Nope. (But see the third-party attempts at it.)
-- brion vibber (brion @ wikimedia.org)
David A. Desrosiers wrote:
Is there a better way to retrieve them without slamming the servers or putting more pressure on brion and friends to tar them up? A torrent file perhaps?
I've got several terabytes of storage and plenty of bandwidth to host them on our public trackers, if such a thing were to even exist.
There were some torrents for enwiki, and possibly also for the images, set up by a third party. See the archives.