I want to find out the length of a bunch of articles. I have
previously done this for the Swedish Wikipedia by importing the
page.sql dump into a local MySQL instance, which works just fine.
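For reference, that approach boils down to a query against the
page table, whose page_len column stores each page's length in
bytes. A minimal sketch (database name and credentials are
placeholders):

    #!/usr/bin/perl
    # Look up article lengths in a locally imported page table.
    # Titles are expected in database form, with underscores.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('DBI:mysql:database=svwiki',  # placeholder
                           'user', 'password', { RaiseError => 1 });
    my $sth = $dbh->prepare('SELECT page_len FROM page
                             WHERE page_namespace = 0
                               AND page_title = ?');
    foreach my $title (@ARGV) {
        $sth->execute($title);
        my ($len) = $sth->fetchrow_array;
        print "$title\t", (defined $len ? $len : 'missing'), "\n";
        $sth->finish;
    }
    $dbh->disconnect;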
But now that I try it for the English Wikipedia, the database
import (of 10 million rows, averaging 94 bytes) appears to take
somewhere between 24 and 48 hours: with keys disabled, I'm
importing some 4,500 rows per minute, and 10 million rows at that
rate works out to roughly 37 hours. This seems a bit excessive
for just finding out the length of some 1,000 articles.
Especially if I want to do it again when the next dump becomes
available. Is there some API on the toolserver that I can use
instead? Or should I consider fetching each article with action=raw
from the live server and just counting the bytes? Where do I start?
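If I go the action=raw route, I imagine something like the
following (an untested sketch; the user agent string and the
one-second pause are just my guesses at polite crawling):

    #!/usr/bin/perl
    # Fetch each article's raw wikitext from the live server and
    # count its bytes, which is what page_len records.
    use strict;
    use warnings;
    use LWP::UserAgent;
    use URI::Escape qw(uri_escape_utf8);

    my $ua = LWP::UserAgent->new(agent => 'lengthcheck/0.1');
    foreach my $title (@ARGV) {
        my $url = 'http://en.wikipedia.org/w/index.php'
                . '?action=raw&title=' . uri_escape_utf8($title);
        my $res = $ua->get($url);
        if ($res->is_success) {
            # content() returns undecoded bytes, so length() is a
            # byte count, not a character count.
            printf "%s\t%d\n", $title, length($res->content);
        } else {
            warn "$title: ", $res->status_line, "\n";
        }
        sleep 1;    # be gentle with the live server
    }

If api.php's prop=info reports the page length directly (I believe
it does), batching titles through action=query would avoid
transferring the full text of every article.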
I could even write a Perl script that parses the INSERT statements
in page.sql and extracts the information I need, all in one pass
(see the sketch below). But that is not really what a MySQL dump
is meant for.
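A rough, untested sketch of that one-pass parse, assuming the
page table layout where page_len is the last field of each row
tuple (the regex would need adjusting if the schema differs):

    #!/usr/bin/perl
    # Stream page.sql and print namespace, title and page_len for
    # every row, without importing anything into MySQL.
    # Usage: perl pagelen.pl < page.sql
    use strict;
    use warnings;

    while (my $line = <>) {
        next unless $line =~ /^INSERT INTO `page` VALUES /;
        # Each row tuple starts (page_id,page_namespace,'title',...
        # and ends with ...,page_len). The title pattern allows
        # backslash-escaped characters inside the quoted string.
        while ($line =~ /\((\d+),(\d+),'((?:[^'\\]|\\.)*)',.*?,(\d+)\)/g) {
            my ($id, $ns, $title, $len) = ($1, $2, $3, $4);
            print "$ns\t$title\t$len\n";
        }
    }

That should run over the whole dump in minutes rather than days,
and it works unchanged on the next dump.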
--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik -
http://aronsson.se