Robert Rohde:
Getting back to Wikimedia, it appears correct that the Wikistats code
is designed to run from the compressed files ... (source linked from [1]).
As you suggest, one could use the properties of the .bz2 format to
parallelize that. I would also observe that parsers tend to be
relatively slow, while decompressors tend to be relatively fast.
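Purely as an illustration of that .bz2 property (independent streams in a
multistream file can be decompressed separately), a minimal Python sketch
could look like the one below. The stream offsets are assumed to come from
a multistream index file or a prior scan; this is not actual Wikistats code.

    import bz2
    import os
    from multiprocessing import Pool

    def decompress_stream(args):
        # Read one complete bz2 stream from the file and decompress it.
        path, start, end = args
        with open(path, "rb") as f:
            f.seek(start)
            return bz2.decompress(f.read(end - start))

    def parallel_decompress(path, stream_offsets):
        # stream_offsets: sorted byte positions where each bz2 stream
        # begins (assumed input, e.g. from a multistream index file).
        ends = stream_offsets[1:] + [os.path.getsize(path)]
        jobs = [(path, s, e) for s, e in zip(stream_offsets, ends)]
        with Pool() as pool:
            # Streams are independent, so they decompress on all cores;
            # results come back in file order.
            yield from pool.imap(decompress_stream, jobs)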
Some additional notes:
Yes, Wikistats processes compressed dumps.
Nowadays these are mostly stub dumps.
Most monthly metrics can be collected from these, with a few exceptions
like word count.
For stub dumps, decompression is the major resource hog;
for full dumps, some heavy regexps also contribute considerably.
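One way to at least overlap those two costs (a sketch only, not how
Wikistats is actually structured): decompress in a child process and do
the pattern matching in the parent, so the two run on separate cores.
The file name below is hypothetical; for gzip-compressed stubs, zcat
would take the place of bzcat.

    import re
    import subprocess

    REVISION_RE = re.compile(rb"<revision>")

    def count_revisions(stub_dump_path):
        # Decompress in a separate process (requires bzcat on the PATH),
        # while the regexp matching happens in this process.
        proc = subprocess.Popen(["bzcat", stub_dump_path],
                                stdout=subprocess.PIPE)
        count = 0
        for line in proc.stdout:
            if REVISION_RE.search(line):
                count += 1
        proc.wait()
        return count

    # e.g. count_revisions("somewiki-latest-stub-meta-history.xml.bz2")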
Wikistats could benefit a lot from parallelization (although these days
dump production for larger wikis is generally the bottleneck).
The first thing I would want to look into (some day) is running the count
scripts for several wikis in parallel (see the sketch below).
All intermediate data are stored in CSV files, often one file for one
metric for all languages,
so decoupling the per-wiki runs and aggregating the results as a
post-processing step is simple.
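A rough sketch of that idea (Python rather than the Perl the real scripts
use; the script name, wiki list and file names are made up): run the
per-wiki counts in separate processes, then merge the per-wiki CSVs into
one file per metric afterwards.

    import csv
    import subprocess
    from concurrent.futures import ProcessPoolExecutor

    WIKIS = ["enwiki", "dewiki", "frwiki", "jawiki"]  # hypothetical subset

    def run_counts(wiki):
        # Placeholder invocation; script name and arguments are made up.
        subprocess.run(["perl", "count_script.pl", wiki], check=True)
        return wiki, f"counts_{wiki}.csv"             # assumed output file

    def aggregate(results, out_path):
        # Post-processing: merge per-wiki CSVs into one file for one
        # metric, tagging each row with the wiki it came from.
        with open(out_path, "w", newline="") as out:
            writer = csv.writer(out)
            for wiki, path in results:
                with open(path, newline="") as f:
                    for row in csv.reader(f):
                        writer.writerow([wiki] + row)

    if __name__ == "__main__":
        with ProcessPoolExecutor(max_workers=4) as pool:
            results = list(pool.map(run_counts, WIKIS))
        aggregate(results, "metric_all_wikis.csv")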
Running several count threads on one machine might tax memory.
Some hashes are huge (much has been externalized, but e.g. edits per
user per namespace is still kept as an in-memory hash).
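As a toy illustration of what such a hash holds (Python sketch, not the
actual Perl structure): one counter per (user, namespace) pair. For a
large wiki that means millions of keys, and running several wikis in one
process multiplies that footprint.

    from collections import defaultdict

    # One counter per (user, namespace) pair.
    edits_per_user_ns = defaultdict(int)

    def tally(user, namespace):
        edits_per_user_ns[(user, namespace)] += 1

    tally("ExampleUser", 0)     # article namespace
    tally("ExampleUser", 1)     # talk namespace
    tally("203.0.113.5", 0)     # anonymous editor, article namespace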
The basic structure dates from the time when a full archive dump for the
English Wikipedia was processed in minutes rather than months.
There have been a lot of optimizations, but the general setup is still
like this:
Every month, all counts for the past 10 years are reproduced from scratch.
Wikistats basically has no memory of earlier runs.
This probably sounds crazy; incremental processing has been suggested
more than once.
The main reason to keep it this way is: every so often new functionality
is added to the scripts (and the occasional bug fix).
In order to have those new counts for the full history, we would need to
rerun from scratch every so often anyway.
People have asked me how the counts can change from month to month.
Same answer: counts are redone for all months, and newer dumps will have
more deletions for earlier months.
This mostly affects the last two months, though: nearly all deletions
occur within a month or two.
In the early years deletions were very rare; most were done to prevent
court orders (privacy).
Nowadays deletionism has taken hold.
Still, Wikistats treats deleted content as 'should not have been there in
the first place'.
This makes our editor counts somewhat conservative; basically it skews
the activity patterns in favor of good content contributors.
Erik Zachte