I've been an enthusiastic downloader and user of Domas' wikistats or page counter logfiles since they first appeared on December 9, http://lists.wikimedia.org/pipermail/wikitech-l/2007-December/035435.html
One problem, however, is that they are plain text files that need to be loaded into some kind of database system before you can do any interesting analysis. Whether text files or XML or MySQL dumps, they all take quite some time to import. It's like unpacking a huge tar archive, rather than instantaneously mounting an ISO 9660 image, if you get the analogy. You could probably load the data into MySQL and then distribute the raw tablespace files, but I haven't heard of any project that does this. MySQL wasn't built with this in mind. The data could be loaded into MySQL at the toolserver (maybe someone did already?), and we could each run our queries there, but that doesn't scale if many people want to run heavy queries.
When looking around, I found SQLite (www.sqlite.org), a free software (public domain, actually) light-weight SQL database engine. It is server-less and runs entirely as a one-user application, storing tables and indexes in a plain file.
As an experiment, I loaded the first two months of page counter log files for the Swedish Wikipedia into SQLite (version 3). The resulting databse file is 3.1 GB, which bzip2 shrinks to 638 MB. The idea is that you can download these 638 MB, run bunzip2, and then start sqlite3 and run SQL queries right away. It doesn't take many minutes to get started.
Now I want to find out if this is a useful scheme. Then I could set up a process to provide such SQLite dumps for all languages. But perhaps some parts need adjustment or tuning. I need your feedback for this. You'll have to do analyze the Swedish Wikipedia initially, since that's all I provide for now.
Here's what I have done:
First I decode the URL encoding and normalize some page names. Main_Page, Main+Page and Main%20Page are all converted to Main_Page, and even translated to Huvudsida for the Swedish Wikipedia. That means I add up the page counters for these page names and store them under Huvudsida. All page names are stored in one table (names) and given an integer primary key (names.id), to avoid duplicate storage of text strings. The names table now has 1.9 million entries.
Each logfile covers one hour, timestamped in UTC. A table called "times" uses "unix seconds" as a primary key (times.unix) and lists the year, month, day-of-month, week, day-of-week, and hour. Perhaps this was unnecessary given the date and time functions provided by SQLite, but I still believe it can be helpful. For these 2 months (62 days), the times table has 1734 entries.
The big "counts" table contains the language ("sv") as a text field and integer fields for time (references times.unix), name (references names.id) and count. The counts table has 68.3 million entries.
A typical query you can run is
select sum(count), year, month, mday from counts, times, names where names.name='Huvudsida' and year=2007 and names.id=counts.name and counts.time=times.unix group by 2,3,4;
Queries are not necessarily fast, but you can create indexes as you wish. Are there any indexes you would like me to build and supply as part of the distributed database file?
The query above returns this result:
32403|2007|12|9 119005|2007|12|10 117551|2007|12|11 107630|2007|12|12 102178|2007|12|13 88766|2007|12|14 65733|2007|12|15 87048|2007|12|16 106643|2007|12|17 96751|2007|12|18 86955|2007|12|19 74297|2007|12|20 63383|2007|12|21 57908|2007|12|22 59360|2007|12|23 45230|2007|12|24 56469|2007|12|25 58494|2007|12|26 66068|2007|12|27 63538|2007|12|28 65137|2007|12|29 68636|2007|12|30 55821|2007|12|31
Currently, you'll find the database file (both compressed and not) at http://mirabell.runeberg.lysator.liu.se/
Here's what you need to do (UNIX/Linux commands):
sudo apt-get install bzip2 sudo apt-get install sqlite3
wget http://mirabell.runeberg.lysator.liu.se/sv-counts-20080219.db.bz2 bunzip2 sv-counts-20080219.db.bz2 sqlite3 sv-counts-20080219.db
That URL is not permanent, but only available for the current test.