I've been an enthusiastic downloader and user of Domas' wikistats
or page counter logfiles since they first appeared on December 9,
http://lists.wikimedia.org/pipermail/wikitech-l/2007-December/035435.html
One problem, however, is that they are plain text files that need
to be loaded into some kind of database system before you can do
any interesting analysis. Whether text files or XML or MySQL
dumps, they all take quite some time to import. It's like
unpacking a huge tar archive, rather than instantaneously mounting
an ISO 9660 image, if you get the analogy. You could probably
load the data into MySQL and then distribute the raw tablespace
files, but I haven't heard of any project that does this. MySQL
wasn't built with this in mind. The data could be loaded into
MySQL at the toolserver (maybe someone did already?), and we could
each run our queries there, but that doesn't scale if many people
want to run heavy queries.
When looking around, I found SQLite (
www.sqlite.org), a free
software (public domain, actually) light-weight SQL database
engine. It is server-less and runs entirely as a one-user
application, storing tables and indexes in a plain file.
As an experiment, I loaded the first two months of page counter
log files for the Swedish Wikipedia into SQLite (version 3). The
resulting databse file is 3.1 GB, which bzip2 shrinks to 638 MB.
The idea is that you can download these 638 MB, run bunzip2, and
then start sqlite3 and run SQL queries right away. It doesn't
take many minutes to get started.
Now I want to find out if this is a useful scheme. Then I could
set up a process to provide such SQLite dumps for all languages.
But perhaps some parts need adjustment or tuning. I need your
feedback for this. You'll have to do analyze the Swedish
Wikipedia initially, since that's all I provide for now.
Here's what I have done:
First I decode the URL encoding and normalize some page names.
Main_Page, Main+Page and Main%20Page are all converted to
Main_Page, and even translated to Huvudsida for the Swedish
Wikipedia. That means I add up the page counters for these page
names and store them under Huvudsida. All page names are stored in
one table (names) and given an integer primary key (names.id), to
avoid duplicate storage of text strings. The names table now has
1.9 million entries.
Each logfile covers one hour, timestamped in UTC. A table called
"times" uses "unix seconds" as a primary key (times.unix) and
lists the year, month, day-of-month, week, day-of-week, and hour.
Perhaps this was unnecessary given the date and time functions
provided by SQLite, but I still believe it can be helpful. For
these 2 months (62 days), the times table has 1734 entries.
The big "counts" table contains the language ("sv") as a text
field and integer fields for time (references times.unix), name
(references names.id) and count. The counts table has 68.3
million entries.
A typical query you can run is
select sum(count), year, month, mday
from counts, times, names
where names.name='Huvudsida'
and year=2007
and names.id=counts.name and counts.time=times.unix
group by 2,3,4;
Queries are not necessarily fast, but you can create indexes as
you wish. Are there any indexes you would like me to build and
supply as part of the distributed database file?
The query above returns this result:
32403|2007|12|9
119005|2007|12|10
117551|2007|12|11
107630|2007|12|12
102178|2007|12|13
88766|2007|12|14
65733|2007|12|15
87048|2007|12|16
106643|2007|12|17
96751|2007|12|18
86955|2007|12|19
74297|2007|12|20
63383|2007|12|21
57908|2007|12|22
59360|2007|12|23
45230|2007|12|24
56469|2007|12|25
58494|2007|12|26
66068|2007|12|27
63538|2007|12|28
65137|2007|12|29
68636|2007|12|30
55821|2007|12|31
Currently, you'll find the database file (both compressed and not)
at
http://mirabell.runeberg.lysator.liu.se/
Here's what you need to do (UNIX/Linux commands):
sudo apt-get install bzip2
sudo apt-get install sqlite3
wget
http://mirabell.runeberg.lysator.liu.se/sv-counts-20080219.db.bz2
bunzip2 sv-counts-20080219.db.bz2
sqlite3 sv-counts-20080219.db
That URL is not permanent, but only available for the current
test.
--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik -
http://aronsson.se