Wikistats in SQLite3 - Wikitech-l

21 Feb 2008

I've been an enthusiastic downloader and user of Domas' wikistats 
or page counter logfiles since they first appeared on December 9, 
2007:
http://lists.wikimedia.org/pipermail/wikitech-l/2007-December/035435.html

One problem, however, is that they are plain text files that need 
to be loaded into some kind of database system before you can do 
any interesting analysis.  Whether text files or XML or MySQL 
dumps, they all take quite some time to import.  It's like 
unpacking a huge tar archive, rather than instantaneously mounting 
an ISO 9660 image, if you get the analogy.  You could probably 
load the data into MySQL and then distribute the raw tablespace 
files, but I haven't heard of any project that does this.  MySQL 
wasn't built with this in mind.  The data could be loaded into 
MySQL at the toolserver (maybe someone did already?), and we could 
each run our queries there, but that doesn't scale if many people 
want to run heavy queries.

When looking around, I found SQLite (www.sqlite.org), a free 
software (public domain, actually) light-weight SQL database 
engine.  It is server-less and runs entirely as a one-user 
application, storing tables and indexes in a plain file.

As an experiment, I loaded the first two months of page counter 
log files for the Swedish Wikipedia into SQLite (version 3).  The 
resulting database file is 3.1 GB, which bzip2 shrinks to 638 MB. 
The idea is that you can download these 638 MB, run bunzip2, and 
then start sqlite3 and run SQL queries right away.  Getting 
started takes only a few minutes.

Now I want to find out if this is a useful scheme.  Then I could 
set up a process to provide such SQLite dumps for all languages. 
But perhaps some parts need adjustment or tuning.  I need your 
feedback for this.  You'll have to analyze the Swedish Wikipedia 
initially, since that's all I provide for now.

Here's what I have done:

First I decode the URL encoding and normalize some page names.  
Main_Page, Main+Page and Main%20Page are all converted to 
Main_Page, and even translated to Huvudsida for the Swedish 
Wikipedia.  That means I add up the page counters for these page 
names and store them under Huvudsida. All page names are stored in 
one table (names) and given an integer primary key (names.id), to 
avoid duplicate storage of text strings.  The names table now has 
1.9 million entries.
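
The normalization step above could be sketched roughly as follows. 
This is only an illustration, not the actual import script: the 
ALIASES mapping, the helper names normalize() and name_id(), and 
the sample hit counts are all made up, and it assumes that "+" in 
a logged page name always stands for a space.

```python
import sqlite3
from urllib.parse import unquote_plus

# Hypothetical alias table; the post only mentions Main_Page -> Huvudsida.
ALIASES = {"Main_Page": "Huvudsida"}

def normalize(raw_name):
    """Decode URL encoding, use underscores for spaces, apply aliases."""
    name = unquote_plus(raw_name).replace(" ", "_")
    return ALIASES.get(name, name)

def name_id(conn, name):
    """Return names.id for a page name, inserting the name if it is new."""
    conn.execute("INSERT OR IGNORE INTO names (name) VALUES (?)", (name,))
    row = conn.execute("SELECT id FROM names WHERE name = ?", (name,)).fetchone()
    return row[0]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")

# Made-up sample lines: three spellings of the same page in one hour.
totals = {}
for raw, hits in [("Main_Page", 3), ("Main+Page", 2), ("Main%20Page", 1)]:
    key = name_id(conn, normalize(raw))
    totals[key] = totals.get(key, 0) + hits

print(totals)
```

All three spellings end up under a single names.id, so their 
counters are added together before anything is written to the 
counts table.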

Each logfile covers one hour, timestamped in UTC.  A table called 
"times" uses "unix seconds" as a primary key (times.unix) and 
lists the year, month, day-of-month, week, day-of-week, and hour. 
Perhaps this was unnecessary given the date and time functions 
provided by SQLite, but I still believe it can be helpful.  For 
these 2 months (62 days), the times table has 1734 entries.
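
To illustrate how one row of the times table can be derived from 
the hourly timestamp, here is a small sketch.  The column names 
year, month and mday appear in the query further down; week, wday 
and hour are assumed names, and ISO 8601 week numbering with 
Monday as day 1 is an assumption too, since other conventions are 
possible.

```python
from datetime import datetime, timezone

def times_row(unix):
    """Break an hourly UTC timestamp into calendar columns."""
    t = datetime.fromtimestamp(unix, tz=timezone.utc)
    iso = t.isocalendar()
    return {
        "unix": unix,
        "year": t.year,
        "month": t.month,
        "mday": t.day,
        "week": iso[1],   # ISO 8601 week number (assumed convention)
        "wday": iso[2],   # 1 = Monday ... 7 = Sunday (assumed convention)
        "hour": t.hour,
    }

# Midnight UTC on 10 December 2007, expressed as unix seconds.
unix = int(datetime(2007, 12, 10, tzinfo=timezone.utc).timestamp())
row = times_row(unix)
print(row)
```

SQLite's own strftime() with the 'unixepoch' modifier can compute 
the same fields at query time; the precomputed columns simply keep 
the queries shorter.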

The big "counts" table contains the language ("sv") as a text 
field and integer fields for time (references times.unix), name 
(references names.id) and count.  The counts table has 68.3 
million entries.
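
For orientation, the three tables could be declared roughly like 
this.  It is a guess at the schema based on the description 
above, not the actual DDL: the column names lang, week, wday and 
hour and the column types are assumptions, and the rows inserted 
here are toy values, not real counts.

```python
import sqlite3

SCHEMA = """
CREATE TABLE names (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE times (unix INTEGER PRIMARY KEY, year INTEGER, month INTEGER,
                    mday INTEGER, week INTEGER, wday INTEGER, hour INTEGER);
CREATE TABLE counts (lang TEXT,
                     time INTEGER REFERENCES times(unix),
                     name INTEGER REFERENCES names(id),
                     count INTEGER);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Toy rows: one page, two consecutive hours on the same day.
conn.execute("INSERT INTO names VALUES (1, 'Huvudsida')")
conn.execute("INSERT INTO times VALUES (1197244800, 2007, 12, 10, 50, 1, 0)")
conn.execute("INSERT INTO times VALUES (1197248400, 2007, 12, 10, 50, 1, 1)")
conn.execute("INSERT INTO counts VALUES ('sv', 1197244800, 1, 5)")
conn.execute("INSERT INTO counts VALUES ('sv', 1197248400, 1, 7)")

# Daily totals for one page, joining through both lookup tables.
rows = conn.execute("""
    SELECT sum(count), year, month, mday
    FROM counts, times, names
    WHERE names.name = 'Huvudsida'
      AND names.id = counts.name AND counts.time = times.unix
    GROUP BY 2, 3, 4
""").fetchall()
print(rows)
```

Because counts stores only integers for time and name, each of 
its rows stays small, and the page name text is stored just once 
in names.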

A typical query you can run is

   select sum(count), year, month, mday
   from counts, times, names
   where names.name='Huvudsida'
     and year=2007
     and names.id=counts.name and counts.time=times.unix
     group by 2,3,4;

Queries are not necessarily fast, but you can create indexes as 
you wish.  Are there any indexes you would like me to build and 
supply as part of the distributed database file?
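
As a starting point for that discussion, here is a sketch of two 
indexes that might help queries that look up a single page by 
name.  The index names and the choice of columns are only 
suggestions, and the table definitions are the assumed ones, 
reduced to the columns that matter here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE names (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE counts (lang TEXT, time INTEGER, name INTEGER, count INTEGER);
-- Hypothetical index names; the column choice is a guess at what helps.
CREATE INDEX names_name ON names (name);
CREATE INDEX counts_name_time ON counts (name, time);
""")

# Ask SQLite how it would run a per-page query.  The wording of the
# plan differs between SQLite versions, so it is only printed here.
plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT sum(count) FROM counts, names
    WHERE names.name = 'Huvudsida' AND names.id = counts.name
""").fetchall()
for step in plan:
    print(step)

# List the indexes that now exist in the database file.
indexes = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index' ORDER BY name")]
print(indexes)
```

On the real 3.1 GB file, building an index over counts will take 
some time and make the file larger, which is the trade-off to 
weigh before shipping indexes in the download.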

The query above returns this result:

32403|2007|12|9
119005|2007|12|10
117551|2007|12|11
107630|2007|12|12
102178|2007|12|13
88766|2007|12|14
65733|2007|12|15
87048|2007|12|16
106643|2007|12|17
96751|2007|12|18
86955|2007|12|19
74297|2007|12|20
63383|2007|12|21
57908|2007|12|22
59360|2007|12|23
45230|2007|12|24
56469|2007|12|25
58494|2007|12|26
66068|2007|12|27
63538|2007|12|28
65137|2007|12|29
68636|2007|12|30
55821|2007|12|31

Currently, you'll find the database file (both compressed and 
uncompressed) at http://mirabell.runeberg.lysator.liu.se/

Here's what you need to do (UNIX/Linux commands):

  sudo apt-get install bzip2
  sudo apt-get install sqlite3

  wget http://mirabell.runeberg.lysator.liu.se/sv-counts-20080219.db.bz2
  bunzip2 sv-counts-20080219.db.bz2
  sqlite3 sv-counts-20080219.db

That URL is not permanent, but only available for the current 
test.

-- 
  Lars Aronsson (lars(a)aronsson.se)
  Aronsson Datateknik - http://aronsson.se