[Resending as plain text]
I maintain a compacted monthly version of the dammit.lt page view stats, starting with Jan 2010 (not an official WMF project). The aim is to preserve our page view counts for future historians (compare the Twitter archive at the Library of Congress). It could also be used to resurrect http://wikistics.falsikon.de/latest/wikipedia/en/, which was very popular. Alas, the author has vanished and does not reply to requests, and we don't have the source code.
I just applied for storage on dataset1 or ..2, and will publish the monthly < 2 GB files asap.
Each day I download the 24 hourly dammit.lt files and compact them into one daily file. Each month I compact the daily files into one monthly file.
Major space saving: a monthly file with all hourly page views is 8 GB (compressed); with only articles with 5+ page views per month it is even less than 2 GB.
This is because each page title occurs once instead of up to 24*31 times, and the bytes sent field is omitted. All hourly counts are preserved, each prefixed by day number and hour number.
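To make the compaction step concrete, here is a minimal Python sketch. It is not the actual aggregator script. It assumes each hourly input line has four space-separated fields (wiki code, title, count, bytes sent), and it uses the one-character day/hour coding described in the file header quoted in this mail. The names encode_slot and compact are hypothetical, and remapping of project codes and extrapolation for missing hours are left out.

```python
from collections import defaultdict


def encode_slot(day, hour):
    """Code day 1..31 and hour 0..23 as one character each (A.._ and A..X)."""
    return chr(ord("A") + day - 1) + chr(ord("A") + hour)


def compact(hourly_inputs, min_total=5):
    """Merge hourly records into one line per page.

    hourly_inputs: iterable of (day, hour, lines), where each line is
    assumed to look like 'wiki title count bytes'.
    Returns a list of 'wiki title total hourly-counts' strings.
    """
    per_page = defaultdict(list)
    for day, hour, lines in hourly_inputs:
        for line in lines:
            # The bytes sent field is read but deliberately dropped.
            wiki, title, count, _bytes = line.split()
            per_page[(wiki, title)].append((day, hour, int(count)))

    out = []
    for (wiki, title), slots in sorted(per_page.items()):
        total = sum(count for _, _, count in slots)
        if total < min_total:
            continue  # keep only pages with 5+ views in the month
        hourly = ",".join(
            encode_slot(day, hour) + str(count)
            for day, hour, count in sorted(slots)
        )
        out.append(f"{wiki} {title} {total} {hourly}")
    return out
```

Each title is stored once, with all its hourly counts on the same line, which is where most of the saving comes from.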
Here are the first lines of one such file, which also describe the format:
Erik Zachte (on wikibreak till Sep 12)
# Wikimedia article requests (aka page views) for year 2010, month 11
#
# Each line contains four fields separated by spaces
# - wiki code (subproject.project, see below)
# - article title (encoding from original hourly files is preserved to maintain proper sort sequence)
# - monthly total (possibly extrapolated from available data when hours/days in input were missing)
# - hourly counts (only for hours where indeed article requests occurred)
#
# Subproject is language code, followed by project code
# Project is b:wikibooks, k:wiktionary, n:wikinews, q:wikiquote, s:wikisource, v:wikiversity, z:wikipedia
# Note: suffix z added by compression script: project wikipedia happens to be sorted last in dammit.lt files, so add this suffix to fix sort order
#
# To keep hourly counts compact and tidy both day and hour are coded as one character each, as follows:
# Hour 0..23 shown as A..X                            convert to number: ordinal (char) - ordinal ('A')
# Day  1..31 shown as A.._  27=[ 28=\ 29=] 30=^ 31=_  convert to number: ordinal (char) - ordinal ('A') + 1
#
# Original data source: Wikimedia full (=unsampled) squid logs
# These data have been aggregated from hourly pagecount files at http://dammit.lt/wikistats, originally produced by Domas Mituzas
# Daily and monthly aggregator script built by Erik Zachte
# Each day hourly files for previous day are downloaded and merged into one file per day
# Each month daily files are merged into one file per month
#
# This file contains only lines with monthly page request total greater/equal 5
#
# Data for all hours of each day were available in input
#
aa.b File:Broom_icon.svg 6 AV1,IQ1,OT1,QB1,YT1,^K1
aa.b File:Wikimedia.png 7 BO1,BW1,CE1,EV1,LA1,TA1,^A1
aa.b File:Wikipedia-logo-de.png 5 BO1,CE1,EV1,LA1,TA1
aa.b File:Wikiversity-logo.png 7 AB1,BO1,CE1,EV1,LA1,TA1,[C1
aa.b File:Wiktionary-logo-de.png 5 CE1,CM1,EV1,TA1,^N1
aa.b File_talk:Commons-logo.svg 9 CE3,UO3,YE3
aa.b File_talk:Incubator-notext.svg 60 CH3,CL3,DB3,DG3,ET3,FH3,GM3,GO3,IA3,JQ3,KT3,LK3,LL3,MH3,OO3,PF3,XO3,[F3,[O3,]P3
aa.b MediaWiki:Ipb_cant_unblock 5 BO1,JL1,XX1,[F2
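For anyone who wants to read these files, decoding a data line could look roughly like the Python sketch below. It follows the conversion rules in the header above. The function names decode_hourly and parse_line are hypothetical, and the sketch assumes titles contain no spaces and that each hourly item is a day character, an hour character, then a decimal count.

```python
def decode_hourly(field):
    """Decode e.g. 'AV1,^K1' into a dict {(day, hour): count}."""
    counts = {}
    for item in field.split(","):
        day = ord(item[0]) - ord("A") + 1   # 'A'..'_' -> 1..31
        hour = ord(item[1]) - ord("A")      # 'A'..'X' -> 0..23
        counts[(day, hour)] = int(item[2:])
    return counts


def parse_line(line):
    """Split one data line into (wiki, title, monthly_total, hourly_counts)."""
    wiki, title, total, hourly = line.split()
    return wiki, title, int(total), decode_hourly(hourly)


if __name__ == "__main__":
    sample = "aa.b File:Broom_icon.svg 6 AV1,IQ1,OT1,QB1,YT1,^K1"
    wiki, title, total, hourly = parse_line(sample)
    for (day, hour), count in sorted(hourly.items()):
        print(f"{wiki} {title} day {day:2d} hour {hour:2d}: {count}")
```

For months where no input hours were missing, the decoded counts should add up to the monthly total; when the total was extrapolated they may not.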
wikitech-l@lists.wikimedia.org