Hi,
I did some "testing" on Domas' pagecounts log files:
original file: pagecounts-20100910-040000.gz, downloaded from http://dammit.lt/wikistats/
the original file "pagecounts-20100910-040000.gz" was parsed to remove all lines except
those beginning with "en File". This shows which files were downloaded in that hour,
mostly images, though further parsing is needed to remove non-image files
(e.g. *.ogg audio files)
example parsed line from pagecounts-20100910-040000.gz:

en File:Alexander_Karelin.jpg 1 9238

the 1 indicates the file was downloaded once in that hour, and 9238 is the number of
bytes transferred, which depends on what image scaling was used
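
Here's a rough Python sketch of what that filter step could look like; the extension
blacklist is just my guess at what to exclude, and it assumes the standard pagecounts
format of "project title count bytes", one entry per line:

import gzip

# assumed blacklist of non-image extensions; extend as needed
NON_IMAGE_EXTS = ('.ogg', '.oga', '.mid', '.pdf', '.djvu')

with gzip.open('pagecounts-20100910-040000.gz', 'rt', encoding='utf-8',
               errors='replace') as f:
    for line in f:
        # keep only the English-wiki file entries
        if not line.startswith('en File:'):
            continue
        parts = line.split()
        if len(parts) != 4:          # skip malformed lines
            continue
        project, title, count, size = parts
        if title.lower().endswith(NON_IMAGE_EXTS):
            continue
        print(line, end='')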
it is located at http://en.wikipedia.org/wiki/File:Alexander_Karelin.jpg and is linked
from the page http://en.wikipedia.org/wiki/Aleksandr_Karelin
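
Turning a parsed title back into its file-description URL is simple; here's a
hypothetical helper (the safe-character set is an assumption about what MediaWiki
leaves unencoded):

from urllib.parse import quote

def file_url(title):
    # title as parsed from the pagecounts line, e.g. "File:Alexander_Karelin.jpg"
    return 'http://en.wikipedia.org/wiki/' + quote(title, safe=':()')

print(file_url('File:Alexander_Karelin.jpg'))
# http://en.wikipedia.org/wiki/File:Alexander_Karelin.jpg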
We may also want to parse out the lines that begin with "commons.m File" and
"commons.m Image" from the pagecounts file, as they also contain image links
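
str.startswith accepts a tuple, so the filter sketch above only needs a wider prefix
set to cover those entries too (the sample line below is made up):

# assumed prefix set covering the commons entries as well
PREFIXES = ('en File:', 'commons.m File:', 'commons.m Image:')

line = 'commons.m File:Example.jpg 3 12345'   # made-up sample line
print(line.startswith(PREFIXES))              # True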
after we parse the pagecounts files down to image links only, we can merge them
together; the more files we merge, the better our image view data will be for sorting
the image list generated by wikix by view frequency
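
A rough sketch of the merge script, assuming the inputs are the already-filtered files
from the step above and counts are summed per title:

from collections import Counter
import sys

totals = Counter()
for path in sys.argv[1:]:               # the filtered hourly files
    with open(path, encoding='utf-8', errors='replace') as f:
        for line in f:
            parts = line.split()
            if len(parts) != 4:
                continue
            _, title, count, _ = parts
            totals[title] += int(count)

# most-viewed first, which is the order we want for sorting wikix's list
for title, views in totals.most_common():
    print(views, title)

Run as e.g. "python merge_counts.py filtered-*.txt > merged_counts.txt".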
Wikix has the complete list of images for the wiki we are creating an image dump for,
so any extra images from these pagecounts files that aren't in wikix's image list won't
be added to the image dump. Images that are in wikix's list but not in the pagecounts
files will still be added to the image dump, but can be put into a tar file showing
that they are infrequently accessed.
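
Something like this could do the split, assuming wikix's list can be written out as
one title per line (wikix_list.txt and merged_counts.txt are hypothetical file names;
merged_counts.txt is the "views title" output of the merge step above):

# load merged view counts: "views title" per line
viewed = {}
with open('merged_counts.txt', encoding='utf-8') as f:
    for line in f:
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        views, title = parts
        viewed[title.strip()] = int(views)

with open('wikix_list.txt', encoding='utf-8') as f, \
     open('frequent.txt', 'w', encoding='utf-8') as freq, \
     open('infrequent.txt', 'w', encoding='utf-8') as rare:
    for line in f:
        title = line.strip()
        if not title:
            continue
        # pagecounts titles missing from wikix's list are never written,
        # matching the behaviour described above
        (freq if title in viewed else rare).write(title + '\n')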
I did the parsing manually with a text editor, but for the next step of merging the
pagecounts files we will need to write some scripts.
I think in the end we will not use wikix, as it doesn't create a simple image list from
the wiki's XML file.
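
If we drop wikix, a quick-and-dirty way to get an image list straight from the XML dump
is a regex over the wikitext; note this only catches explicit [[File:...]] and
[[Image:...]] links and will miss images pulled in through templates:

import re
import sys

IMG_RE = re.compile(r'\[\[(?:File|Image):([^|\]]+)', re.IGNORECASE)

titles = set()
with open(sys.argv[1], encoding='utf-8', errors='replace') as f:
    for line in f:                      # the dump is huge, so stream it
        for m in IMG_RE.finditer(line):
            titles.add(m.group(1).strip().replace(' ', '_'))

for title in sorted(titles):
    print(title)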
cheers,
Jamie