Hi,
I did some "testing" on Domas' pagecounts log files:
original file: pagecounts-20100910-040000.gz, downloaded from http://dammit.lt/wikistats/
the original file "pagecounts-20100910-040000.gz" was parsed to remove all lines except
those beginning with "en File". This shows which files were downloaded in that hour,
mostly images, though further parsing is needed to remove non-image files
(e.g. *.ogg audio files)
example parsed line from pagecounts-20100910-040000.gz:

en File:Alexander_Karelin.jpg 1 9238

the 1 indicates the file was downloaded once in that hour, and 9238 is the number of
bytes transferred, which depends on what image scaling was used
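
Here's a rough Python sketch of what that filter step could look like; the extension
blacklist is just my guess at what to exclude, and it assumes the standard pagecounts
format of "project title count bytes", one entry per line:

import gzip

# assumed blacklist of non-image extensions; extend as needed
NON_IMAGE_EXTS = ('.ogg', '.oga', '.mid', '.pdf', '.djvu')

with gzip.open('pagecounts-20100910-040000.gz', 'rt', encoding='utf-8',
               errors='replace') as f:
    for line in f:
        # keep only the English-wiki file entries
        if not line.startswith('en File:'):
            continue
        parts = line.split()
        if len(parts) != 4:          # skip malformed lines
            continue
        project, title, count, size = parts
        if title.lower().endswith(NON_IMAGE_EXTS):
            continue
        print(line, end='')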
it is located at http://en.wikipedia.org/wiki/File:Alexander_Karelin.jpg and is linked
from the page http://en.wikipedia.org/wiki/Aleksandr_Karelin
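
Turning a parsed title back into its file-description URL is simple; here's a
hypothetical helper (the safe-character set is an assumption about what MediaWiki
leaves unencoded):

from urllib.parse import quote

def file_url(title):
    # title as parsed from the pagecounts line, e.g. "File:Alexander_Karelin.jpg"
    return 'http://en.wikipedia.org/wiki/' + quote(title, safe=':()')

print(file_url('File:Alexander_Karelin.jpg'))
# http://en.wikipedia.org/wiki/File:Alexander_Karelin.jpg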
We may also want to parse out the lines that begin with "commons.m File" and
"commons.m Image" from the pagecounts file, as they also contain image links
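
str.startswith accepts a tuple, so the filter sketch above only needs a wider prefix
set to cover those entries too (the sample line below is made up):

# assumed prefix set covering the commons entries as well
PREFIXES = ('en File:', 'commons.m File:', 'commons.m Image:')

line = 'commons.m File:Example.jpg 3 12345'   # made-up sample line
print(line.startswith(PREFIXES))              # True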
after we parse the pagecounts files down to image links only, we can merge them
together; the more files we merge, the better our image view data will be for sorting
the image list generated by wikix by view frequency
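
A rough sketch of the merge script, assuming the inputs are the already-filtered files
from the step above and counts are summed per title:

from collections import Counter
import sys

totals = Counter()
for path in sys.argv[1:]:               # the filtered hourly files
    with open(path, encoding='utf-8', errors='replace') as f:
        for line in f:
            parts = line.split()
            if len(parts) != 4:
                continue
            _, title, count, _ = parts
            totals[title] += int(count)

# most-viewed first, which is the order we want for sorting wikix's list
for title, views in totals.most_common():
    print(views, title)

Run as e.g. "python merge_counts.py filtered-*.txt > merged_counts.txt".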
Wikix has the complete list of images for the wiki we are creating an image dump for,
so any extra images from these pagecounts files that aren't in wikix's image list won't
be added to the image dump. Images that are in wikix's list but not in the pagecounts
files will still be added to the image dump, but can be put into a tar file showing
that they are infrequently accessed.
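
Something like this could do the split, assuming wikix's list can be written out as
one title per line (wikix_list.txt and merged_counts.txt are hypothetical file names;
merged_counts.txt is the "views title" output of the merge step above):

# load merged view counts: "views title" per line
viewed = {}
with open('merged_counts.txt', encoding='utf-8') as f:
    for line in f:
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        views, title = parts
        viewed[title.strip()] = int(views)

with open('wikix_list.txt', encoding='utf-8') as f, \
     open('frequent.txt', 'w', encoding='utf-8') as freq, \
     open('infrequent.txt', 'w', encoding='utf-8') as rare:
    for line in f:
        title = line.strip()
        if not title:
            continue
        # pagecounts titles missing from wikix's list are never written,
        # matching the behaviour described above
        (freq if title in viewed else rare).write(title + '\n')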
I did the parsing manually with a text editor, but for the next step of merging the
pagecounts files we will need to write some scripts.
I think in the end we will not use wikix, as it doesn't create a simple image list from
the wiki's XML file.
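
If we drop wikix, a quick-and-dirty way to get an image list straight from the XML dump
is a regex over the wikitext; note this only catches explicit [[File:...]] and
[[Image:...]] links and will miss images pulled in through templates:

import re
import sys

IMG_RE = re.compile(r'\[\[(?:File|Image):([^|\]]+)', re.IGNORECASE)

titles = set()
with open(sys.argv[1], encoding='utf-8', errors='replace') as f:
    for line in f:                      # the dump is huge, so stream it
        for m in IMG_RE.finditer(line):
            titles.add(m.group(1).strip().replace(' ', '_'))

for title in sorted(titles):
    print(title)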
cheers,
Jamie