Hi,
I did some "testing" on Domas' pagecounts log files:
original file: pagecounts-20100910-040000.gz downloaded from: http://dammit.lt/wikistats/
The original file "pagecounts-20100910-040000.gz" was parsed to remove all lines except those beginning with "en File". This shows which files were downloaded in that hour, mostly images, but further parsing is needed to remove non-image files (e.g. *.ogg audio).
Example parsed line from pagecounts-20100910-040000.gz:
en File:Alexander_Karelin.jpg 1 9238
The 1 indicates the file was downloaded once in that hour, and 9238 is the number of bytes transferred, which depends on what image scaling was used.
It is located at http://en.wikipedia.org/wiki/File:Alexander_Karelin.jpg and is linked from the page http://en.wikipedia.org/wiki/Aleksandr_Karelin.
We may also want to extract the lines that begin with "commons.m File" and "commons.m Image" from the pagecounts file, as they also contain image links.
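For reference, the filtering could be scripted roughly like this (a Python sketch, assuming the hourly file has been gunzipped first; the script name and the list of non-image extensions are just placeholders):

# filter_pagecounts.py -- rough sketch of the filtering step described above.
# Reads a decompressed pagecounts file on stdin, keeps only the "en File:",
# "commons.m File:" and "commons.m Image:" lines, and drops obvious non-image
# files such as *.ogg audio.
import sys

PREFIXES = ("en File:", "commons.m File:", "commons.m Image:")
NON_IMAGE_EXTENSIONS = (".ogg", ".oga", ".ogv", ".mid", ".pdf", ".djvu")  # extend as needed

def keep(line):
    if not line.startswith(PREFIXES):
        return False
    parts = line.split()
    if len(parts) < 4:          # expect: project title count bytes
        return False
    title = parts[1]            # e.g. "File:Alexander_Karelin.jpg"
    return not title.lower().endswith(NON_IMAGE_EXTENSIONS)

for line in sys.stdin:
    if keep(line):
        sys.stdout.write(line)

It would be run as something like: zcat pagecounts-20100910-040000.gz | python filter_pagecounts.py > en-images-20100910-040000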
After we parse the pagecounts files down to image links only, we can merge them together. The more files we merge, the better our image view data will be for sorting the image list generated by wikix by view frequency.
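A possible sketch of that merge step (Python again; it assumes the filtered hourly files are passed as arguments and still use the original "project title count bytes" layout):

# merge_pagecounts.py -- rough sketch: sum the hourly view counts per image
# across all filtered files and print the totals sorted by view frequency.
import sys
from collections import defaultdict

totals = defaultdict(int)

for path in sys.argv[1:]:                  # the filtered hourly files
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 4:
                continue
            title, views = parts[1], int(parts[2])
            totals[title] += views

# highest view count first, so the most frequently accessed images come first
for title, views in sorted(totals.items(), key=lambda item: item[1], reverse=True):
    print("%d %s" % (views, title))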
Wikix has the complete list of images for the wiki we are creating an image dump for, so any extra images from these pagecounts files that aren't in wikix's image list won't be added to the image dump. Images that are in wikix's list but not in the pagecounts files will still be added to the image dump, but they can be put into a separate tar file to show that they are infrequently accessed.
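Splitting wikix's list along those lines could look something like this (the two input file names, and the assumption that both files use "File:..." titles, are mine, since I don't know the exact formats yet):

# split_image_list.py -- rough sketch: keep only images wikix knows about, and
# separate the ones that never appear in the merged pagecounts data so they can
# go into their own "infrequently accessed" tar file later.
viewed = set()
with open("merged-image-views") as f:        # output of the merge step above
    for line in f:
        views, title = line.split(None, 1)
        viewed.add(title.strip())

with open("wikix-image-list") as wikix, \
     open("images-with-views", "w") as frequent, \
     open("images-without-views", "w") as infrequent:
    for line in wikix:
        title = line.strip()
        if not title:
            continue
        if title in viewed:
            frequent.write(title + "\n")
        else:
            infrequent.write(title + "\n")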
I did the parsing manually with a text editor, but for the next step of merging the pagecounts files we will need to write some scripts.
I think in the end we will not use wikix, as it doesn't create a simple image list from the wiki's XML file.
cheers, Jamie
On 9/10/2010 6:14 PM, Jamie Morken wrote:
That won't really give you the stats you want. That only gives you pageviews for the file description page itself, not for the articles that use the image. I don't think there are any publicly available stats for the latter, though you could estimate it rather well using the dumps of the imagelinks and page database tables and correlating hits for articles with the images they contain.
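Roughly, the correlation step would look something like this (the two intermediate files and their formats are just illustrative; you would generate them yourself from the imagelinks/page dumps and the pagecounts data):

# estimate_image_views.py -- rough sketch of the estimate described above:
# credit each image with the view count of every article that embeds it.
# Assumes two prepared inputs (formats invented for the example):
#   article-views:  "Article_title views"              (from the pagecounts data)
#   article-images: "Article_title File:Image_title"   (from imagelinks + page)
from collections import defaultdict

article_views = {}
with open("article-views") as f:
    for line in f:
        title, views = line.split()
        article_views[title] = int(views)

image_views = defaultdict(int)
with open("article-images") as f:
    for line in f:
        article, image = line.split()
        image_views[image] += article_views.get(article, 0)

for image, views in sorted(image_views.items(), key=lambda item: item[1], reverse=True):
    print("%d %s" % (views, image))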