Re: [Wikitech-l] image dump status update1

11 Sep 2010


      On 9/10/2010 6:14 PM, Jamie Morken wrote:
...
Hi,
I did some "testing" on Domas' pagecounts log files.
Original file: pagecounts-20100910-040000.gz, downloaded from http://dammit.lt/wikistats/
I parsed the original file to remove all lines except those beginning with "en File".
This shows which files were downloaded in that hour. They are mostly images, but further
parsing is needed to remove non-image files (e.g. *.ogg audio).
Example parsed line from pagecounts-20100910-040000.gz:
en File:Alexander_Karelin.jpg 1 9238
The 1 indicates the file was downloaded once this hour, and the 9238 is the bytes transferred, which
depends on what image scaling was used.
It is located at http://en.wikipedia.org/wiki/File:Alexander_Karelin.jpg and linked from the page
http://en.wikipedia.org/wiki/Aleksandr_Karelin
We may also want to extract the lines that begin with "commons.m File" and "commons.m Image" from
the pagecounts file, as they also contain image links.
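The filtering step described above could be scripted roughly as follows. This is only a sketch: it assumes each pagecounts line has four space-separated fields (project, title, count, bytes) as in the example line, and the names WANTED, parse_pagecounts_line, filter_file_lines and read_file_hits are made up for illustration.

```python
import gzip

# Assumption: project code -> title prefixes to keep, taken from the text above.
WANTED = {
    "en": ("File:",),
    "commons.m": ("File:", "Image:"),
}


def parse_pagecounts_line(line):
    """Return (project, title, count, bytes) or None if the line is malformed."""
    parts = line.rstrip("\n").split(" ")
    if len(parts) != 4:
        return None
    project, title, count, size = parts
    try:
        return project, title, int(count), int(size)
    except ValueError:
        return None


def filter_file_lines(lines, wanted=WANTED):
    """Yield only the records whose project and title prefix are in `wanted`."""
    for line in lines:
        record = parse_pagecounts_line(line)
        if record is None:
            continue
        project, title = record[0], record[1]
        prefixes = wanted.get(project)
        if prefixes and title.startswith(prefixes):
            yield record


def read_file_hits(path):
    """Stream the filtered records from one gzipped hourly pagecounts file."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        yield from filter_file_lines(handle)
```

Calling read_file_hits("pagecounts-20100910-040000.gz") would then stream the matching records without loading the whole file. Dropping non-image files such as *.ogg would still need an extra check on the title's extension.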
After we parse the pagecounts files down to image links only, we can merge them together. The more
files we merge, the better our image view data will be for sorting the image list generated by wikix
by view frequency.
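A possible shape for the merge step, as a sketch only: it assumes each hourly file has already been reduced to (title, count) pairs, and merge_hits is a hypothetical helper name.

```python
from collections import Counter


def merge_hits(hourly_records):
    """Sum per-title counts across any number of hourly record lists.

    `hourly_records` is an iterable of iterables of (title, count) pairs,
    one inner iterable per parsed pagecounts file.
    """
    totals = Counter()
    for records in hourly_records:
        for title, count in records:
            totals[title] += count
    return totals
```

totals.most_common() then lists the titles from most to least viewed over all merged hours.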
Wikix has the complete list of images for the wiki we are creating an image dump for, so any extra
images from these pagecounts files that aren't in wikix's image list won't be added to the image dump.
Images that are in wikix's list but not in the pagecounts files will still be added to the image dump,
but can be put into a tar file marking them as infrequently accessed.
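The rule above (the image list decides what goes in, the counts only decide the order) might look like this. It is a sketch under assumptions: the image list and the counts use the same title form, and split_by_views and min_views are invented names, not anything wikix provides.

```python
def split_by_views(image_list, totals, min_views=1):
    """Split the full image list into (frequent, infrequent).

    Titles that appear only in `totals` are ignored, since the image list
    is authoritative. `frequent` is sorted by descending view count, ties
    broken by title; `infrequent` keeps the order of `image_list`.
    """
    frequent = [t for t in image_list if totals.get(t, 0) >= min_views]
    infrequent = [t for t in image_list if totals.get(t, 0) < min_views]
    frequent.sort(key=lambda t: (-totals[t], t))
    return frequent, infrequent
```

The infrequent list is what would go into the separate tar file.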
I did the parsing manually with a text editor, but for the next step of merging the pagecounts files we will
need to write some scripts.
I think in the end we will not use wikix, as it doesn't create a simple image list from the wiki's XML file.

That won't really give you the stats you want. That only gives you
pageviews for the file description page itself, and not articles that
use the image. I don't think there's any publicly available stats for
the latter, though you could estimate it rather well using the dumps for
the imagelinks and page database tables, then correlating hits for
articles with the images that they contain.
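A rough sketch of that estimate, assuming the page and imagelinks dumps have already been loaded as tuples: (page_id, page_namespace, page_title) for page and (il_from, il_to) for imagelinks, with article hits as a dict keyed by page title. The function name and the input shapes are assumptions for illustration, not an existing tool.

```python
from collections import defaultdict


def estimate_image_views(page_rows, imagelink_rows, article_hits):
    """Estimate views per image by summing the hits of articles that use it.

    page_rows:      iterable of (page_id, page_namespace, page_title)
    imagelink_rows: iterable of (il_from, il_to), where il_from is a page_id
                    and il_to is the image title
    article_hits:   dict mapping article title -> hit count
    """
    # Only main-namespace pages (namespace 0) are counted as articles.
    id_to_title = {pid: title for pid, ns, title in page_rows if ns == 0}

    estimates = defaultdict(int)
    for il_from, il_to in imagelink_rows:
        title = id_to_title.get(il_from)
        if title is None:
            continue  # link comes from a non-article page or an unknown id
        estimates[il_to] += article_hits.get(title, 0)
    return dict(estimates)
```

This counts one image view per article view, so it overestimates for images that a reader never scrolls to, but it should be good enough for ordering.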
-- 
Alex (wikipedia:en:User:Mr.Z-man)
