Hi all,
This is a preliminary list of what needs to be done to generate image dumps. If anyone can help with #2 by providing the access logs with image usage stats, please send me an email!
1. run wikix to generate a list of images for a given wiki, e.g. enwiki
2. sort the image list based on usage frequency from access log files
3. download images over HTTP
4. scale images down to desired thumbnail size
5. tar image collection into multiple tar files (first tar file has the most commonly used images)
6. post to Wikimedia image dump or bittorrent
cheers, Jamie
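
[Editor's note: for concreteness, here is a minimal Python sketch of what steps 3-5 might look like, assuming steps 1 and 2 have already produced a usage-ranked list of image URLs. The file name, thumbnail size, chunk size and User-Agent string are all illustrative assumptions, not part of Jamie's proposal.]

import os
import shutil
import tarfile
import urllib.request
from PIL import Image  # Pillow

# All of the following values are assumptions for the sketch, not part of the proposal.
RANKED_LIST = "image_urls_by_usage.txt"  # one URL per line, most-used first (assumed output of steps 1-2)
THUMB_SIZE = (320, 320)                  # assumed thumbnail bound for step 4
IMAGES_PER_TAR = 10000                   # assumed number of images per tar file for step 5

def download_and_thumbnail(url, out_dir="thumbs"):
    """Steps 3 and 4: fetch one image over HTTP and scale it down in place."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, url.rsplit("/", 1)[-1])
    req = urllib.request.Request(url, headers={"User-Agent": "image-dump-sketch/0.1"})
    with urllib.request.urlopen(req) as resp, open(path, "wb") as out:
        shutil.copyfileobj(resp, out)
    with Image.open(path) as im:
        im.thumbnail(THUMB_SIZE)
        im.save(path)
    return path

def main():
    with open(RANKED_LIST) as f:
        urls = [line.strip() for line in f if line.strip()]
    # Step 5: chunk the ranked list so the first tar file holds the most used images.
    for chunk_no, start in enumerate(range(0, len(urls), IMAGES_PER_TAR)):
        with tarfile.open("images-%03d.tar" % chunk_no, "w") as tar:
            for url in urls[start:start + IMAGES_PER_TAR]:
                try:
                    tar.add(download_and_thumbnail(url))
                except Exception as exc:
                    print("skipping %s: %s" % (url, exc))

if __name__ == "__main__":
    main()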
On 09/09/2010 10:54 PM, Jamie Morken wrote:
Hi all,
If anyone can help with #2 by providing the access logs with image usage stats, please send me an email!
2. sort the image list based on usage frequency from access log files
The raw data is one file per hour, containing a list of page names and visit counts. From just one such file, you get statistics on which pages were the most visited during that particular hour. By combining more files, you can get statistics for a whole day, a week, a month, a year, all Mondays, all 7am hours around the year, the 3rd Sunday after Easter, or whatever. The combinations are almost endless.
How do we boil this down to a few datasets that are most useful? Is that the total visit count per month? Or what?
Are these visitor stats already in a database on the toolserver? If so, how are they organized?
I wrote some documentation on the access log format here, http://www.archive.org/details/wikipedia_visitor_stats_200712
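
[Editor's note: to make the "combining more files" idea concrete, here is a minimal sketch that sums a set of hourly files into one ranking, assuming the line format documented at the archive.org link above (project, page title, view count, bytes transferred). The file-name glob is an assumption.]

import glob
import gzip
from collections import Counter

totals = Counter()
# Assumed file naming: gzipped hourly dumps for one month, e.g. pagecounts-200712*.gz
for path in glob.glob("pagecounts-200712*.gz"):
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 4:
                continue  # skip malformed lines
            project, title, count, _bytes = parts
            totals[(project, title)] += int(count)

# Most-visited pages over the whole period; the same totals could feed step 2
# of Jamie's list (ranking image pages by usage).
for (project, title), count in totals.most_common(20):
    print(count, project, title)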
Hi Lars, are you going to upload more logs to Internet Archive? Domas' website only shows the last 3 (?) months. I think that there are many of these files on the Toolserver, but we must preserve this raw data in another secure (for posterity) place.
On September 10, emijrp wrote:
Hi Lars, are you going to upload more logs to Internet Archive?
No, I can't. I have not downloaded more recent logs. I only uploaded what was on my disk, because I needed to free some space.
Domas' website only shows the last 3 (?) months. I think that there are many of these files on the Toolserver, but we must preserve this raw data in another secure (for posterity) place.
"Must"? Says who? That sounds like a naive opinion. If you have an interest, you can do the job. Otherwise they will get lost. In the future, maybe this should be a task for the paid staff, but so far it has not been.
Thanks! :)
On Thu, Sep 9, 2010 at 10:54 PM, Jamie Morken jmorken@shaw.ca wrote:
1. run wikix to generate a list of images for a given wiki, e.g. enwiki
2. sort the image list based on usage frequency from access log files
Hi,
It will be great to have these image dumps! I wonder if a different dump might be worth it for a different scenario:
* A user only wants to get the photos for a small set of page ids, e.g. 1000 pages
What would be the proper way to get these photos without downloading large dumps?
a. Parse the actual HTML pages, get the actual image URLs (plus license info), and then download the images?
b. Try to find the actual image URLs using the Commons wikitext dump (and parse license info, etc.)?
Both approaches seem complicated, so maybe a different dump would be helpful:
Page id --> List of [ Image id | real url | type (original | dim_xy | thumb) | license ]
regards
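
[Editor's note: to illustrate Jose's proposal, one record of such a dump might be serialized like this, e.g. as one JSON object per line. This is a sketch only; every field name and value below is invented.]

import json

# Every field name and value below is invented for illustration only.
record = {
    "page_id": 1234,
    "images": [
        {
            "image_id": 5678,
            "url": "https://upload.wikimedia.org/wikipedia/commons/example.jpg",
            "type": "thumb",            # one of: original, dim_xy, thumb
            "license": "CC-BY-SA-3.0",
        },
    ],
}
print(json.dumps(record))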
2010/9/10 Jose jmalv04@gmail.com:
Both approaches seem complicated, so maybe a different dump would be helpful:
Page id --> List of [ Image id | real url | type (original | dim_xy | thumb) | license ]
http://commons.wikimedia.org/w/api.php?action=query&prop=imageinfo&i...
Returns image URL, width, height and thumbnail URL for a 200px thumbnail.
Roan Kattouw (Catrope)
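
[Editor's note: a hedged sketch of what such an imageinfo request might look like from Python. The exact parameter set is an assumption based on Roan's description and the truncated URL above; the file title is just an example.]

import json
import urllib.parse
import urllib.request

# Parameter set is an assumption based on Roan's description; the title is an example.
params = {
    "action": "query",
    "titles": "File:Example.jpg",
    "prop": "imageinfo",
    "iiprop": "url|size",
    "iiurlwidth": "200",   # also request a 200px-wide thumbnail URL
    "format": "json",
}
url = "https://commons.wikimedia.org/w/api.php?" + urllib.parse.urlencode(params)
req = urllib.request.Request(url, headers={"User-Agent": "imageinfo-sketch/0.1"})
with urllib.request.urlopen(req) as resp:
    data = json.load(resp)

for page in data.get("query", {}).get("pages", {}).values():
    for info in page.get("imageinfo", []):
        print(info["url"], info["width"], info["height"], info.get("thumburl"))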
On Fri, Sep 10, 2010 at 2:44 PM, Roan Kattouw roan.kattouw@gmail.com wrote:
http://commons.wikimedia.org/w/api.php?action=query&prop=imageinfo&i...
Returns image URL, width, height and thumbnail URL for a 200px thumbnail.
Thanks, this may be useful. So let's say I want to get all images for the Ant page; the steps would be:
1. Parse the Ant page wikitext and get all Image: links
2. For every image link, get its Commons page id (can I issue the above query using title ids instead of numeric ids? If not, use the Commons repository to map image titles to numeric ids)
3. Issue a query like the one you detail above (but the results don't show license info!).
Still, I think having a small dump with metadata is better than sending a lot of API queries.
thanks
On Fri, Sep 10, 2010 at 3:09 PM, Jose jmalv04@gmail.com wrote:
Thanks, this may be useful. So let's say I want to get all images for the Ant page; the steps would be:
Just use prop=images as a generator on en.wikipedia.org. This will yield the thumb urls as well as the urls of the commons pages, which can then be fetched separately.
Bryan
2010/9/10 Bryan Tong Minh bryan.tongminh@gmail.com:
Just use prop=images as a generator on en.wikipedia.org. This will yield the thumb urls as well as the urls of the commons pages, which can then be fetched separately.
Concrete example:
http://en.wikipedia.org/w/api.php?action=query&generator=images&giml...
Licensing info is not available through the API because it's just some text or template on the image description page; it has no meaning to the MediaWiki software.
Roan Kattouw (Catrope)
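
[Editor's note: a hedged sketch of the same kind of generator=images query from Python. Parameter values are illustrative and continuation handling is omitted, so this is not a reconstruction of the exact truncated URL above.]

import json
import urllib.parse
import urllib.request

# Parameter values are illustrative; continuation handling is omitted.
params = {
    "action": "query",
    "titles": "Ant",
    "generator": "images",
    "gimlimit": "50",
    "prop": "imageinfo",
    "iiprop": "url",
    "iiurlwidth": "200",
    "format": "json",
}
url = "https://en.wikipedia.org/w/api.php?" + urllib.parse.urlencode(params)
req = urllib.request.Request(url, headers={"User-Agent": "imageinfo-sketch/0.1"})
with urllib.request.urlopen(req) as resp:
    data = json.load(resp)

for page in data.get("query", {}).get("pages", {}).values():
    for info in page.get("imageinfo", []):
        # descriptionurl points at the file description page, where the license
        # templates live (not machine-readable, as Roan notes above)
        print(page["title"], info.get("thumburl"), info.get("descriptionurl"))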