On Thu, Sep 9, 2010 at 10:54 PM, Jamie Morken <jmorken(a)shaw.ca> wrote:
Hi all,
This is a preliminary list of what needs to be done to generate image dumps. If anyone
can help with #2 by providing the access logs for image usage stats, please send me an email!
1. run wikix to generate a list of images for a given wiki, e.g. enwiki
2. sort the image list by usage frequency taken from the access log files
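Step 2 could be sketched roughly as below. This is only an illustration: the log line layout and file names are assumptions, not the actual Wikimedia access-log format.

```python
import collections
import re

# Assumed pattern: image file name is the last path component of a request
# line. Real access logs may differ; adjust the regex accordingly.
IMG_RE = re.compile(r"/([^/]+\.(?:jpe?g|png|gif|svg))", re.IGNORECASE)

def usage_counts(log_lines):
    """Count how often each image file name appears in access-log lines."""
    counts = collections.Counter()
    for line in log_lines:
        m = IMG_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

def sort_by_usage(image_list, counts):
    """Most-requested images first; images never seen in the logs sort last."""
    return sorted(image_list, key=lambda name: -counts.get(name, 0))
```

With counts in hand, the wikix-generated list can be reordered so the most-viewed images are dumped first.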
Hi,
It would be great to have these image dumps! I wonder if a different kind of
dump might be worthwhile for a different scenario:
* The user only wants the photos for a small set of page ids, e.g. 1000 pages
What would be the proper way to get these photos without downloading
large dumps?
a. Parse the actual HTML pages to get the image URLs (plus
license info), then download the images?
b. Try to find the image URLs using the Commons wikitext
dump (and parse the license info, etc.)?
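For a small set of page ids, a third route is the MediaWiki web API (action=query with generator=images and prop=imageinfo), which returns the file URLs directly. A minimal sketch, kept offline so it only builds the request and parses a sample response; the sample URL is made up:

```python
import json
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"  # same endpoint shape on any MediaWiki wiki

def build_image_query(page_ids):
    """Build a query URL asking for the URL of every image used on the given pages."""
    params = {
        "action": "query",
        "generator": "images",   # iterate over the images embedded in the pages
        "pageids": "|".join(str(p) for p in page_ids),
        "prop": "imageinfo",
        "iiprop": "url",
        "format": "json",
    }
    return API + "?" + urlencode(params)

def extract_image_urls(response_text):
    """Map file title -> file URL from an action=query JSON response."""
    data = json.loads(response_text)
    pages = data.get("query", {}).get("pages", {})
    return {p["title"]: p["imageinfo"][0]["url"]
            for p in pages.values() if "imageinfo" in p}
```

This avoids both HTML scraping and the wikitext dump for small batches, though license info would still need to be fetched per file.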
Both approaches seem complicated, so maybe a different dump would be helpful:
Page id --> list of [ image id | real URL | type (original | dim_xy | thumb) | license ]
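To make the proposal concrete, here is a sketch of what parsing one record of such a dump could look like. The separators (" --> ", ";", "|") and field names are my assumptions for illustration; no such dump format exists:

```python
# Hypothetical record: 'page_id --> image_id|url|type|license[;image_id|url|type|license...]'
def parse_record(line):
    """Parse one proposed dump line into (page_id, list of image dicts)."""
    page_id, _, rest = line.partition(" --> ")
    images = []
    for rec in rest.split(";"):
        image_id, url, kind, license_ = (f.strip() for f in rec.split("|"))
        images.append({"image_id": image_id, "url": url,
                       "type": kind, "license": license_})
    return int(page_id), images
```

A consumer could then filter the dump to its 1000 page ids and fetch only those URLs.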
regards