Hi all,
This is a preliminary list of what needs to be done to generate image dumps. If anyone can help with #2 by providing the access logs with image usage stats, please send me an email!
1. run wikix to generate a list of images for a given wiki, e.g. enwiki
2. sort the image list based on usage frequency from access log files
3. download images over HTTP
4. scale images down to desired thumbnail size
5. tar image collection into multiple tar files (first tar file has the most commonly used images)
6. post to Wikimedia image dump or bittorrent
cheers, Jamie
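
[Editor's note: for concreteness, here is a minimal Python sketch of what steps 3-5 might look like, assuming steps 1 and 2 have already produced a usage-ranked list of image URLs. The file name, thumbnail size, chunk size and User-Agent string are all illustrative assumptions, not part of Jamie's proposal.]

import os
import shutil
import tarfile
import urllib.request
from PIL import Image  # Pillow

# All of the following values are assumptions for the sketch, not part of the proposal.
RANKED_LIST = "image_urls_by_usage.txt"  # one URL per line, most-used first (assumed output of steps 1-2)
THUMB_SIZE = (320, 320)                  # assumed thumbnail bound for step 4
IMAGES_PER_TAR = 10000                   # assumed number of images per tar file for step 5

def download_and_thumbnail(url, out_dir="thumbs"):
    """Steps 3 and 4: fetch one image over HTTP and scale it down in place."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, url.rsplit("/", 1)[-1])
    req = urllib.request.Request(url, headers={"User-Agent": "image-dump-sketch/0.1"})
    with urllib.request.urlopen(req) as resp, open(path, "wb") as out:
        shutil.copyfileobj(resp, out)
    with Image.open(path) as im:
        im.thumbnail(THUMB_SIZE)
        im.save(path)
    return path

def main():
    with open(RANKED_LIST) as f:
        urls = [line.strip() for line in f if line.strip()]
    # Step 5: chunk the ranked list so the first tar file holds the most used images.
    for chunk_no, start in enumerate(range(0, len(urls), IMAGES_PER_TAR)):
        with tarfile.open("images-%03d.tar" % chunk_no, "w") as tar:
            for url in urls[start:start + IMAGES_PER_TAR]:
                try:
                    tar.add(download_and_thumbnail(url))
                except Exception as exc:
                    print("skipping %s: %s" % (url, exc))

if __name__ == "__main__":
    main()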
On 09/09/2010 10:54 PM, Jamie Morken wrote:
Hi all,
If anyone can help with #2 by providing the access logs with image usage stats, please send me an email!
2. sort the image list based on usage frequency from access log files
The raw data is one file per hour, containing a list of page names and visit counts. From just one such file, you get statistics on which pages were the most visited during that particular hour. By combining more files, you can get statistics for a whole day, a week, a month, a year, all Mondays, all 7am hours around the year, the 3rd Sunday after Easter, or whatever. The combinations are almost endless.
How do we boil this down to a few datasets that are most useful? Is that the total visit count per month? Or what?
Are these visitor stats already in a database on the toolserver? If so, how are they organized?
I wrote some documentation on the access log format here, http://www.archive.org/details/wikipedia_visitor_stats_200712
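
[Editor's note: to make the "combining more files" idea concrete, here is a minimal sketch that sums a set of hourly files into one ranking, assuming the line format documented at the archive.org link above (project, page title, view count, bytes transferred). The file-name glob is an assumption.]

import glob
import gzip
from collections import Counter

totals = Counter()
# Assumed file naming: gzipped hourly dumps for one month, e.g. pagecounts-200712*.gz
for path in glob.glob("pagecounts-200712*.gz"):
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 4:
                continue  # skip malformed lines
            project, title, count, _bytes = parts
            totals[(project, title)] += int(count)

# Most-visited pages over the whole period; the same totals could feed step 2
# of Jamie's list (ranking image pages by usage).
for (project, title), count in totals.most_common(20):
    print(count, project, title)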
Hi Lars, are you going to upload more logs to Internet Archive? Domas' website only shows the last 3 (?) months. I think that there are many of these files on the Toolserver, but we must preserve this raw data in another secure (for posterity) place.
On September 10, emijrp wrote:
Hi Lars, are you going to upload more logs to Internet Archive?
No, I can't. I have not downloaded more recent logs. I only uploaded what was on my disk, because I needed to free some space.
Domas' website only shows the last 3 (?) months. I think that there are many of these files on the Toolserver, but we must preserve this raw data in another secure (for posterity) place.
"Must"? Says who? That sounds like a naive opinion. If you have an interest, you can do the job. Otherwise they will get lost. In the future, maybe this should be a task for the paid staff, but so far it has not been.
Thanks! :)
On Thu, Sep 9, 2010 at 10:54 PM, Jamie Morken jmorken@shaw.ca wrote:
1. run wikix to generate a list of images for a given wiki, e.g. enwiki
2. sort the image list based on usage frequency from access log files
Hi,
It will be great to have these image dumps! I wonder if a different dump might be worth it for a different scenario:
* A user only wants to get the photos for a small set of page ids, e.g. 1000 pages
What would be the proper way to get these photos without downloading large dumps?
a. Parse the actual HTML pages, get the actual image URLs (plus license info), and then download the images?
b. Try to find the actual image URLs using the Commons wikitext dump (and parse license info, etc.)?
Both approaches seem complicated, so maybe a different dump would be helpful:
Page id --> List of [ Image id | real url | type (original | dim_xy | thumb) | license ]
regards
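
[Editor's note: to illustrate Jose's proposal, one record of such a dump might be serialized like this, e.g. as one JSON object per line. This is a sketch only; every field name and value below is invented.]

import json

# Every field name and value below is invented for illustration only.
record = {
    "page_id": 1234,
    "images": [
        {
            "image_id": 5678,
            "url": "https://upload.wikimedia.org/wikipedia/commons/example.jpg",
            "type": "thumb",            # one of: original, dim_xy, thumb
            "license": "CC-BY-SA-3.0",
        },
    ],
}
print(json.dumps(record))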
2010/9/10 Jose jmalv04@gmail.com:
Both approaches seem complicated, so maybe a different dump would be helpful:
Page id --> List of [ Image id | real url | type (original | dim_xy | thumb) | license ]
http://commons.wikimedia.org/w/api.php?action=query&prop=imageinfo&i...
Returns image URL, width, height and thumbnail URL for a 200px thumbnail.
Roan Kattouw (Catrope)
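
[Editor's note: a hedged sketch of what such an imageinfo request might look like from Python. The exact parameter set is an assumption based on Roan's description and the truncated URL above; the file title is just an example.]

import json
import urllib.parse
import urllib.request

# Parameter set is an assumption based on Roan's description; the title is an example.
params = {
    "action": "query",
    "titles": "File:Example.jpg",
    "prop": "imageinfo",
    "iiprop": "url|size",
    "iiurlwidth": "200",   # also request a 200px-wide thumbnail URL
    "format": "json",
}
url = "https://commons.wikimedia.org/w/api.php?" + urllib.parse.urlencode(params)
req = urllib.request.Request(url, headers={"User-Agent": "imageinfo-sketch/0.1"})
with urllib.request.urlopen(req) as resp:
    data = json.load(resp)

for page in data.get("query", {}).get("pages", {}).values():
    for info in page.get("imageinfo", []):
        print(info["url"], info["width"], info["height"], info.get("thumburl"))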
On Fri, Sep 10, 2010 at 2:44 PM, Roan Kattouw roan.kattouw@gmail.com wrote:
http://commons.wikimedia.org/w/api.php?action=query&prop=imageinfo&i...
Returns image URL, width, height and thumbnail URL for a 200px thumbnail.
Thanks, this may be useful. So let's say I want to get all images for the Ant page; the steps would be:
1. Parse the Ant page wikitext and get all Image: links
2. For every image link, get its Commons page id (can I issue the above query using title ids instead of numeric ids? If not, use the Commons repository to map image titles to numeric ids)
3. Issue a query like the one you detail above (but the results don't show license info!).
Still, I think having a small dump with metadata is better than sending a lot of API queries.
thanks
On Fri, Sep 10, 2010 at 3:09 PM, Jose jmalv04@gmail.com wrote:
Thanks, this may be useful. So let's say I want to get all images for the Ant page; the steps would be:
Just use prop=images as a generator on en.wikipedia.org. This will yield the thumb urls as well as the urls of the commons pages, which can then be fetched separately.
Bryan
2010/9/10 Bryan Tong Minh bryan.tongminh@gmail.com:
Just use prop=images as a generator on en.wikipedia.org. This will yield the thumb urls as well as the urls of the commons pages, which can then be fetched separately.
Concrete example:
http://en.wikipedia.org/w/api.php?action=query&generator=images&giml...
Licensing info is not available through the API because it's just some text or template on the image description page; it has no meaning to the MediaWiki software.
Roan Kattouw (Catrope)
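
[Editor's note: a hedged sketch of the same kind of generator=images query from Python. Parameter values are illustrative and continuation handling is omitted, so this is not a reconstruction of the exact truncated URL above.]

import json
import urllib.parse
import urllib.request

# Parameter values are illustrative; continuation handling is omitted.
params = {
    "action": "query",
    "titles": "Ant",
    "generator": "images",
    "gimlimit": "50",
    "prop": "imageinfo",
    "iiprop": "url",
    "iiurlwidth": "200",
    "format": "json",
}
url = "https://en.wikipedia.org/w/api.php?" + urllib.parse.urlencode(params)
req = urllib.request.Request(url, headers={"User-Agent": "imageinfo-sketch/0.1"})
with urllib.request.urlopen(req) as resp:
    data = json.load(resp)

for page in data.get("query", {}).get("pages", {}).values():
    for info in page.get("imageinfo", []):
        # descriptionurl points at the file description page, where the license
        # templates live (not machine-readable, as Roan notes above)
        print(page["title"], info.get("thumburl"), info.get("descriptionurl"))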