Re: [Multimedia] [Commons-l] Hashing Wikimedia Commons

Jean-Frédéric

4 Sep 2014

2:49 p.m.

...

The first three we can get from pretty much either API, or extract directly from a dump file. The latter is eluding us though, for two reasons. One is that a file, like 30C3_Commons_Machinery_2.jpg, is actually in the /b/ba/ directory - but where this /b/ba/ comes from (a hash?) is unclear to us now, and it's not something we find in the dumps - though we can get it from one of the APIs.

Yes, /b/ba is based on the first two hex digits of the MD5 hash of the title:

md5( "30C3_Commons_Machinery_2.jpg" ) -> ba253c78d894a80788940a3ca765debb
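For readers who want to reproduce this, here is a minimal Python sketch of the rule just described. The function name is made up for illustration, and it assumes the title is given without the "File:" prefix and is normalized the way it appears in the URL (spaces as underscores); any further normalization MediaWiki applies is not handled here.

```python
import hashlib


def commons_hash_path(title: str) -> str:
    """Return the 'x/xy' directory fragment for a file title.

    Assumption: `title` has no 'File:' prefix; spaces are turned into
    underscores, as in the URL form of the title.
    """
    normalized = title.replace(" ", "_")
    digest = hashlib.md5(normalized.encode("utf-8")).hexdigest()
    # First hex digit, then the first two hex digits of the MD5 digest.
    return f"{digest[0]}/{digest[:2]}"


print(commons_hash_path("30C3_Commons_Machinery_2.jpg"))
```

For the file discussed here, the result should correspond to the /b/ba/ directory mentioned above.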

But this is "arcane knowledge" which nobody should really rely on. The canonical way would be to use

https://commons.wikimedia.org/wiki/Special:Redirect/file/30C3_Commons_Machin...

Which generates a redirect to

https://upload.wikimedia.org/wikipedia/commons/b/ba/30C3_Commons_Machinery_2...

To get a thumbnail, you can directly manipulate that URL, by inserting "thumb/" and the desired size in the correct location (maybe Special:Redirect can do that for you, but I do not know how):

https://upload.wikimedia.org/wikipedia/commons/thumb/b/ba/30C3_Commons_Machi...
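The URL manipulation described above can be sketched in Python roughly as follows. This is an illustration only: the helper name is hypothetical, and it assumes the simple "<width>px-<filename>" thumbnail naming used for plain JPG and PNG files; other file types may need a different scheme.

```python
from urllib.parse import urlsplit, urlunsplit


def direct_thumb_url(original_url: str, width: int) -> str:
    """Derive a thumbnail URL from the URL of the original file.

    Assumption: the path ends in /<x>/<xy>/<filename> and the thumbnail
    is named '<width>px-<filename>' (plain JPG/PNG case).
    """
    parts = urlsplit(original_url)
    segments = parts.path.split("/")
    filename = segments[-1]
    # Insert "thumb" just before the two hash directories.
    segments.insert(len(segments) - 3, "thumb")
    # The thumbnail lives in a directory named after the original file.
    segments.append(f"{width}px-{filename}")
    return urlunsplit(parts._replace(path="/".join(segments)))


print(direct_thumb_url(
    "https://upload.wikimedia.org/wikipedia/commons/a/a9/Example.jpg", 100))
```

The path is not decoded or re-encoded, so any percent-encoding already present in the original URL is carried over unchanged.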

If I am not mistaken you can use thumb.php to get the needed thumb? https://commons.wikimedia.org/w/thumb.php?f=Example.jpg&width=100

(That’s what I used in my CommonsDownloader [1])

[1] https://github.com/Commonists/CommonsDownloader/blob/master/commonsdownloader/thumbnaildownload.py

...

Hope that helps,

-- Jean-Frédéric


Daniel Schwen

4 Sep 2014

4:04 p.m.

New subject: [Commons-l] Hashing Wikimedia Commons

I was told thumb.php is evil (for lack of caching). I'm using Special:Redirect with the width=640 parameter.

Daniel

Commons-l mailing list Commons-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/commons-l

Derk-Jan Hartman

10:44 p.m.

New subject: [Commons-l] Hashing Wikimedia Commons

On 4 sep. 2014, at 15:04, Daniel Schwen daniel@schwen.de wrote:

...

I was told thumb.php is evil (for lack of caching). I'm using special:redirect with the width=640 parameter. Daniel

Correct, better not to rely on thumb.php. The servers will generate the thumbnail anyway if it is not yet present at the canonical address, which Special:Redirect can point you to.
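As an illustration of the Special:Redirect approach with a width parameter, a small Python sketch follows. The helper name is hypothetical; the URL pattern and the width parameter are the ones mentioned in this thread.

```python
from typing import Optional
from urllib.parse import quote, urlencode

COMMONS = "https://commons.wikimedia.org"


def redirect_url(filename: str, width: Optional[int] = None) -> str:
    """Build a Special:Redirect/file URL, optionally asking for a width.

    Assumption: `filename` is the bare file name without 'File:'.
    """
    name = quote(filename.replace(" ", "_"), safe="")
    url = f"{COMMONS}/wiki/Special:Redirect/file/{name}"
    if width is not None:
        url += "?" + urlencode({"width": width})
    return url


print(redirect_url("30C3_Commons_Machinery_2.jpg", 640))
```

Following the redirect (for example with an HTTP client that reports the final URL) should then give the canonical upload.wikimedia.org address without having to compute the hash directory yourself.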

Also, almost all this info can be retrieved in one go from the api.php of course:

http://commons.wikimedia.org/w/api.php?action=query&titles=File:30C3_Com...

This lists almost all the info of the latest revision of the file.
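The query URL above is cut off in the archive, so the exact parameters used are not known. As a rough sketch, an imageinfo query could be assembled in Python like this; the choice of iiprop values and the helper name are assumptions for illustration, not what the original link contained.

```python
from typing import Optional
from urllib.parse import urlencode

API = "https://commons.wikimedia.org/w/api.php"


def imageinfo_query_url(filename: str, thumb_width: Optional[int] = None) -> str:
    """Build an action=query / prop=imageinfo URL for one file."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "imageinfo",
        "titles": f"File:{filename}",
        # Assumed selection of fields; adjust to what you actually need.
        "iiprop": "url|size|mime|sha1|extmetadata",
    }
    if thumb_width is not None:
        # Ask the API to include a thumbnail URL of this width.
        params["iiurlwidth"] = thumb_width
    return f"{API}?{urlencode(params)}"


print(imageinfo_query_url("30C3_Commons_Machinery_2.jpg", 640))
```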

Multimedia mailing list Multimedia@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/multimedia

Derk-Jan Hartman

10:47 p.m.

New subject: [Commons-l] Hashing Wikimedia Commons

Correct, better not to rely on thumb.php. The servers will generate the thumbnail anyway if it is not yet present at the canonical address, which Special:Redirect can point you to.

Also, almost all this info can be retrieved in one go from the api.php of course:

http://commons.wikimedia.org/w/api.php?action=query&titles=File:30C3_Com...

This lists almost all the info of the latest revision of the file.


Jonas Öberg

5 Sep 2014

11:21 a.m.

New subject: [Commons-l] Hashing Wikimedia Commons

Thanks to everyone who took time to contribute here!

Let me try to sum up, from my understanding. For metadata information about an image, using the imageinfo/extmetadata API is sensible for the moment. We're aware of, and followed, the talks on the structured data project during Wikimania, and we're quite keen to see the results of that when and if it starts being useful.

For thumbnails, there's no way to know if a thumbnail size has already been rendered or not, but given that the MediaViewer has a default list of widths that correspond to popular screen size resolutions[1], it's a fair bet that for instance 640x and 800x would work, except for situations when the image file is smaller than the requested thumbnail size.
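The width choice described in this paragraph could be sketched in Python as below. Only 640 and 800 are named in the thread, so the list of common widths is deliberately limited to those two; the function name and the "return None to fall back to the original file" convention are assumptions for illustration.

```python
from typing import Optional, Sequence

# Only these two widths are named in the thread; extend as appropriate.
COMMON_WIDTHS = (640, 800)


def pick_thumb_width(wanted: int, original_width: int,
                     common_widths: Sequence[int] = COMMON_WIDTHS) -> Optional[int]:
    """Pick the smallest common width that is at least `wanted`.

    Returns None when no common width fits below the original image
    width, in which case the caller would use the original file.
    """
    for width in sorted(common_widths):
        if wanted <= width < original_width:
            return width
    return None


print(pick_thumb_width(600, 3000))
print(pick_thumb_width(600, 500))
```

The second call illustrates the exception noted above: when the image is smaller than the requested size, no thumbnail width is returned.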

It's possible to use Special:Redirect or thumb.php to get the thumbnail/URL, but both are actually PHP scripts that need running. So while perhaps not ideal, it seems to make the most sense here to generate the thumbnail URLs ourselves and hit the web server directly.

Sincerely, Jonas

[1] https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FMultimediaViewer/e7e...


Gergo Tisza

2:43 p.m.

New subject: [Commons-l] Hashing Wikimedia Commons

On Fri, Sep 5, 2014 at 10:21 AM, Jonas Öberg jonas@commonsmachinery.se wrote:

...

It's possible to use Special:Redirect or thumb.php to get the thumbnail/URL, but both are actually PHP scripts that need running. So while perhaps not ideal, it seems to make the most sense here to generate the thumbnail URLs ourselves and hit the web server directly.

That can work if you don't mind getting errors in some % of cases where the file format would require a more complex URL scheme. Otherwise, you have three options:

- just use Special:Redirect. Depending on your request frequency, it might be fine. We can ask ops what speed limit would be reasonable; for bots using the API, the general recommendation is 12 requests per minute.
- scrape file description pages. The HTML page is cached in Varnish and it has links to various standard image sizes, so you won't hit PHP this way; of course, HTML scraping is not the most reliable way of retrieving data.
- use the API in batches. You can retrieve the information (including thumbnail URL) for 500 files in a single request (5000 if you get a bot flag):

https://en.wikipedia.org/w/api.php?format=jsonfm&action=query&titles...

IMO the last option is the cleanest one.
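A minimal Python sketch of the batching idea follows. Because the example URL above is truncated, the query shape here is an assumption: it passes an explicit list of titles joined with "|". The batch size is left as a parameter, since the per-request limit depends on how the query is built and on account rights; to my knowledge an explicit titles list is capped lower than 500 for regular accounts, so the default below is a conservative 50.

```python
from typing import Dict, Iterator, Sequence


def batched_imageinfo_params(filenames: Sequence[str], width: int = 640,
                             batch_size: int = 50) -> Iterator[Dict[str, str]]:
    """Yield one set of API parameters per batch of file names.

    Assumption: explicit `titles` list with prop=imageinfo; the batch
    size must not exceed whatever limit applies to your account.
    """
    for start in range(0, len(filenames), batch_size):
        batch = filenames[start:start + batch_size]
        yield {
            "action": "query",
            "format": "json",
            "prop": "imageinfo",
            "iiprop": "url",
            "iiurlwidth": str(width),
            "titles": "|".join(f"File:{name}" for name in batch),
        }


names = [f"Example_{i}.jpg" for i in range(120)]
print(len(list(batched_imageinfo_params(names))))
```

Yielding parameter dictionaries rather than finished URLs leaves it open to send them as POST data, which avoids very long query strings.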

Jonas Öberg

2:54 p.m.

New subject: [Commons-l] Hashing Wikimedia Commons

Hi Gergo,

...

That can work if you don't mind getting errors in some % of cases where the file format would require a more complex URL scheme.

I forgot an important aspect, which is that at this point we're only concerned with the JPG and PNG formats, which I suppose should be fairly simple.

...

use the API in batches. You can retrieve the information (including thumbnail URL) for 500 files in a single request (5000 if you get a bot flag):

That's a neat idea - I didn't know that the API took multiple file names in one query. If we could do 500 files per request, 10-12 per minute, that's more than adequate - but it feels like this is something we should talk to ops about to validate?

Sincerely, Jonas

Gergo Tisza

4:25 p.m.

New subject: [Commons-l] Hashing Wikimedia Commons

On Fri, Sep 5, 2014 at 1:54 PM, Jonas Öberg jonas@commonsmachinery.se wrote:

...

That's a neat idea - I didn't know that the API took multiple file names in one query. If we could do 500 files per request, 10-12 per minute, that's a more than adequate - but it feels that this is something that we should be talking to ops about to validate?

12 per minute is the default setting for the standard bot framework, Pywikibot (https://www.mediawiki.org/wiki/Manual:Pywikibot); there are lots of bots doing processing at that speed with the maximum allowed item limit. I don't think you need to ask anyone before doing that. If you want to be extra nice, you can use the maxlag parameter (https://www.mediawiki.org/wiki/Manual:Maxlag_parameter).
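The pacing suggested here can be sketched in Python as a small generator that spaces requests out and attaches a maxlag value. The function name, the injectable sleep and clock arguments, and the choice of maxlag=5 are assumptions for illustration; a real client would also need to back off and retry when the API reports a maxlag error.

```python
import time
from typing import Callable, Dict, Iterable, Iterator

REQUESTS_PER_MINUTE = 12  # the rate mentioned in the thread


def throttled(param_sets: Iterable[Dict[str, object]],
              per_minute: int = REQUESTS_PER_MINUTE,
              maxlag: int = 5,
              sleep: Callable[[float], None] = time.sleep,
              clock: Callable[[], float] = time.monotonic,
              ) -> Iterator[Dict[str, object]]:
    """Yield parameter sets no faster than `per_minute`, adding maxlag."""
    min_interval = 60.0 / per_minute
    last = None
    for params in param_sets:
        if last is not None:
            # Wait out whatever remains of the minimum interval.
            wait = min_interval - (clock() - last)
            if wait > 0:
                sleep(wait)
        last = clock()
        yield {**params, "maxlag": maxlag}
```

The caller sends each yielded parameter set as one API request; at 12 requests per minute that works out to one request every five seconds.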
