commons.wikimedia.org allowing directory indexes and web robots


Alexandre Dulaunoy

18 Jul 2009

3:50 p.m.

Hi All,

Commons.wikimedia.org is growing and provides a fairly complete set of media files, including a lot of interesting historical documents. Contributors rely on the availability and persistence of commons.wikimedia.org, but currently the full export is only available on download.wikimedia.org (OK, not today ;-).

I was wondering if it would be possible to allow web robots to access http://upload.wikimedia.org/wikipedia/commons/ to gather and mirror the media files. As this is pure HTTP, the mirroring could benefit from HTTP's object caching mechanisms (instead of relying on a large dump containing all the media files, which is more difficult to cache and update).
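A minimal sketch of the HTTP caching idea mentioned above, assuming a mirror that stores the `ETag` and `Last-Modified` values from its previous fetch of each file. The function names and the sample values are hypothetical; `If-None-Match` and `If-Modified-Since` are the standard HTTP conditional request headers, and a 304 response means the mirrored copy is still current.

```python
from typing import Dict, Optional


def conditional_headers(etag: Optional[str] = None,
                        last_modified: Optional[str] = None) -> Dict[str, str]:
    """Build request headers that let the server answer 304 Not Modified."""
    headers: Dict[str, str] = {}
    if etag:
        # Validator saved from the ETag header of the previous response.
        headers["If-None-Match"] = etag
    if last_modified:
        # Date saved from the Last-Modified header of the previous response.
        headers["If-Modified-Since"] = last_modified
    return headers


def needs_download(status_code: int) -> bool:
    """A 304 status means the local copy can be kept as-is."""
    return status_code != 304
```

A mirror would send these headers with each request, using whichever HTTP client it prefers, and skip the transfer whenever `needs_download` returns `False`.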

Maybe this could allow a more distributed backup approach to ensure the resilience of commons.wikimedia.org?

Thanks a lot for your work,

adulau

--
-- Alexandre Dulaunoy (adulau) -- http://www.foo.be/
-- http://www.foo.be/cgi-bin/wiki.pl/Diary
-- "Knowledge can create problems, it is not through ignorance
-- that we can solve them" Isaac Asimov


David Gerard

18 Jul

4:20 p.m.

New subject: commons.wikimedia.org allowing directory indexes and web robots

2009/7/18 Alexandre Dulaunoy a@foo.be:

...

I was wondering if it would be possible to allow web robots to access http://upload.wikimedia.org/wikipedia/commons/ to gather and mirror the media files. As this is pure HTTP, the mirroring could benefit from HTTP's object caching mechanisms (instead of relying on a large dump containing all the media files, which is more difficult to cache and update).

I see lots of files on upload.wikimedia.org on Google Image Search already. Is that actually forbidden by our robots.txt?

It'd actually be better if Google properly indexed text pages whose name ends in .jpg or whatever ... but they're aware we'd like that, so it's up to them.

- d.

Alexandre Dulaunoy

4:29 p.m.

On Sat, Jul 18, 2009 at 3:20 PM, David Gerard dgerard@gmail.com wrote:

...

2009/7/18 Alexandre Dulaunoy a@foo.be:

...
I was wondering if it would be possible to allow web robots to access http://upload.wikimedia.org/wikipedia/commons/ to gather and mirror the media files. As this is pure HTTP, the mirroring could benefit from HTTP's object caching mechanisms (instead of relying on a large dump containing all the media files, which is more difficult to cache and update).

I see lots of files on upload.wikimedia.org on Google Image Search already. Is that actually forbidden by our robots.txt?

It'd actually be better if Google properly indexed text pages whose name ends in .jpg or whatever ... but they're aware we'd like that, so it's up to them.

But the current directory listing (upload dir) is disallowed, for example:

http://upload.wikimedia.org/wikipedia/commons/8/8c/

Of course, the bot will be able to get the media files by following the links from the other pages, but this is not a very handy or effective way to make an exact mirror of just the current media files repository.

Would it be possible to enable directory listing of http://upload.wikimedia.org/wikipedia/commons and its subdirectories?

Thanks for the feedback,

Robert Rohde

4:34 p.m.

On Sat, Jul 18, 2009 at 6:20 AM, David Gerard dgerard@gmail.com wrote:

...

2009/7/18 Alexandre Dulaunoy a@foo.be:

...
I was wondering if it would be possible to allow web robots to access http://upload.wikimedia.org/wikipedia/commons/ to gather and mirror the media files. As this is pure HTTP, the mirroring could benefit from HTTP's object caching mechanisms (instead of relying on a large dump containing all the media files, which is more difficult to cache and update).

I see lots of files on upload.wikimedia.org on Google Image Search already. Is that actually forbidden by our robots.txt?

It'd actually be better if Google properly indexed text pages whose name ends in .jpg or whatever ... but they're aware we'd like that, so it's up to them.

Which is why my personal wiki is patched to translate the ".jpg" into "_jpg", etc. for all references to image description pages.

-Robert Rohde

David Gerard

4:55 p.m.

2009/7/18 Robert Rohde rarohde@gmail.com:

...

On Sat, Jul 18, 2009 at 6:20 AM, David Gerard dgerard@gmail.com wrote:

...

...
It'd actually be better if Google properly indexed text pages whose name ends in .jpg or whatever ... but they're aware we'd like that, so it's up to them.

...

Which is why my personal wiki is patched to translate the ".jpg" into "_jpg", etc. for all references to image description pages.

Hmmmmmmmmmmmmmmmm. How much hacking would MediaWiki on Wikimedia need for _jpg to be the default image page name and .jpg an alias for backward compatibility? That'd be really helpful in all sorts of ways - on pretty much any website *not* running MediaWiki, something ending ".jpg" is going to be the image, not a text page.

- d.

Dmitriy Sintsov

20 Jul

9:20 a.m.

* David Gerard dgerard@gmail.com [Sat, 18 Jul 2009 14:55:28 +0100]:

...

2009/7/18 Robert Rohde rarohde@gmail.com:

...
On Sat, Jul 18, 2009 at 6:20 AM, David Gerard dgerard@gmail.com wrote:

...
It'd actually be better if Google properly indexed text pages whose name ends in .jpg or whatever ... but they're aware we'd like that, so it's up to them.

...
Which is why my personal wiki is patched to translate the ".jpg" into "_jpg", etc. for all references to image description pages.

Hmmmmmmmmmmmmmmmm. How much hacking would MediaWiki on Wikimedia need for _jpg to be the default image page name and .jpg an alias for backward compatibility? That'd be really helpful in all sorts of ways - on pretty much any website *not* running MediaWiki, something ending ".jpg" is going to be the image, not a text page.

I am not sure that the underscore is the most suitable character, because in MediaWiki it's interchangeable with the space character. The type of the document should be determined by its MIME type. If Google uses the web path "extension" (which is meaningless, by the way, because that's a virtual path) instead of the MIME type to determine whether the page should be indexed, that's an amazing bug for Google.

Dmitriy

David Gerard

12:08 p.m.

2009/7/20 Dmitriy Sintsov questpc@rambler.ru:

...

David Gerard dgerard@gmail.com [Sat, 18 Jul 2009 14:55:28 +0100]:

...

...
Hmmmmmmmmmmmmmmmm. How much hacking would MediaWiki on Wikimedia need for _jpg to be the default image page name and .jpg an alias for backward compatibility? That'd be really helpful in all sorts of ways - on pretty much any website *not* running MediaWiki, something ending ".jpg" is going to be the image, not a text page.

...

I am not sure that the underscore is the most suitable character, because in MediaWiki it's interchangeable with the space character.

Or whatever, as long as it doesn't end in .jpg.

...

The type of the document should be determined by its MIME type. If Google uses the web path "extension" (which is meaningless, by the way, because that's a virtual path) instead of the MIME type to determine whether the page should be indexed, that's an amazing bug for Google.

Yes, it's an amazing bug for Google. It's also the way they do it.

- d.

Nikola Smolenski

12:45 p.m.

Dmitriy Sintsov wrote:

...

because in MediaWiki it's interchangeable with the space character. The type of the document should be determined by its MIME type. If Google uses the web path "extension" (which is meaningless, by the way, because that's a virtual path) instead of the MIME type to determine whether the page should be indexed, that's an amazing bug for Google.

It's a necessary evil, however, because a number of servers serve incorrect MIME types. IIRC, Google previously didn't index our images at all, but later added MediaWiki as an exception.

Aryeh Gregor

21 Jul

1:15 a.m.

On Mon, Jul 20, 2009 at 6:20 AM, Dmitriy Sintsov questpc@rambler.ru wrote:

...

I am not sure that the underscore is the most suitable character, because in MediaWiki it's interchangeable with the space character. The type of the document should be determined by its MIME type. If Google uses the web path "extension" (which is meaningless, by the way, because that's a virtual path) instead of the MIME type to determine whether the page should be indexed, that's an amazing bug for Google.

Maybe they don't retrieve the page in the first place, because they don't want to waste bandwidth and processing time getting images. It would be rather a waste to send dozens or hundreds of HEAD requests on every Flickr page (or whatever) just to make sure that all those things ending in a suffix universally accepted to designate images really *are* images.

On Mon, Jul 20, 2009 at 9:45 AM, Nikola Smolenski smolensk@eunet.yu wrote:

...

It's a necessary evil, however, because a number of servers serve incorrect MIME types.

Well, that would make no difference if you actually downloaded the content, or the first handful of bytes. It's easy to *very* reliably distinguish binary image data from HTML if you get to look at the first several bytes of the file.
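As a rough illustration of that point, the sketch below classifies a response from its first bytes alone. The signature table is a small, hand-picked assumption (JPEG, PNG and GIF magic numbers only), and `sniff_content` is a hypothetical helper, not anything Google or MediaWiki is known to use.

```python
# Leading "magic" bytes for a few common image formats (illustrative subset).
IMAGE_SIGNATURES = {
    b"\xff\xd8\xff": "image/jpeg",
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
}


def sniff_content(head: bytes) -> str:
    """Guess the content type from the first few bytes of a response body."""
    for signature, mime in IMAGE_SIGNATURES.items():
        if head.startswith(signature):
            return mime
    # HTML normally starts with a doctype or an <html> tag, possibly after whitespace.
    text = head.lstrip().lower()
    if text.startswith((b"<!doctype html", b"<html")):
        return "text/html"
    return "unknown"
```

Only a handful of bytes is needed, so a crawler could make this decision after reading the very start of the body, whatever suffix the URL carries.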

Anyway, I think the "right" way to do this would be to omit the suffix from the page name entirely, treating the format as an implementation detail. That way you could, for instance, upload an SVG over a PNG or a PNG over a JPEG, and have all users be automatically updated without manually changing the references. This does get a little confusing when you consider totally different types of media, though, like audio or video or PDF or whatnot. If NS_FILE (NS_IMAGE) weren't hardcoded in thirty million places both in code and templates, I might suggest different namespaces for different media types instead of one unified File: namespace, but that seems impractical at this point.

Robert Rohde

20 Jul

5:46 p.m.

On Sat, Jul 18, 2009 at 6:55 AM, David Gerard dgerard@gmail.com wrote:

...

2009/7/18 Robert Rohde rarohde@gmail.com:

...
On Sat, Jul 18, 2009 at 6:20 AM, David Gerard dgerard@gmail.com wrote:

...
...
It'd actually be better if Google properly indexed text pages whose name ends in .jpg or whatever ... but they're aware we'd like that, so it's up to them.

...
Which is why my personal wiki is patched to translate the ".jpg" into "_jpg", etc. for all references to image description pages.

Hmmmmmmmmmmmmmmmm. How much hacking would MediaWiki on Wikimedia need for _jpg to be the default image page name and .jpg an alias for backward compatibility? That'd be really helpful in all sorts of ways - on pretty much any website *not* running MediaWiki, something ending ".jpg" is going to be the image, not a text page.

Honestly, I'm not entirely sure how large a hack it would be. In my particular case, the hack I added was in the link generator very late in the process and fairly ugly. Really, one should probably be modifying Title.php to change the URL form of the image description page name, but there may be unexpected dependencies associated with doing that. Also, in my hack I used Apache's mod_rewrite to get the right destination for incoming queries, but for a general application this should also be handled by Title.php or something similar.

As long as one requires files have explicit type suffixes (e.g. ".jpg", ".svg", etc), one can use the allowed list to determine what file names to translate without generating conflicts. I believe all Wikimedia sites require such suffixes, but MediaWiki can be configured to remove that requirement, which would need to be considered for a general application (i.e. what to do if the configuration allows separate files named "Foo_jpg" and "Foo.jpg").
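The allowed-list translation described above can be sketched roughly as follows. This is only an illustration: `ALLOWED_EXTENSIONS` is an assumed sample list, and `mask_extension` / `unmask_extension` are hypothetical helpers, not the actual patch or any MediaWiki function.

```python
# Assumed sample of permitted upload extensions; a real wiki would use its own list.
ALLOWED_EXTENSIONS = {"jpg", "jpeg", "png", "gif", "svg"}


def mask_extension(title: str) -> str:
    """Turn 'File:Foo.jpg' into 'File:Foo_jpg' for use in description-page URLs."""
    base, dot, ext = title.rpartition(".")
    if dot and base and ext.lower() in ALLOWED_EXTENSIONS:
        return f"{base}_{ext}"
    return title


def unmask_extension(url_title: str) -> str:
    """Map an incoming 'File:Foo_jpg' request back to 'File:Foo.jpg'."""
    base, sep, ext = url_title.rpartition("_")
    if sep and base and ext.lower() in ALLOWED_EXTENSIONS:
        return f"{base}.{ext}"
    return url_title
```

The reverse mapping is only unambiguous while every file must carry a listed suffix; if a file could legitimately be named "Foo_jpg", `unmask_extension` would send its requests to "Foo.jpg", which is the conflict raised in the message.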

I'd definitely like to see MediaWiki include a configuration option so that image description pages would handle the suffix differently, so maybe I'll think about it a bit more. This is one of a half dozen or so issues that I end up repatching on my local install every time I decide to upgrade.

-Robert Rohde

Mark Clements (HappyDog)

27 Jul

4:05 p.m.

"Robert Rohde" rarohde@gmail.com wrote in message news:b4da1c6e0907200746v26a1b024naeb43c2228b80837@mail.gmail.com...

...

On Sat, Jul 18, 2009 at 6:55 AM, David Gerard dgerard@gmail.com wrote:

...
Hmmmmmmmmmmmmmmmm. How much hacking would MediaWiki on Wikimedia need for _jpg to be the default image page name and .jpg an alias for backward compatibility? That'd be really helpful in all sorts of ways - on pretty much any website *not* running MediaWiki, something ending ".jpg" is going to be the image, not a text page.

[SNIP]

...

As long as one requires files have explicit type suffixes (e.g. ".jpg", ".svg", etc), one can use the allowed list to determine what file names to translate without generating conflicts. I believe all Wikimedia sites require such suffixes, but MediaWiki can be configured to remove that requirement, which would need to be considered for a general application (i.e. what to do if the configuration allows separate files named "Foo_jpg" and "Foo.jpg").

How about making the type a prefix? E.g. Image:jpg:Foo (or File:jpg:Foo). It would be a bit more work I suspect, but would retain the information that the extension gives as well as resolving the indexing problem which started this thread.

It would also be theoretically possible to use a one-to-many mapping here (so uploading Foo.jpg or Foo.jpeg results in the same File:jpg:Foo - though there might be an issue with naming conflicts here). Or go even more general, e.g. File:Image:Foo, File:Video:Foo, etc. All the name-conflict problems that would occur in any attempt to resolve the "changing image file format" problem would obviously apply here, but that might be better than dropping the type information given by an extension altogether.
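A small sketch of the prefix idea, assuming a lookup table that folds several extensions into one canonical type. The table contents and the `prefixed_title` helper are hypothetical and exist only to show the shape of the mapping.

```python
# Several upload extensions can share one canonical type prefix (assumed sample).
EXTENSION_TO_TYPE = {
    "jpg": "jpg",
    "jpeg": "jpg",
    "png": "png",
    "gif": "gif",
    "svg": "svg",
}


def prefixed_title(filename: str, namespace: str = "File") -> str:
    """Turn 'Foo.jpeg' into 'File:jpg:Foo'; leave unknown names untouched."""
    base, dot, ext = filename.rpartition(".")
    canonical = EXTENSION_TO_TYPE.get(ext.lower()) if dot and base else None
    if canonical is None:
        return f"{namespace}:{filename}"
    return f"{namespace}:{canonical}:{base}"
```

Because "Foo.jpg" and "Foo.jpeg" both land on `File:jpg:Foo`, two previously distinct uploads can now collide, which is the naming-conflict caveat noted in the message.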

- Mark Clements (HappyDog)

Brion Vibber

6:22 p.m.

On 7/27/09 6:05 AM, Mark Clements (HappyDog) wrote:

...

...
As long as one requires files have explicit type suffixes (e.g. ".jpg", ".svg", etc), one can use the allowed list to determine what file names to translate without generating conflicts. I believe all Wikimedia sites require such suffixes, but MediaWiki can be configured to remove that requirement, which would need to be considered for a general application (i.e. what to do if the configuration allows separate files named "Foo_jpg" and "Foo.jpg").

How about making the type a prefix?

Really there's no reason to expose the file type at all at this level; it's an implementation detail which shouldn't be forced onto the on-wiki identifier for a media item.

-- brion

Mark Clements (HappyDog)

6:46 p.m.

"Brion Vibber" brion@wikimedia.org wrote in message news:4A6DC653.9010003@wikimedia.org...

...

On 7/27/09 6:05 AM, Mark Clements (HappyDog) wrote:

...
...
As long as one requires files have explicit type suffixes (e.g. ".jpg", ".svg", etc), one can use the allowed list to determine what file names to translate without generating conflicts. I believe all Wikimedia sites require such suffixes, but MediaWiki can be configured to remove that requirement, which would need to be considered for a general application (i.e. what to do if the configuration allows separate files named "Foo_jpg" and "Foo.jpg").

How about making the type a prefix?

Really there's no reason to expose the file type at all at this level; it's an implementation detail which shouldn't be forced onto the on-wiki identifier for a media item.

This suggestion was to solve the problem of serving html documents that appear to have a non-html file extension (e.g. page names which end .jpg). This would provide a one-to-one mapping that is more sensible (imho) than replacing the final period with another character (underscore was suggested).

There is a separate issue of whether this information should be removed altogether, which in theory is a good idea, but leads to a practical problem of naming conflicts which has not yet been addressed to my knowledge (e.g. when "File:Foo.jpg" and "File:Foo.gif" both exist). If that could be resolved then yes, the file's type information would not be required (either as a file extension, or elsewhere). In this case though, my second suggestion ("File:Video:Foo", "File:Image:Bar") might be useful, as we probably still want to know what type of file we are embedding, even if we don't need to know the exact file format.

In the absence of a solution to the second problem, and in light of the fact that solutions to the first issue are currently being considered, I think my original suggestion is quite relevant, and has the added bonus of still being useful if/when the second problem is solved.

- Mark Clements (HappyDog)

Aryeh Gregor

7:47 p.m.

On Mon, Jul 27, 2009 at 11:46 AM, Mark Clements (HappyDog) gmane@kennel17.co.uk wrote:

...

There is a separate issue of whether this information should be removed altogether, which in theory is a good idea, but leads to a practical problem of naming conflicts which has not yet been addressed to my knowledge (e.g. when "File:Foo.jpg" and "File:Foo.gif" both exist).

We'd have to keep the existing page names working anyway to avoid breaking everything, so we could just use the new convention for new uploads. Then old files could be moved to appropriate names manually over time, with conflicts resolved manually.

...

If that could be resolved then yes, the file's type information would not be required (either as a file extension, or elsewhere). In this case though, my second suggestion ("File:Video:Foo", "File:Image:Bar") might be useful, as we probably still want to know what type of file we are embedding, even if we don't need to know the exact file format.

Maybe, but there are potentially a lot of very specific formats, like DjVu, PDF, document formats, spreadsheets, ... It might be simplest to just drop the format info totally and assume it won't cause big problems if the format isn't obvious from the name.

Robert Rohde

8:03 p.m.

On Mon, Jul 27, 2009 at 9:47 AM, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:

...

On Mon, Jul 27, 2009 at 11:46 AM, Mark Clements (HappyDog) gmane@kennel17.co.uk wrote:

...
There is a separate issue of whether this information should be removed altogether, which in theory is a good idea, but leads to a practical problem of naming conflicts which has not yet been addressed to my knowledge (e.g. when "File:Foo.jpg" and "File:Foo.gif" both exist).

We'd have to keep the existing page names working anyway to avoid breaking everything, so we could just use the new convention for new uploads. Then old files could be moved to appropriate names manually over time, with conflicts resolved manually.

<snip>

Forgive me, but that seems like you'd be asking the community to do a huge amount of work (moving images and updating [[File:]] calls) in order to address a problem that could be solved on purely technical grounds.

At least, that is, if we agree that the problem is principally having "misleading" file extensions in urls for HTML content. http://en.wikipedia.org/wiki/File:Foo.jpg could be translated into any number of things through a completely unambiguous one-to-one mapping that would remove or mask the ".jpg" extension. That is something I would like to see and encourage.

However, if the "solution" is to manually rename everything to an extension-less structure, then I would be opposed to that. It is more trouble than it is worth, and does little to benefit the existing wikis owned by Wikimedia or those controlled by third parties. Personally, I think it is actually a good thing that files have file-like nomenclature in general. It seems less confusing for uploaders that way. I'd prefer the current nomenclature be preserved but some additional system of naming, minus the confusing extensions, be placed on top as the default.

-Robert Rohde

Aryeh Gregor

8:09 p.m.

On Mon, Jul 27, 2009 at 1:03 PM, Robert Rohde rarohde@gmail.com wrote:

...

Forgive me, but that seems like you'd be asking the community to do a huge amount of work (moving images and updating [[File:]] calls) in order to address a problem that could be solved on purely technical grounds.

Well, we could automatically move everything to the new names and leave redirects, and only leave conflicts to be manually resolved.

...

At least, that is, if we agree that the problem is principally having "misleading" file extensions in urls for HTML content.

I don't think that's the only problem we should be solving here. We should also allow an image in one format to be replaced by an image in another format without changing the name. That requires getting rid of the extensions entirely. (Allowing an image to be replaced by a video or such, however, wouldn't make much sense.)

Robert Rohde

8:39 p.m.

On Mon, Jul 27, 2009 at 10:09 AM, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:

...

On Mon, Jul 27, 2009 at 1:03 PM, Robert Rohde rarohde@gmail.com wrote:

...
Forgive me, but that seems like you'd be asking the community to do a huge amount of work (moving images and updating [[File:]] calls) in order to address a problem that could be solved on purely technical grounds.

Well, we could automatically move everything to the new names and leave redirects, and only leave conflicts to be manually resolved.

Last I checked image moves weren't actually working and I thought image redirects were disabled as well, though I could be mistaken. Those are technical issues that it would be good to solve for their own reasons though.

However, if redirects work in the traditional way, then it wouldn't solve my problem. Namely File:Foo.jpg might draw its content from File:Foo, but it still lives at a url for File:Foo.jpg. In order to avoid the extensions in urls you need to change where the links actually go, which at the present time requires changing each actual call.

Beyond that, it strikes me that it would be very hard to do the kind of automatic resolution you have in mind without breaking things. You can arguably do it on a single wiki, but with Commons in the mix it gets considerably harder. If Commons has Foo.jpg and Enwiki has Foo.gif, then who gets to live at File:Foo? Either you have to check for conflicts across all wikis or you are likely to end up with at least some wikis with unexpected links.

...

...
At least, that is, if we agree that the problem is principally having "misleading" file extensions in urls for HTML content.

I don't think that's the only problem we should be solving here. We should also allow an image in one format to be replaced by an image in another format without changing the name. That requires getting rid of the extensions entirely. (Allowing an image to be replaced by a video or such, however, wouldn't make much sense.)

...

From my point of view that's a much less annoying bug than the link formatting one. Not to mention that there are cases when it is beneficial to explicitly provide different file formats for the same material (for example if an SVG renders poorly on the WMF system).

They aren't antagonistic proposals though. One could make changes that allow extension agnostic file names, e.g. File:Foo, while also coming up with an automatic way to hide file extensions on existing works regardless of whether they are moved/redirected. Any reason not to allow both? As mentioned earlier in the thread, I've been patching my own wikis to mask extensions for years.

-Robert Rohde

Brion Vibber

28 Jul

12:16 a.m.

On 7/27/09 10:39 AM, Robert Rohde wrote:

...

On Mon, Jul 27, 2009 at 10:09 AM, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:

...
On Mon, Jul 27, 2009 at 1:03 PM, Robert Rohde rarohde@gmail.com wrote:

...
Forgive me, but that seems like you'd be asking the community to do a huge amount of work (moving images and updating [[File:]] calls) in order to address a problem that could be solved on purely technical grounds.

Well, we could automatically move everything to the new names and leave redirects, and only leave conflicts to be manually resolved.

Last I checked image moves weren't actually working and I thought image redirects were disabled as well, though I could be mistaken. Those are technical issues that it would be good to solve for their own reasons though.

Image redirects are quite active. Renames were re-disabled due to breakage with images which had missing past versions (eg, a lot in production) -- which I think has been fixed to handle this case cleanly.

Anyway, don't consider that an impediment.

...

However, if redirects work in the traditional way, then it wouldn't solve my problem. Namely File:Foo.jpg might draw its content from File:Foo, but it still lives at a url for File:Foo.jpg. In order to avoid the extensions in urls you need to change where the links actually go, which at the present time requires changing each actual call.

You wouldn't care if anybody indexed File:Foo.jpg, since the content would be indexed at File:Foo.

...

Beyond that, it strikes me that it would be very hard to do the kind of automatic resolution you have in mind without breaking things. You can arguably do it on a single wiki, but with Commons in the mix it gets considerably harder. If Commons has Foo.jpg and Enwiki has Foo.gif, then who gets to live at File:Foo? Either you have to check for conflicts across all wikis or you are likely to end up with at least some wikis with unexpected links.

This is hardly an insurmountable problem; automated renames can easily detect the existence of such conflicts and either leave them for eventual manual attention or give them disambiguating suffixes.
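A rough sketch of that kind of conflict detection, assuming the combined list of file names from all wikis sharing a repository is available as plain strings. `plan_renames` is a hypothetical helper; it only separates safe renames from collisions and leaves the choice between manual attention and disambiguating suffixes open.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def plan_renames(filenames: Iterable[str]) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Split names into safe extension-less renames and conflicting groups."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in filenames:
        base, dot, _ext = name.rpartition(".")
        # Names without an extension already occupy their own base name.
        groups[base if dot and base else name].append(name)

    renames: Dict[str, str] = {}
    conflicts: Dict[str, List[str]] = {}
    for base, names in groups.items():
        if len(names) == 1:
            renames[names[0]] = base
        else:
            # More than one file wants the same base name: leave for later handling.
            conflicts[base] = sorted(names)
    return renames, conflicts
```

For the Commons/Enwiki case raised earlier, "Foo.jpg" and "Foo.gif" would both appear under the conflict key "Foo" while unrelated files are renamed automatically.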

...

They aren't antagonistic proposals though. One could make changes that allow extension agnostic file names, e.g. File:Foo, while also coming up with an automatic way to hide file extensions on existing works regardless of whether they are moved/redirected. Any reason not to allow both?

There's no particular reason to do the latter when its results are equivalent to the former.

-- brion

Aryeh Gregor

1:38 a.m.

On Mon, Jul 27, 2009 at 1:39 PM, Robert Rohde rarohde@gmail.com wrote:

...

Last I checked image moves weren't actually working and I thought image redirects were disabled as well, though I could be mistaken. Those are technical issues that it would be good to solve for their own reasons though.

Well, I was actually thinking that in this case we could do a proper 301 if you try directly visiting the page, and actually change all generated links. Since upload of new files under names with extensions would be forbidden, the redirect would be immutable and there would be no need to support the redirect notice.

...

Beyond that, it strikes me that it would be very hard to do the kind of automatic resolution you have in mind without breaking things. You can arguably do it on a single wiki, but with Commons in the mix it gets considerably harder. If Commons has Foo.jpg and Enwiki has Foo.gif, then who gets to live at File:Foo? Either you have to check for conflicts across all wikis or you are likely to end up with at least some wikis with unexpected links.

We'd have to check for conflicts across wikis, sure.

...

From my point of view that's a much less annoying bug than the link formatting one.

My opinion is the opposite. The issue with indexing isn't a bug on our side at all, it's a deficiency with how Google indexes pages. If Google doesn't want to needlessly retrieve zillions of images and needs a hint that we're linking to an HTML page, then the correct fix on our side would be to do

and then find some Googlers to poke with pointy sticks if they don't respect the type="" attribute. We could do that immediately, in fact. I'm sure they'd be happy to remove their special-case code. (I really wish they'd talk to us about things like this instead of trying to hack around our less-than-ideal behavior . . .)
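The markup that followed "would be to do" is not preserved in this copy of the message. Purely as an illustration of the `type=""` hint being discussed, and not a reconstruction of the original example, the sketch below emits a link whose `type` attribute advertises an HTML target. `description_page_link` is a hypothetical helper; the `type` attribute on `<a>` is standard HTML and is only advisory.

```python
from html import escape


def description_page_link(href: str, label: str) -> str:
    """Render a link that hints its target is an HTML page, not an image."""
    return (
        f'<a href="{escape(href, quote=True)}" type="text/html">'
        f"{escape(label)}</a>"
    )


print(description_page_link("/wiki/File:Foo.jpg", "Foo.jpg"))
```

Whether a crawler honours the hint is up to the crawler, which is the point of the remark about poking Googlers.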

On the other hand, having the file format be part of the page name is a pain in the neck.

...

Not to mention that there are cases when it is beneficial to explicitly provide different file formats for the same material (for example if an SVG renders poorly on the WMF system).

Then they could just be at different names, so nothing's lost. On the other hand it's very common for people to upload things that should be PNG as JPEG, or things that should be SVG as PNG/JPEG, and currently we have to rename. Plus we can currently have Foo.jpg and Foo.jpeg and Foo.JPG and Foo.JPEG and Foo.png and Foo.PNG and Foo.svg and Foo.SVG, or whatever, which is unreasonable.

Brion Vibber

27 Jul

9:12 p.m.

On 7/27/09 10:03 AM, Robert Rohde wrote:

...

Forgive me, but that seems like you'd be asking the community to do a huge amount of work (moving images and updating [[File:]] calls) in order to address a problem that could be solved on purely technical grounds.

There's no technical need to change anything; ***Google already knows what our site looks like and indexes our image pages just fine since years ago***.

We're just talking about what would look nicer going forward, which would be to do things more sanely and not spam a file extension onto the on-wiki page name when it's really not necessary.

-- brion

Robert Rohde

9:37 p.m.

On Mon, Jul 27, 2009 at 11:12 AM, Brion Vibber brion@wikimedia.org wrote:

...

On 7/27/09 10:03 AM, Robert Rohde wrote:

...
Forgive me, but that seems like you'd be asking the community to do a huge amount of work (moving images and updating [[File:]] calls) in order to address a problem that could be solved on purely technical grounds.

There's no technical need to change anything; ***Google already knows what our site looks like and indexes our image pages just fine since years ago***.

Google indexes the WMF this way. ***They do not index third party Mediawiki sites this way.***

Compare:

http://www.google.com/search?q=file:*.jpg+site:wikimedia.org

which shows no end of *.jpg image description pages on Commons, to

http://www.google.com/search?q=file:*.jpg+site:mediawiki.org
http://www.google.com/search?q=file:*.jpg+site:memory-alpha.org
http://www.google.com/search?q=file:*.jpg+site:stargate.wikia.com

which show no *.jpg image description pages on mediawiki.org, memory-alpha.org, or stargate.wikia.com.

So congratulations Google treats the WMF special, but the rest of the Mediawiki user base, myself included, still have a problem that we would like to see solved.

-Robert Rohde

Brion Vibber

28 Jul

12:11 a.m.

On 7/27/09 11:37 AM, Robert Rohde wrote:

...

Google indexes the WMF this way. ***They do not index third party Mediawiki sites this way.***

[snip]

...

So congratulations Google treats the WMF special, but the rest of the Mediawiki user base, myself included, still have a problem that we would like to see solved.

Feel free to rename your files as you like once sane naming is supported natively. :)

-- brion

Robert Rohde

22 Jul

10:44 a.m.

On Sat, Jul 18, 2009 at 6:55 AM, David Gerard dgerard@gmail.com wrote:

...

2009/7/18 Robert Rohde rarohde@gmail.com:

...
On Sat, Jul 18, 2009 at 6:20 AM, David Gerard dgerard@gmail.com wrote:

...
...
It'd actually be better if Google properly indexed text pages whose name ends in .jpg or whatever ... but they're aware we'd like that, so it's up to them.

...
Which is why my personal wiki is patched to translate the ".jpg" into "_jpg", etc. for all references to image description pages.

Hmmmmmmmmmmmmmmmm. How much hacking would MediaWiki on Wikimedia need for _jpg to be the default image page name and .jpg an alias for backward compatibility? That'd be really helpful in all sorts of ways - on pretty much any website *not* running MediaWiki, something ending ".jpg" is going to be the image, not a text page.

I've created bug:19874 for this enhancement request. As it is of personal utility to me, I may also work on writing a patch, though probably not in the near term.

-Robert Rohde


wikitech-l@lists.wikimedia.org


participants (8)

Alexandre Dulaunoy
Aryeh Gregor
Brion Vibber
David Gerard
Dmitriy Sintsov
Mark Clements (HappyDog)
Nikola Smolenski
Robert Rohde