Re: Upload URL/filesystem restructuring


Nick Jenkins

24 Oct 2005

midnight

Brion Vibber wrote:

...

One possibility is to embed the timestamp into the URL. So the goatse version might be: http://upload.wikimedia.org/wikipedia/en/2005/10/23/074223/Puppy.jpg

and the reverted image would get a different URL, a few minutes later: http://upload.wikimedia.org/wikipedia/en/2005/10/23/074506/Puppy.jpg

Alternative but very similar idea would be to embed the revision number in the URL, instead of the upload timestamp:

Example original: http://upload.wikimedia.org/wikipedia/en/P/1/Puppy.jpg

Example revised: http://upload.wikimedia.org/wikipedia/en/P/2/Puppy.jpg

Then internally there needs to be some translation/lookup table from image name --> current revision number, as opposed to a lookup table for image name --> upload date. (Integers are smaller than dates, so small memory saving perhaps).

Possibility of a very very very small bandwidth saving from slightly shorter URLs.

Maybe it also helps if two people upload Puppy.jpg at the exact same second (not sure what happens in a timestamp system that's only accurate to the second when this happens, but in a revision number system one is always going to be first, even by a few microseconds).

Lastly, it's easy for a human with the URL to see what revisions come before/after by incrementing/decrementing the digit in the URL, whereas the date and time of the upload of a previous revision cannot be predicted just from the image name.

All other benefits as per timestamp system, I think.
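
For illustration, a minimal Python sketch of the two URL schemes discussed above. The function names, the `current_revision` table, and the reading of "/P/" as the upper-cased first letter of the file name are assumptions made for this example; none of it is MediaWiki code.

```python
from datetime import datetime

BASE = "http://upload.wikimedia.org/wikipedia/en"


def timestamp_url(name, uploaded):
    """Timestamp scheme: .../YYYY/MM/DD/HHMMSS/Name (one-second resolution)."""
    return f"{BASE}/{uploaded:%Y/%m/%d/%H%M%S}/{name}"


def revision_url(name, revision):
    """Revision scheme: .../<first letter>/<revision>/Name.

    Assumption: the single-letter directory is the upper-cased first
    character of the file name.
    """
    return f"{BASE}/{name[0].upper()}/{revision}/{name}"


# Hypothetical lookup table: image name -> current revision number.
current_revision = {}


def record_upload(name):
    """Bump the per-image revision counter and return the new URL."""
    current_revision[name] = current_revision.get(name, 0) + 1
    return revision_url(name, current_revision[name])


# Two uploads landing in the same second collide under the timestamp
# scheme but still get distinct URLs under the revision scheme.
same_second = datetime(2005, 10, 23, 7, 42, 23)
ts_first = timestamp_url("Puppy.jpg", same_second)
ts_second = timestamp_url("Puppy.jpg", same_second)
rev_first = record_upload("Puppy.jpg")
rev_second = record_upload("Puppy.jpg")
```

Here `ts_first` and `ts_second` are identical, while `rev_first` and `rev_second` end in `/P/1/Puppy.jpg` and `/P/2/Puppy.jpg`, which is the same-second case raised above.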

All the best, Nick.


Brion Vibber

24 Oct

6:57 a.m.

New subject: Upload URL/filesystem restructuring

Nick Jenkins wrote:

...

Brion Vibber wrote:

...
One possibility is to embed the timestamp into the URL. So the goatse version might be: http://upload.wikimedia.org/wikipedia/en/2005/10/23/074223/Puppy.jpg

and the reverted image would get a different URL, a few minutes later: http://upload.wikimedia.org/wikipedia/en/2005/10/23/074506/Puppy.jpg

Alternative but very similar idea would be to embed the revision number in the URL, instead of the upload timestamp:

Example original: http://upload.wikimedia.org/wikipedia/en/P/1/Puppy.jpg

Example revised: http://upload.wikimedia.org/wikipedia/en/P/2/Puppy.jpg

We had a lively discussion in #wikimedia-tech on this subject; as well as the revision ID numbers, another possibility discussed was using a content hash.

A content hash has the additional advantage that duplicate file versions only need to be stored once; for instance currently when reverting a file it makes a new copy of the file on the filesystem, which wastes space. (However you then need to be careful about deleting.)

So you might have something like: http://upload.wikimedia.org/584/590/5845907fdfc6eb1125129c4ce0da0704c496a7e4...

Obviously a disadvantage is that the filenames are ugly. One might tack a 'pretty' but ignored filename on the end, using rewrites or whatever tool to drop it on the backend:

http://upload.wikimedia.org/584/590/5845907fdfc6eb1125129c4ce0da0704c496a7e4...

This does, though, complicate the server configuration; I think a goal should be making it very easy to set up a file mirror that we can actually send requests to. Arbitrary filename additions may also have security implications for broken browsers like Internet Explorer, which like to interpret filetype information out of the "extension" on the URL.
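
As a rough sketch of the content-hash idea, the Python below stores each distinct file body once and keeps a reference count so that deleting one revision does not remove content another revision still uses. The `ContentStore` class and `hash_path` helper are hypothetical names for this example. SHA-1 is an assumption (the 40-hex-digit example hash is the right length for it), and the `/584/590/` layout is read off the example URL as the first two groups of three hex digits.

```python
import hashlib


class ContentStore:
    """In-memory stand-in for a content-addressed file store."""

    def __init__(self):
        self.blobs = {}     # hex digest -> file bytes, stored only once
        self.refcount = {}  # hex digest -> number of revisions using it

    def add_revision(self, data):
        """Register a revision; identical content reuses the stored blob."""
        digest = hashlib.sha1(data).hexdigest()  # SHA-1 is an assumption
        if digest not in self.blobs:
            self.blobs[digest] = data
        self.refcount[digest] = self.refcount.get(digest, 0) + 1
        return digest

    def delete_revision(self, digest):
        """Drop one reference; remove the blob only when none remain."""
        self.refcount[digest] -= 1
        if self.refcount[digest] == 0:
            del self.refcount[digest]
            del self.blobs[digest]


def hash_path(digest, ext):
    """Build a path like /584/590/<digest><ext> from a hex digest."""
    return f"/{digest[:3]}/{digest[3:6]}/{digest}{ext}"


store = ContentStore()
original = store.add_revision(b"original image bytes")
vandalised = store.add_revision(b"vandalised image bytes")
reverted = store.add_revision(b"original image bytes")  # revert to original
```

After these three uploads `reverted` equals `original` and only two blobs are stored. Deleting the first revision with `store.delete_revision(original)` leaves the blob in place because the reverted revision still references it, which is the "careful about deleting" caveat.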

...

Lastly, it's easy for a human with the URL to see what revisions come before/after by incrementing/decrementing the digit in the URL, whereas the date and time of the upload of a previous revision cannot be predicted just from the image name.

That might be kind of neat, but requires maintaining a consistent revision sequence _within_ each image. If using revision numbers, it's easier to work with the global row id numbers as the database can guarantee their uniqueness.
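
To make the trade-off concrete, here is a small Python sketch using SQLite as a stand-in database. The `image_revision` table and `upload` function are hypothetical, not the MediaWiki schema. The database hands out one global, unique row id per upload, so nothing has to maintain a gap-free sequence inside each image.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE image_revision ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"  # global id, unique by construction
    " name TEXT NOT NULL)"
)


def upload(name):
    """Insert one upload row and return the id the database assigned."""
    cur = db.execute("INSERT INTO image_revision (name) VALUES (?)", (name,))
    db.commit()
    return cur.lastrowid


ids = [upload("Puppy.jpg"), upload("Kitten.jpg"), upload("Puppy.jpg")]

# All ids that belong to one image, in upload order.
puppy_ids = [
    row[0]
    for row in db.execute(
        "SELECT id FROM image_revision WHERE name = ? ORDER BY id",
        ("Puppy.jpg",),
    )
]
```

The ids are unique across all images, but the ids for Puppy.jpg are no longer consecutive, so the increment-the-digit trick is lost with global ids.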

-- brion vibber (brion @ pobox.com)

Tels

5:02 p.m.

New subject: Upload URL/filesystem restructuring


Moin,

On Monday 24 October 2005 08:57, Brion Vibber wrote:

...

Nick Jenkins wrote:

...
Brion Vibber wrote:

...
One possibility is to embed the timestamp into the URL. So the goatse version might be: http://upload.wikimedia.org/wikipedia/en/2005/10/23/074223/Puppy.jpg

and the reverted image would get a different URL, a few minutes later: http://upload.wikimedia.org/wikipedia/en/2005/10/23/074506/Puppy.jpg

Alternative but very similar idea would be to embed the revision number in the URL, instead of the upload timestamp:

Example original: http://upload.wikimedia.org/wikipedia/en/P/1/Puppy.jpg

Example revised: http://upload.wikimedia.org/wikipedia/en/P/2/Puppy.jpg

We had a lively discussion in #wikimedia-tech on this subject; as well as the revision ID numbers, another possibility discussed was using a content hash.

A content hash has the additional advantage that duplicate file versions only need to be stored once; for instance currently when reverting a file it makes a new copy of the file on the filesystem, which wastes space. (However you then need to be careful about deleting.)

So you might have something like: http://upload.wikimedia.org/584/590/5845907fdfc6eb1125129c4ce0da0704c496a7e4.jpg

Obviously a disadvantage is that the filenames are ugly. One might tack a 'pretty' but ignored filename on the end, using rewrites or whatever tool to drop it on the backend:

http://upload.wikimedia.org/584/590/5845907fdfc6eb1125129c4ce0da0704c496a7e4/Puppy.jpg

Which is still very human-unfriendly. I couldn't remember this URL even if my life depended on it!

I rather like

http://upload.wikimedia.org/wikipedia/en/P/2/Puppy.jpg

although I am not sure why the "/P/" needs to be visible to the user (it is deterministic, after all), and it would be handy to have a "latest" revision URL. Which could be just:

http://upload.wikimedia.org/wikipedia/en/Puppy.jpg

and the software behind the back figures out what the exact latest revision is and under what /CapitalLetter/ directory it falls. These are things a human user shouldn't need to do or know about.

(Yes, I know, it is technically difficult. But I'd rather have you spend some time figuring it out and implementing it than have every Wikipedia user remember these little technicalities :)

If the plan is to hide all that, well, please forget my 0.02€.

Best wishes,

Tels

-- Signed on Mon Oct 24 18:58:02 2005 with key 0x93B84C15. Visit my photo gallery at http://bloodgate.com/photos/ PGP key on http://bloodgate.com/tels.asc or per email.

"Any sufficiently advanced technology is indistinguishable from a rigged demo." -- Andy Finkel, computer guy


Ivan Krstic

5:21 p.m.

New subject: Upload URL/filesystem restructuring

Tels wrote:

...

although I am not sure why the "/P/" needs to be visible to the user (it is deterministic, after all), and it would be handy to have a "latest" revision URL. Which could be just: http://upload.wikimedia.org/wikipedia/en/Puppy.jpg and the software behind the back figures out what the exact latest revision is and under what /CapitalLetter/ directory it falls.

Please read Brion's original message again.

...

These are things a human user shouldn't need to do or know about.

Serving static content usually amounts to just pushing bits, and that's all it should be. If you want it to be more than that, you need to call into a language that will supply an additional layer of logic before you can start pushing bits, which is usually enormously slower than being able to say "here's a link to a static file". You can serve the latter with all sorts of optimizations -- one example being in-kernel httpd, or one of the very fast userland ones.

-- Ivan Krstic krstic@fas.harvard.edu | 0x147C722D

Tels

25 Oct

3:58 p.m.

New subject: Upload URL/filesystem restructuring


Moin,

On Monday 24 October 2005 19:21, Ivan Krstic wrote:

...

Tels wrote:

...
although I am not sure why the "/P/" needs to be visible to the user (it is deterministic, after all), and it would be handy to have a "latest" revision URL. Which could be just: http://upload.wikimedia.org/wikipedia/en/Puppy.jpg and the software behind the back figures out what the exact latest revision is and under what /CapitalLetter/ directory it falls.

Please read Brion's original message again.

Duh, sorry, I was mighty confused %~/ Please ignore any noise coming from me. :)

Best wishes,

Tels

-- Signed on Tue Oct 25 17:58:14 2005 with key 0x93B84C15. Visit my photo gallery at http://bloodgate.com/photos/ PGP key on http://bloodgate.com/tels.asc or per email.

"Now, _you_ behave!"


Rowan Collins

24 Oct

6:03 p.m.

New subject: Upload URL/filesystem restructuring

Various people wrote: [I've picked this response just because it's the most recent, various people have been making similar points.]

...

...
So you might have something like: http://upload.wikimedia.org/584/590/5845907fdfc6eb1125129c4ce0da0704c496a7e4.jpg

Obviously a disadvantage is that the filenames are ugly. One might tack a 'pretty' but ignored filename on the end, using rewrites or whatever tool to drop it on the backend:

http://upload.wikimedia.org/584/590/5845907fdfc6eb1125129c4ce0da0704c496a7e4/Puppy.jpg

Which is still very human-unfriendly. I couldn't remember this URL even if my life depended on it!

The question that occurs to me is, quite simply, why do humans ever *need* to know or manipulate such URIs?

* anyone wanting to "bookmark" a particular image will want to link to its description page (to show copyright info, possible replacements, etc); if that's not the case, we need to redesign our image description pages (this may be the case w.r.t. Commons).

* anyone *saving* the image ("downloading" it, as they would describe it) would only see the *filename* part (as the default name); as long as we tack on the "friendly name" at the end (even just to ignore) the rest of the URI can be anything at all

* anyone wanting to include an image *inline* in an external site is abusing our bandwidth (either maliciously or just through naivety)

* somebody mentioned bot authors; but what purpose do bot authors have with the absolute URI of an image? Creating a static dump by screen-scraping rather than parsing the wikitext dump?

* a user-side renderer (e.g. WikiWyg, Pilaf's Live Preview) might need to know them to render fully, I suppose; like distinguishing "red" and "blue" internal links, this could ideally be done through some minimal "API", question and answer style

In short, I don't see any need for making these URIs "pretty", or even providing a Special page that redirects, except as a [bad] substitute for a "bot API" that allows you to request the current full URI. I *do*, however, see some very good reasons for tacking a pretty *filename* on as the last part, even if it's actually ignored (e.g. for "save as", as mentioned above).

Actually, we might want to do more than just ignore the pretty name, because (esp. knowing IE) it might be dangerous if http://..../abc123/Puppy.jpeg and http://..../abc123/Puppy.txt are valid URIs for the same file. To keep things static, we don't really want to check this explicitly, but perhaps the "pretty bits" could actually exist on the filesystem, as symlinks or such:

http://.../abc123/abc123 [actual content]
http://..../abc123/Puppy.jpeg [symlink to above]
http://..../abc123/JSC0123.jpg [symlink to above; old or alternative name]
http://..../abc123/Haxx0r.txt [no symlink here, so returns HTTP 404]
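
A minimal Python sketch of that symlink layout, assuming a POSIX filesystem. The `publish` and `lookup` helpers are made-up names, and `lookup` only imitates what a static file server would do (serve the path if it exists, otherwise 404); it does no path sanitising.

```python
import os
import tempfile


def publish(root, digest, data, names):
    """Write the content as <root>/<digest>/<digest>, then add one
    symlink per allowed pretty name in the same directory."""
    directory = os.path.join(root, digest)
    os.makedirs(directory, exist_ok=True)
    with open(os.path.join(directory, digest), "wb") as f:
        f.write(data)
    for name in names:
        link = os.path.join(directory, name)
        if not os.path.lexists(link):
            # Relative target, so the directory can be moved or mirrored.
            os.symlink(digest, link)
    return directory


def lookup(root, url_path):
    """Imitate a static server: 200 if the path resolves to a file, else 404."""
    full = os.path.join(root, url_path)
    if os.path.isfile(full):  # follows symlinks
        with open(full, "rb") as f:
            return 200, f.read()
    return 404, b""


with tempfile.TemporaryDirectory() as root:
    publish(root, "abc123", b"image bytes", ["Puppy.jpeg", "JSC0123.jpg"])
    status_pretty, body_pretty = lookup(root, "abc123/Puppy.jpeg")
    status_old, _ = lookup(root, "abc123/JSC0123.jpg")
    status_bad, _ = lookup(root, "abc123/Haxx0r.txt")
```

Only names that were explicitly linked resolve, so `abc123/Haxx0r.txt` gets a 404 without any per-request logic on the server.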

-- Rowan Collins BSc [IMSoP]

Brion Vibber

25 Oct

7:47 a.m.

New subject: Upload URL/filesystem restructuring

I'm putting further notes at: http://www.mediawiki.org/wiki/1.6_image_storage

-- brion vibber (brion @ pobox.com)

wikitech-l@lists.wikimedia.org
