http://techblog.wikimedia.org/2009/09/full-tiff-support-is-comming/
Funded by WMDE. Nice one :-)
- d.
On Fri, Sep 11, 2009 at 1:41 PM, David Gerard dgerard@gmail.com wrote:
http://techblog.wikimedia.org/2009/09/full-tiff-support-is-comming/
Funded by WMDE. Nice one :-)
I don't suppose there's any easy way to use an automatic JPG thumbnail... but only if the browser appears not to have an appropriate plug-in for displaying TIFFs inline.
—C.W.
2009/9/11 Charlotte Webb charlottethewebb@gmail.com:
On Fri, Sep 11, 2009 at 1:41 PM, David Gerard dgerard@gmail.com wrote:
http://techblog.wikimedia.org/2009/09/full-tiff-support-is-comming/ Funded by WMDE. Nice one :-)
I don't suppose there's any easy way to use an automatic JPG thumbnail... but only if the browser appears not to have an appropriate plug-in for displaying TIFFs inline.
Most wouldn't. But even for browsers with SVG support, MediaWiki thumbnails to PNG (because the browser SVG rendering is, at present, still very slow and rather buggy).
- d.
On Fri, Sep 11, 2009 at 3:12 PM, David Gerard dgerard@gmail.com wrote:
Most wouldn't. But even for browsers with SVG support, MediaWiki thumbnails to PNG (because the browser SVG rendering is, at present, still very slow and rather buggy).
Could be feature creep, but I was just thinking that those few whose browsers do support TIFF might, for whatever reason, prefer thumbnails which are in fact scaled TIFFs.
—C.W.
Charlotte Webb schrieb:
On Fri, Sep 11, 2009 at 3:12 PM, David Gerard dgerard@gmail.com wrote:
Most wouldn't. But even for browsers with SVG support, MediaWiki thumbnails to PNG (because the browser SVG rendering is, at present, still very slow and rather buggy).
Could be feature creep, but I was just thinking that those few whose browsers do support TIFF might, for whatever reason, prefer thumbnails which are in fact scaled TIFFs.
I couldn't think of a good reason for wanting that. Direct access to the full TIFF, yes - but you have that by clicking through the description page to the file. That will trigger the browser's native rendering, just as with SVG. I expect that for the things for which you want native TIFF support, you would want full resolution anyway.
-- daniel
On Sat, Sep 12, 2009 at 6:08 AM, Charlotte Webb charlottethewebb@gmail.com wrote:
On Fri, Sep 11, 2009 at 1:41 PM, David Gerard dgerard@gmail.com wrote:
http://techblog.wikimedia.org/2009/09/full-tiff-support-is-comming/
Funded by WMDE. Nice one :-)
I don't suppose there's any easy way to use an automatic JPG thumbnail... but only if the browser appears not to have an appropriate plug-in for displaying TIFFs inline.
The DjVu and PDF handlers provide thumbnails as JPEGs.
I would expect that the TIFF handler will also do this, otherwise it will not integrate with the Proofread Page extension.
-- John Vandenberg
I don't suppose there's any easy way to use an automatic JPG thumbnail... but only if the browser appears not to have an appropriate plug-in for displaying TIFFs inline.
The DjVu and PDF handlers provide thumbnails as JPEGs.
I would expect that the TIFF handler will also do this, otherwise it will not integrate with the Proofread Page extension.
On a related note, is there any chance of getting the ability to specify thumbnail formats when adding an image to a page, and/or of determining what format to use heuristically? We already support PNG, which is a lossless format with wide support and better compression than TIFF, but if you upload a photo as a PNG at the moment the thumbnails will also be PNGs, which is generally bad.
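The heuristic hinted at above could key off the source format and image characteristics; a minimal sketch in Python (the function and its inputs are hypothetical, not MediaWiki's actual thumbnailing API):

```python
def pick_thumb_format(source_format: str, has_alpha: bool, is_photo: bool) -> str:
    """Choose an output format for a scaled-down thumbnail.

    Photographic content stored losslessly (TIFF, PNG) compresses far
    better as a JPEG thumbnail; graphics with transparency need PNG.
    """
    if has_alpha:
        return "png"    # JPEG cannot represent transparency
    if source_format in ("tiff", "png") and is_photo:
        return "jpeg"   # a scaled photo gains nothing from lossless output
    if source_format in ("svg", "pdf", "djvu"):
        return "jpeg" if is_photo else "png"
    return source_format  # jpeg stays jpeg, gif stays gif, etc.
```

Detecting "is this a photo" is the hard part in practice (colour-count or frequency statistics are common proxies); letting the uploader override the guess, as suggested above, sidesteps that.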
Hoi, The objective here is NOT to have PNG but to support TIFF. This is what gets us credibility with our partners. This is what gives us the provenance we care about: representing material as it has been given to us. In this light, urging the use of PNG instead of TIFF is counterproductive. Thanks, GerardM
2009/9/12 peter green plugwash@p10link.net
Commons-l mailing list Commons-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/commons-l
Gerard Meijssen wrote:
Hoi, The objective here is NOT to have PNG but to support TIFF. This is what gets us credibility with our partners. This is what gives us the provenance we care about: representing material as it has been given to us. In this light, urging the use of PNG instead of TIFF is counterproductive.
Nobody says otherwise. Please read again. What is suggested, and I think it is a Good Thing (R), is to render TIFF as JPEG.
Thanks, GerardM
Regards,
Yann
peter green schrieb:
On a related note, is there any chance of getting the ability to specify thumbnail formats when adding an image to a page, and/or of determining what format to use heuristically? We already support PNG, which is a lossless format with wide support and better compression than TIFF, but if you upload a photo as a PNG at the moment the thumbnails will also be PNGs, which is generally bad.
We have actually discussed this for TIFFs; it's an optional feature in the spec. I can't promise that we'll get it, but it would make sense, and it shouldn't be hard to add even if it's not included in the first iteration.
-- daniel
On Sat, Sep 12, 2009 at 4:41 AM, David Gerard dgerard@gmail.com wrote:
http://techblog.wikimedia.org/2009/09/full-tiff-support-is-comming/
Funded by WMDE. Nice one :-)
\o/
This is most relevant for Wikisource (of course... ;-). Many people walk away from Wikisource because they can't grapple with this DjVu format we love. The PDF handler has recently been installed, after waiting for two years. Here is an example of a PDF being transcribed on German Wikisource.
http://de.wikisource.org/wiki/Index:Geschichte_von_Berthelsdorf
These handlers for commonly used formats help other projects, because many people don't know how to losslessly transcode between different formats, so it is better that they upload what they have and let someone else do the conversion.
There are many PDFs on English Wikipedia that have been deleted because they are "non-media without encyclopedic value" (groan).
For example, here is one I have now undeleted:
https://secure.wikimedia.org/wikipedia/en/wiki/File:2008_Sichuan_Earthquake_...
and put it onto the relevant article.
https://secure.wikimedia.org/wikipedia/en/w/index.php?title=2008_Sichuan_ear...
Hopefully the TIFF handler doesn't take 2 years to install like the PDF handler.
https://bugzilla.wikimedia.org/show_bug.cgi?id=11215
-- John Vandenberg
John Vandenberg, 12/09/2009 01:23:
This is most relevant for Wikisource (of course... ;-). Many people walk away from Wikisource because they can't grapple with this DjVu format we love.
Even with DjVu, the 100 MB limit is often too low: how can TIFF be useful with multi-paged huge documents?
Nemo
Hoi. The main reason we need TIFF support is that museums and archives typically store their digitised material as TIFF. When we receive digitised material from a partnering organisation, the best way to get it is as a TIFF. Usually we get them as JPEGs, but when we have a relationship like the one with the Tropenmuseum, we can request a super-high-resolution TIFF for restoration. It is important to have this picture as a TIFF on Commons, because in that way we maintain a link with the original material. In this way we also prove that the restoration is a best-effort practice: making historical material into something that retains its authenticity but is useful as an illustration.
Yes, 100 MB is not big even for single-page documents. The biggest file we want to upload but cannot is over 600 MB. Obviously such files are a minority. Thanks, GerardM
2009/9/12 Federico Leva (Nemo) nemowiki@gmail.com
On Sat, Sep 12, 2009 at 8:41 AM, Gerard Meijssen gerard.meijssen@gmail.com wrote:
Yes, 600 MB files will indeed be the minority of TIFF files - for a few years yet at least :-) And even then, TIFF files will be in the minority relative to JPEG. *But* what makes supporting the TIFF format so important for working with museums and galleries is that it demonstrates that we can do very good work with (and take professional care of) their high-quality images. If we can only support compressed formats, then the museums and galleries can quite legitimately ask us why we would want high-res files. At that point it returns us to arguing over pixel widths... So supporting lossless file formats is not only good for Wikisource texts and for image restorationists; it is also good for our future negotiations with museums and galleries.
So, keep up the good work guys!
On Sat, Sep 12, 2009 at 6:52 PM, Liam Wyatt liamwyatt@gmail.com wrote:
Yes, 600 MB files will indeed be the minority of TIFF files - for a few years yet at least :-) And even then, TIFF files will be in the minority relative to JPEG. *But* what makes supporting the TIFF format so important for working with museums and galleries is that it demonstrates that we can do very good work with (and take professional care of) their high-quality images. If we can only support compressed formats, then the museums and galleries can quite legitimately ask us why we would want high-res files. At that point it returns us to arguing over pixel widths...
Wrong. TIFF is a container format. The contents of a TIFF file can be uncompressed, losslessly compressed, or lossily compressed.
PNG/MNG is a container format. It too can hold uncompressed, losslessly compressed, or lossily compressed content.
DjVu and PDF can also contain uncompressed, losslessly compressed, or lossily compressed chunks.
So, not only is supporting lossless file formats good for WikiSource texts and good for image restorationists, it is also good for our future negotiations with museums and galleries.
TIFF support is good for future negotiations because it means that we don't need to argue with them about which file formats are better, or concern ourselves with the feasibility of transcoding large collections.
-- John Vandenberg
Hoi, John you are right. TIFF can be everything you describe. The question I am left with is: what is your point? Material is scanned without compression by GLAMs; we get it, as standard, as TIFF files, and we restore it. When the material is compressed, we do not restore it. We need to retain the original to demonstrate provenance. It is problematic to have files nobody can see in a standard way. This is why we need TIFF support, because otherwise we are likely to find an admin who starts deleting this essential material. Thanks, GerardM
2009/9/13 John Vandenberg jayvdb@gmail.com
On Sun, Sep 13, 2009 at 4:33 PM, Gerard Meijssen gerard.meijssen@gmail.com wrote:
My point is that we *can* _losslessly_ transcode TIFF files to PNG/MNG.
Provenance requires that we know where the original digitised copy is (an identifier), but we don't need to have a copy of the original TIFF if we have a PNG with the same quality.
TIFF support means we don't need to worry about transcoding, or have fights about TIFF vs PNG vs PDF/A. That will be good, as it is a hurdle in working with GLAMs, but it is not what prevents high-quality images or work with GLAMs, as transcoding is not a difficult process.
The main problem at the moment is the upload limit.
-- John Vandenberg
Hoi, Remember the archive of Cologne? It collapsed. When our policy is to have on Commons the exact files as provided by a GLAM, we prove provenance, because our best practice is to include the material in our archive exactly as provided by the GLAM. When we decide, for all kinds of reasons, to transcode it to another format, we can and should when it makes sense. It makes sense as long as we keep the original.
Your point that we do not need to have the original copy is wrong. Keeping the original files as a best practice, and an important one, was confirmed in all the dealings I have had with many GLAMs. Thanks, GerardM
2009/9/13 John Vandenberg jayvdb@gmail.com
On Sun, Sep 13, 2009 at 7:57 PM, Gerard Meijssen gerard.meijssen@gmail.com wrote:
We are not an archive.
If an archive collapses, they should take care of the items in their possession, gifting them to someone else who can take them. If they don't, they are a very poor archive.
-- John Vandenberg
Hoi, The archive of Cologne was a state-of-the-art archive in Germany. If anything, it was the kind of institution that people would donate collections to because of its excellent reputation and practices. A similar story happened to an archive in Theresienstadt, if I remember correctly. Cologne collapsed because of an accident with an underground line that was being dug; Theresienstadt saw its collection disappear in flames.
We are not an archive. But our best efforts help safeguard against the total loss of an archive. We can be complementary to what our partners do. We should not think in such an insular way. Thanks, GerardM
2009/9/13 John Vandenberg jayvdb@gmail.com
On Sun, Sep 13, 2009 at 8:26 PM, Gerard Meijssen gerard.meijssen@gmail.com wrote:
If they do not have proper archival practices (i.e. redundancy / tape backup), we should tell them about the Internet Archive.
It is insular to think we need to be the solution for problems we are not well prepared to solve.
-- John Vandenberg
2009/9/13 John Vandenberg jayvdb@gmail.com:
If they do not have proper archival practices (i.e. redundancy/tape backup), we should tell them about Internet Archive. It is insular to think we need to be the solution for problems we are not well prepared to solve.
No indeed. However, it certainly doesn't hurt, and archival copies of this material are certainly within our remit.
(Though the Internet Archive is freer about which licenses it accepts, and will gleefully accept archives of just about anything.)
- d.
On Sat, Sep 12, 2009 at 7:31 PM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
Gerard Meijssen, 12/09/2009 10:41:
The main reason we need TIFF support is that museums and archives typically store their digitised material as TIFF.
I know (all of us know), but John pointed out that TIFF is useful also for Wikisource. :-)
People say "archives", but do not recognise or care that archives consist primarily of *text*.
-- John Vandenberg
John Vandenberg, 13/09/2009 01:18:
On Sat, Sep 12, 2009 at 7:31 PM, Federico Leva (Nemo)
I know (all of us know), but John pointed out that TIFF is useful also for Wikisource. :-)
People say "archives", but do not recognise or care that archives consist primarily of *text*.
That's obvious, and [scanned] texts are more valuable than images, but precisely for this reason they are more difficult to manage. We can relatively easily upload a bunch of images, tag and categorize them, use them in galleries and in Wikipedia and Wikiquote articles, correct errors, etc. Thus we are focusing on image "donations", which can give an immediate result. But what could we do with a bunch of scanned texts? (The only partnership here was the one of WM-FR with ENVT, if I remember correctly.) See http://strategy.wikimedia.org/wiki/Proposal:Make_Wikisource_scale
I don't think the problem here is the file format: large-scale transcoding is feasible - see the Internet Archive, where you can put in a huge PDF and download a wonderful DjVu: http://www.archive.org/details/VocabolarioAccademiciCruscaEd3Vol3 (they don't use TIFF, I suppose because it demands disk space and they already have JPEG images), although they use non-free commercial software.
Nemo
Federico Leva (Nemo) schrieb:
John Vandenberg, 12/09/2009 01:23:
This is most relevant for Wikisource (of course... ;-). Many people walk away from Wikisource because they can't grapple with this DjVu format we love.
Even with DjVu, the 100 MB limit is often too low: how can TIFF be useful with multi-paged huge documents?
I don't think we'll see very many huge TIFF files. But generally, uploading very large files is possible by doing a server-side import. Setting up a process for this, and perhaps some nice tools, is something I have been thinking about. Perhaps we'll offer a contract for that next year :)
-- daniel
Daniel Kinzler wrote:
Process? Tools? It would just be making a 'bigupload' right for people to bypass file size restrictions (or have an extremely high one). Then give it to sysops or a new group.
The only concern is keeping those users from uploading gigantic files when storage nodes are running out of space. It could even be automated, so the 'bigupload' right would only be effective if there is at least X bytes or X% free space on disk.
Of course we would also need an interface able, at the least, to resume interrupted uploads, to make it really useful. I made a proposal years ago based on an FTP upload interface. Maybe you are referring to something similar. Please keep me posted. Upload from URL and Firefogg should alleviate the issue, though.
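The free-space gate suggested above is straightforward to sketch; everything below (the path, the 10% margin, the function name) is illustrative, not an existing MediaWiki setting:

```python
import shutil

def bigupload_allowed(path="/srv/media", min_free_fraction=0.10,
                      usage=shutil.disk_usage):
    """Honour the 'bigupload' right only while the storage volume
    holding `path` still has a safety margin of free space."""
    total, used, free = usage(path)
    return free / total >= min_free_fraction
```

Passing `usage` in makes the check testable, and would also let the wiki consult the storage nodes rather than the local disk.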
Platonides schrieb:
Process? Tools? It would just be making a 'bigupload' right for people to bypass file size restrictions (or have an extremely high one). Then give it to sysops or a new group.
Tell me if I'm wrong, but as far as I know, the file size is limited by PHP, not by MediaWiki. And it has to be: if we admitted huge files to be uploaded before they are finally rejected by MediaWiki, that would already be an attack vector - because, afaik, PHP had the dumb idea of buffering uploads in RAM. So, to kill the server, just upload a 5 GB file.
The only concern is keeping those users from uploading gigantic files when storage nodes are running out of space. It could even be automated, so the 'bigupload' right would only be effective if there is at least X bytes or X% free space on disk.
If this worked, that would be cool. But as I said, afaik it's not possible, for technical reasons.
Of course we would also need an interface able at least to continue interrupted uploads, to make it really useful.
That would be helpful. Also helpful would be the ability to upload archive files containing multiple images. If we have a way to deal with uploading big files, this would become feasible.
I did a proposal years ago based on a FTP upload interface. Maybe you are referring to something similar. Please keep me posted. Upload from URL and Firefogg should alleviate the issue, though.
A relatively simple way would be to allow big files to be uploaded via FTP or any other protocol to "dumb storage", and then transfer and import them server-side. I'd propose a ticket system for this: people with a special right can generate a ticket good for uploading one file, for instance. But it's just an idea so far.
-- daniel
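The ticket idea above maps naturally onto signed, expiring tokens; a standard-library sketch (the secret handling and field layout are invented for illustration, and single-use bookkeeping of nonces is omitted):

```python
import hashlib
import hmac
import secrets
import time

SECRET = b"server-side secret"  # placeholder; would come from server config


def issue_ticket(filename, ttl=86400, now=time.time):
    """Create a ticket good for importing exactly one named file."""
    expires = int(now()) + ttl
    nonce = secrets.token_hex(8)
    msg = f"{filename}|{expires}|{nonce}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return f"{filename}|{expires}|{nonce}|{sig}"


def check_ticket(ticket, filename, now=time.time):
    """Verify the signature and expiry, and that the ticket names this file."""
    name, expires, nonce, sig = ticket.rsplit("|", 3)
    expected = hmac.new(SECRET, f"{name}|{expires}|{nonce}".encode(),
                        hashlib.sha256).hexdigest()
    return (hmac.compare_digest(sig, expected)
            and name == filename
            and int(expires) > now())
```

The server-side import job would then accept a staged file only when it arrives with a valid ticket.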
Daniel Kinzler schrieb:
Platonides schrieb:
Process? Tools? It would just be making a 'bigupload' right for people to bypass file size restrictions (or have an extremely high one). Then give it to sysops or a new group.
Tell me if I'm wrong, but as far as I know, the file size is limited by PHP, not by MediaWiki. And it has to be: if we admitted huge files to be uploaded before they are finally rejected by MediaWiki, that would already be an attack vector - because, afaik, PHP had the dumb idea of buffering uploads in RAM. So, to kill the server, just upload a 5 GB file.
Really? It makes sense for text POSTs but it's not very smart for files...
Of course we would also need an interface able at least to continue interrupted uploads, to make it really useful.
That would be helpful. Also helpful would be the ability to upload archive files containing multiple images. If we have a way to deal with uploading big files, this would become feasible.
I did a proposal years ago based on a FTP upload interface. Maybe you are referring to something similar. Please keep me posted. Upload from URL and Firefogg should alleviate the issue, though.
A relatively simple way would be to allow big files to be uploaded via FTP or any other protocol to "dumb storage", and then transfer and import them server-side. I'd propose a ticket system for this: people with a special right can generate a ticket good for uploading one file, for instance. But it's just an idea so far.
-- daniel
I was thinking of an FTP server where you log in with your wiki credentials and get a private temporary folder. You can view pending files, delete, rename, append to, and create new ones (you can't read them, though, to avoid it being used as a file-sharing service). You are given a quota, so you could upload a few large files or many small ones. Files get deleted after X time untouched. When you go to the page name the file would have on the wiki, there is a message reminding you of the pending upload and inviting you to finish it, where you get the normal upload fields. After transferring, the file becomes public and the file's size is returned to your quota. Having a specific protocol for uploads also allows storing them directly on the storage nodes, instead of writing them via NFS from the Apaches.
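The quota and expiry bookkeeping in the staging-folder design above can be sketched independently of the transport; all names and limits here are made up for illustration:

```python
import time


class StagingArea:
    """Per-user staging folder with a byte quota and idle expiry."""

    def __init__(self, quota=2 * 1024**3, max_idle=7 * 86400):
        self.quota = quota        # bytes each user may stage at once
        self.max_idle = max_idle  # seconds before an untouched file is dropped
        self.files = {}           # name -> (size, last_touched)

    def used(self):
        return sum(size for size, _ in self.files.values())

    def put(self, name, size, now=None):
        """Stage (or replace) a file, enforcing the quota."""
        now = time.time() if now is None else now
        replaced = self.files.get(name, (0, 0))[0]
        if self.used() - replaced + size > self.quota:
            raise OSError("quota exceeded")
        self.files[name] = (size, now)

    def expire(self, now=None):
        """Drop files untouched for longer than max_idle."""
        now = time.time() if now is None else now
        self.files = {n: (s, t) for n, (s, t) in self.files.items()
                      if now - t < self.max_idle}

    def publish(self, name):
        """Finishing the wiki-side upload frees the staged copy's quota."""
        self.files.pop(name, None)
```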
Platonides schrieb:
Tell me if I'm wrong, but as far as I know, the file size is limited by PHP, not by MediaWiki. And it has to be: if we admitted huge files to be uploaded before they are finally rejected by MediaWiki, that would already be an attack vector - because, afaik, PHP had the dumb idea of buffering uploads in RAM. So, to kill the server, just upload a 5 GB file.
Really? It makes sense for text POSTs but it's not very smart for files...
I didn't check for myself, but that's what I was told when we discussed this matter with Brion and Mark at FOSDEM.
And yes, it's utterly stupid. But that doesn't mean PHP won't do it.
Yes, that sounds pretty nice.
-- daniel
Michael Dale schrieb:
I would add that systems like Firefogg help a bit with chunked uploading, letting us accept larger files by breaking them up into 1 MB POST chunks on the client side. Also, the new upload API has improved HTTP copy-upload support, so if a 500 MB TIFF or 5 GB .ogg was FTP-uploaded to a remote server with HTTP serving, it could then be ingested over HTTP download rather than POST upload (which would be more resource-friendly / reliable with large files).
--michael
On a related note... have you looked at the AtomPub spec for uploading? I didn't look too closely, but it seems worth considering.
-- daniel
[snip]
On a related note... have you looked at the AtomPub spec for uploading? I didn't look too closely, but it seems worth considering.
Had not seen it before... Looks good, but I imagine its use would depend on a particular content partner uploading in that ingestion format, and on us having the time and resources to enhance the MediaWiki uploading system in a way that takes advantage of the AtomPub spec (i.e. not very likely to happen)... It's probably easier for the content provider to write something custom for Wikimedia than for us to try to write a general collections API / AtomPub support (and that assumes AtomPub is supported in some client/content-provider-side archive systems... which I am not sure is a reality).
--michael
Michael Dale wrote:
Not really an addition since I already mentioned them ;)
Upload from URL and Firefogg should alleviate the issue, though.
Are the "1 MB POST chunks" just RFC 2616 chunked transfer-coding applied to a POST body, or is it a custom protocol using several POSTs? There's an upload.txt file in docs/ but it's not very helpful.
Platonides wrote:
Not really an addition since I already mentioned them ;)
[sorry tuned in with the title change]
Are the "1 MB POST chunks" just RFC 2616 chunked transfer-coding applied to a POST body, or is it a custom protocol using several POSTs? There's an upload.txt file in docs/ but it's not very helpful.
Hmm, I don't think chunked transfer-coding helps with the issue of PHP putting the temporary POST file into memory. So we do it as several POSTs.
I did not write that upload.txt... but more upload documentation would be good. It's basically documented at the API level right now, and the chunk "protocol" is on firefogg.org: http://www.firefogg.org/dev/chunk_post.html
--michael
Hello,
Platonides wrote:
The Internet Archive has such a system, so it might be worth asking them rather than reinventing the wheel.
http://www.archive.org/services/contrib-submit.php (for simpler cases), or http://www.us.archive.org/contrib_submit.php (for more complex ones)
(Adding files to an existing item might make this a "more complex" case, requiring use of replacing=1 or update_mode=1.)
Information on use of http://www.archive.org/services/contrib-submit.php is available at: http://www.archive.org/help/contrib-advanced.php
and information on use of http://www.us.archive.org/contrib_submit.php is available at: http://www.us.archive.org/contrib_submit.php?help=1
Yann