Time and again, the 100 MB limit on file uploads is a problem, in particular for multipage documents (scanned books) in PDF or DjVu, and for video files in OGV.
What are the plans for increasing this limit? Would it be possible to allow 500 MB or 1 GB for these file formats, and maintain the lower limit for other formats?
2010/7/20 Lars Aronsson lars@aronsson.se:
Time and again, the 100 MB limit on file uploads is a problem, in particular for multipage documents (scanned books) in PDF or DjVu, and for video files in OGV.
What are the plans for increasing this limit? Would it be possible to allow 500 MB or 1 GB for these file formats, and maintain the lower limit for other formats?
There is support for chunked uploading in MediaWiki core, but it's disabled for security reasons AFAIK. With chunked uploading, you're uploading your file in chunks of 1 MB, which means that the impact of failure for large uploads is vastly reduced (if a chunk fails, you just reupload that chunk) and that progress bars can be implemented. This does need client-side support, e.g. using the Firefogg extension for Firefox or a bot framework that knows about chunked uploads. This probably means the upload limit can be raised, but don't quote me on that.
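Roughly, the client-side loop is as simple as the sketch below; the URL and its chunkIndex parameter are invented for illustration and not an existing API, and how the browser produces the chunks in the first place is exactly the part that needs client-side support.

    // Sketch: POST a list of already-prepared 1 MB Blobs one at a time.
    // Only a chunk that failed gets re-sent, never the whole file.
    // The URL and its chunkIndex parameter are hypothetical placeholders.
    function uploadChunks(chunks, url, onDone) {
        var i = 0;
        function sendNext() {
            if (i >= chunks.length) { onDone(); return; }
            var xhr = new XMLHttpRequest();
            xhr.open('POST', url + '?chunkIndex=' + i, true);
            xhr.onload = function () {
                if (xhr.status === 200) { i++; }  // advance only on success...
                sendNext();                       // ...otherwise retry this chunk
            };
            xhr.send(chunks[i]);
        }
        sendNext();
    }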
Roan Kattouw (Catrope)
On 07/20/2010 04:30 PM, Roan Kattouw wrote:
This does need client-side support, e.g. using the Firefogg extension for Firefox or a bot framework that knows about chunked uploads.
Requiring special client software is a problem. Is that really the only possible solution?
I understand that a certain webserver or PHP configuration can be a problem, in that it might receive the entire file in /tmp (that might get full) before returning control to some upload.php script. But I don't see why HTTP in itself would set a limit at 100 MB. What decides this particular limit? Why isn't it 50 MB or 200 MB?
Some alternatives would be to open a separate anonymous FTP upload ("requires special client software" -- from the 1980s, still in use by the Internet Archive) or a get-from-URL (server would download the file by HTTP GET from the user's server at a specified URL).
2010/7/20 Max Semenik maxsem.wiki@gmail.com:
On 20.07.2010, 19:12 Lars wrote:
Requiring special client software is a problem. Is that really the only possible solution?
There's also Flash, which can do it; however, it's being ignored due to its proprietary nature.
Java applet?
Roan Kattouw (Catrope)
Roan Kattouw wrote:
2010/7/20 Max Semenik maxsem.wiki@gmail.com:
On 20.07.2010, 19:12 Lars wrote:
Requiring special client software is a problem. Is that really the only possible solution?
There's also Flash, which can do it; however, it's being ignored due to its proprietary nature.
Java applet?
Roan Kattouw (Catrope)
Or a modern browser using FileReader.
http://hacks.mozilla.org/2010/06/html5-adoption-stories-box-net-and-html5-dr...
I hope to begin to address this problem with the new UploadWizard, at least the frontend issues. This isn't really part of our mandate, but I am hoping to add in chunked uploads for bleeding-edge browsers like Firefox 3.6+ and 4.0. Then you can upload files of whatever size you want.
I've written it to support what I'm calling multiple "transport" mechanisms: some use simple HTTP uploads, and some use more exotic methods like Mozilla's FileAPI.
At this point, we're not considering adding any new technologies like Java or Flash to the mix, although these are the standard ways that people do usable uploads on the web. Flash isn't considered open enough, and Java seemed like a radical break.
I could see a role for "helper" applets or SWFs, but it's not on the agenda at this time. Right now we're trying to deliver something that fits the bill, using standard MediaWiki technologies (HTML, JS, and PHP).
I'll post again to the list if I get a FileAPI upload working. Or, if someone is really interested, I'll help them get started.
On 7/20/10 11:28 AM, Platonides wrote:
Roan Kattouw wrote:
2010/7/20 Max Semenik maxsem.wiki@gmail.com:
On 20.07.2010, 19:12 Lars wrote:
Requiring special client software is a problem. Is that really the only possible solution?
There's also Flash, which can do it; however, it's being ignored due to its proprietary nature.
Java applet?
Roan Kattouw (Catrope)
Or a modern browser using FileReader.
http://hacks.mozilla.org/2010/06/html5-adoption-stories-box-net-and-html5-dr...
On Tue, Jul 20, 2010 at 2:28 PM, Platonides Platonides@gmail.com wrote:
Or a modern browser using FileReader.
http://hacks.mozilla.org/2010/06/html5-adoption-stories-box-net-and-html5-dr...
This would be best, but unfortunately it's not yet usable for large files -- it has to read the entire file into memory on the client. This post discusses a better interface that's being deployed:
http://hacks.mozilla.org/2010/07/firefox-4-formdata-and-the-new-file-url-obj...
But I don't think it actually addresses our use-case. We'd want the ability to slice up a File object into Blobs and handle those separately, and I don't see it in the specs. I'll ask. Anyway, I don't think this is feasible just yet, sadly.
On 7/20/10 6:34 PM, Aryeh Gregor wrote:
On Tue, Jul 20, 2010 at 2:28 PM, Platonides Platonides@gmail.com wrote:
Or a modern browser using FileReader.
http://hacks.mozilla.org/2010/06/html5-adoption-stories-box-net-and-html5-dr...
This would be best, but unfortunately it's not yet usable for large files -- it has to read the entire file into memory on the client. [...] But I don't think it actually addresses our use-case. We'd want the ability to slice up a File object into Blobs and handle those separately, and I don't see it in the specs. I'll ask. Anyway, I don't think this is feasible just yet, sadly.
Here's a demo that implements an EXIF reader for JPEGs in JavaScript, reading the file as a stream of bytes.
http://demos.hacks.mozilla.org/openweb/FileAPI/
So, as you can see, we do have a form of BLOB access.
So you're right that these newer Firefox File* APIs aren't what we want for uploading extremely large images (>50MB or so). But I can easily see using this to slice up anything smaller for chunk-oriented APIs.
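For what it's worth, the byte access looks roughly like the sketch below, using the FileReader calls from the Mozilla docs; the key limitation, as noted, is that the whole file lands in memory first.

    // Sketch: read the first `count` bytes of a user-selected file.
    // FileReader hands us the *entire* file as a binary string, which is
    // why this is fine for EXIF sniffing but not for >50 MB media files.
    function readFirstBytes(file, count, callback) {
        var reader = new FileReader();
        reader.onload = function () {
            callback(reader.result.slice(0, count)); // result is the whole file
        };
        reader.readAsBinaryString(file);
    }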
On Wed, Jul 21, 2010 at 12:31 AM, Neil Kandalgaonkar neilk@wikimedia.org wrote:
Here's a demo that implements an EXIF reader for JPEGs in JavaScript, reading the file as a stream of bytes.
http://demos.hacks.mozilla.org/openweb/FileAPI/
So, as you can see, we do have a form of BLOB access.
But only by reading the whole file into memory, right? That doesn't adequately address the use-case we're discussing in this thread (uploading files > 100 MB in chunks).
On 7/21/10 8:16 AM, Aryeh Gregor wrote:
So, as you can see, we do have a form of BLOB access.
But only by reading the whole file into memory, right? That doesn't adequately address the use-case we're discussing in this thread (uploading files > 100 MB in chunks).
That's what I said in the very next paragraph. :)
I just wanted to clarify that yes, we do have native chunking in some browsers; no, it's not adequate for very large media files.
It seems to me we're at an impasse unless we use one or more of the following approaches:
- We compromise on free software purity and use Flash helpers, etc. (assuming that approach works okay for ultra-large files).
- Something else (Java applets, Firefogg or other custom browser plugins)
- Mozilla, Chrome, or some other browser maker steps up with a plausible API. We should make the case to them. We see the Moz people regularly at the WMF and I know a few Chrome people, so maybe I can kick something off.
On Wed, Jul 21, 2010 at 12:51 PM, Neil Kandalgaonkar neilk@wikimedia.org wrote:
That's what I said in the very next paragraph. :)
I just wanted to clarify that yes, we do have native chunking in some browsers; no, it's not adequate for very large media files.
Yeah, I was just confused because I thought I said exactly the same thing. :)
- Mozilla, Chrome, or some other browser maker steps up with a plausible API. We should make the case to them. We see the Moz people regularly at the WMF and I know a few Chrome people, so maybe I can kick something off.
This is the right place to bring it up:
http://lists.w3.org/Archives/Public/public-webapps/
I think the right API change would be to just allow slicing a Blob up into other Blobs by byte range. It should be simple to both spec and implement. But it might have been discussed before, so best to look in the archives first.
On Wed, Jul 21, 2010 at 2:05 PM, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
This is the right place to bring it up:
http://lists.w3.org/Archives/Public/public-webapps/
I think the right API change would be to just allow slicing a Blob up into other Blobs by byte range. It should be simple to both spec and implement. But it might have been discussed before, so best to look in the archives first.
Aha, I finally found it. It's in the spec already:
http://dev.w3.org/2006/webapi/FileAPI/#dfn-slice
So once you have a File object, you should be able to call file.slice(pos, 1024*1024) to get a Blob object that's 1024*1024 bytes long starting at pos. Of course, this surely won't be reliably available in all browsers for several years yet, so best not to pin our hopes on it. Chrome apparently implements some or all of the File API in version 6, but I can't figure out if it includes this part. Firefox doesn't yet according to MDC.
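If slice() really is available, the chunked upload loop gets pretty small. Here's a sketch: the upload URL and its offset/done parameters are made up for illustration, and slice() is used with (start, length) arguments as in the draft linked above.

    // Sketch: cut a File into 1 MB Blobs with slice() and POST them in order.
    // Assumes Blob.slice(start, length) per the File API draft above and an
    // XHR that can send a Blob; the URL parameters are hypothetical.
    var CHUNK = 1024 * 1024;

    function uploadSliced(file, url, onDone) {
        var pos = 0;
        function sendNext() {
            var last = pos + CHUNK >= file.size;
            var blob = file.slice(pos, CHUNK); // (start, length) per the draft
            var xhr = new XMLHttpRequest();
            xhr.open('POST', url + '?offset=' + pos + (last ? '&done=1' : ''), true);
            xhr.onload = function () {
                if (xhr.status !== 200) { sendNext(); return; } // re-send this chunk only
                pos += CHUNK;
                if (last) { onDone(); } else { sendNext(); }
            };
            xhr.send(blob);
        }
        sendNext();
    }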
On 07/20/2010 03:59 PM, Roan Kattouw wrote:
Java applet?
I was just about to say: surely a FLOSS & decent Java applet has already been written?!
Or maybe I have too much hope :D
-Mike
On Tue, Jul 20, 2010 at 7:19 PM, Max Semenik maxsem.wiki@gmail.com wrote:
There's also Flash, which can do it; however, it's being ignored due to its proprietary nature.
May we drop our ideological concerns and implement multiple ways of uploading, including Flash and Java applets?
--vvv
Hello Lars,
I don't think the problem is raising it to 200 MB or 150 MB, but 500 MB or 1 GB are a lot higher and can cause problems.
Anonymous FTP access sounds like a very, very, very bad and evil solution...
Lars Aronsson wrote:
On 07/20/2010 04:30 PM, Roan Kattouw wrote:
This does need client-side support, e.g. using the Firefogg extension for Firefox or a bot framework that knows about chunked uploads.
Requiring special client software is a problem. Is that really the only possible solution?
It appears to be the only good solution for really large files. Anything with a progress bar requires client-side support. Flash or a Java applet would be enough, but both suck pretty badly.
I'd say: if people really need to upload huge files, it's ok to ask them to install a browser plugin.
I understand that a certain webserver or PHP configuration can be a problem, in that it might receive the entire file in /tmp (that might get full) before returning control to some upload.php script.
IIRC, PHP even tends to buffer the entire file in RAM(!) before writing it to /tmp. Which is totally insane, but hey, it's PHP. I think that was the original reason behind the low limit, but I might be wrong.
But I don't see why HTTP in itself would set a limit at 100 MB.
HTTP itself doesn't. I guess as long as we stay in the 31-bit range (about 2 GB), HTTP will be fine. Larger files may cause overflows in sloppy software.
However, HTTP doesn't allow people to resume uploads or watch progress (the latter could be done by browsers - sadly, I have never seen it). Thus, it sucks for very large files.
What decides this particular limit? Why isn't it 50 MB or 200 MB?
I think it was raised from 20 MB to 100 MB a year or two ago. It could be raised a bit again I guess, but a real solution for really large files would be better, don't you think?
Some alternatives would be to open a separate anonymous FTP upload ("requires special client software" -- from the 1980s, still in use by the Internet Archive) or a get-from-URL (server would download the file by HTTP GET from the user's server at a specified URL).
Making this sane and safe, while making sure we always know which user did what, etc., would be quite expensive. I have been thinking about this kind of thing for mass uploads (i.e. upload a TAR via FTP, have it unpacked on the server, import it). But that's another kettle of fish. Finishing chunked upload is better for the average user (using FTP to upload stuff is harder on the average guy than installing a Firefox plugin...).
-- daniel
On 20 July 2010 17:15, Daniel Kinzler daniel@brightbyte.de wrote:
It appears to be the only good solution for really large files. Anything with a progress bar requires client side support.
Chrome/Chromium has an upload progress indicator, and it's excellent. Really the best browser to use for uploading big stuff. So the client-side problem is being solved.
- d.
On 21/07/10 00:30, Roan Kattouw wrote:
There is support for chunked uploading in MediaWiki core, but it's disabled for security reasons AFAIK. With chunked uploading, you're uploading your file in chunks of 1 MB, which means that the impact of failure for large uploads is vastly reduced (if a chunk fails, you just reupload that chunk) and that progress bars can be implemented. This does need client-side support, e.g. using the Firefogg extension for Firefox or a bot framework that knows about chunked uploads. This probably means the upload limit can be raised, but don't quote me on that.
Firefogg support has been moved out to an extension, and that extension was not complete last time I checked. There was chunked upload support in the API, but it was Firefogg-specific; no client-neutral protocol has been proposed. The Firefogg chunking protocol itself is poorly thought out and buggy; it's not the sort of thing you'd want to use by choice with a non-Firefogg client.
Note that it's not necessary to use Firefogg to get chunked uploads; there are lots of available technologies which users are more likely to have installed already. See the "chunking" line in the support matrix at http://www.plupload.com/
When I reviewed Firefogg, I found an extremely serious CSRF vulnerability in it. They say they have fixed it now, but I'd still be more comfortable promoting better-studied client-side extensions, if we have to promote a client-side extension at all.
-- Tim Starling
On 7/20/10 8:08 PM, Tim Starling wrote:
The Firefogg chunking protocol itself is poorly thought out and buggy; it's not the sort of thing you'd want to use by choice with a non-Firefogg client.
What in your view would a better version look like?
The PLupload protocol seems quite similar. I might be missing some subtle difference.
I'd still be more comfortable promoting better-studied client-side extensions, if we have to promote a client-side extension at all.
I don't think we should be relying on extensions per se. Firefogg does do some neat things nothing else does, like converting video formats. But it's never going to be installed by a large percentage of our users.
As far as making uploads generally easier, PLupload's approach is way more generic since it abstracts away the "helper" technologies. It will work out of the box for maybe >99% of the web and provides a path to eventually transitioning to pure JS solutions. It's a really interesting approach and the design looks very clean. I wish I'd known about it before I started this project.
That said, it went public in early 2010, and a quick visit to its forums will show that it's not yet bug-free software either.
Anyway, thanks for the URL. We've gone the free software purist route with our uploader, but we may yet learn something from PLupload or incorporate some of what it does.
On 07/20/2010 10:08 PM, Tim Starling wrote:
Firefogg support has been moved out to an extension, and that extension was not complete last time I checked. There was chunked upload support in the API, but it was Firefogg-specific; no client-neutral protocol has been proposed. The Firefogg chunking protocol itself is poorly thought out and buggy; it's not the sort of thing you'd want to use by choice with a non-Firefogg client.
We did request feedback for the protocol. We wanted to keep it simple. We are open to constructive dialog for improvement.
When I reviewed Firefogg, I found an extremely serious CSRF vulnerability in it. They say they have fixed it now, but I'd still be more comfortable promoting better-studied client-side extensions, if we have to promote a client-side extension at all.
Yes, there was a CSRF vulnerability in a recently added feature. It was fixed and an update was deployed within hours of it being reported; that was over a year ago now. Firefogg has been reviewed and it has thousands of users. We are happy to do more reviewing. At one point we did some review with some Mozilla add-on folks, and we are happy to go through that process again. That is, of course, if a CSRF from a year ago does not permanently make the extension a lost cause?
peace, --michael
Lars Aronsson wrote:
What are the plans for increasing this limit? Would it be possible to allow 500 MB or 1 GB for these file formats, and maintain the lower limit for other formats?
As far as I know, we are hitting the limits of HTTP here. Increasing the upload limit as such isn't a solution, and a per-file-type setting doesn't help, since the limit kicks in before PHP is even started. It's at the server level.
The solution is "chunked uploads", which people have been working on for a while, but I have no idea what the current status is.
-- daniel
A few points:
* The reason for the 100 MB limit has to do with PHP and Apache and how the uploaded POST is stored in memory; setting the limit higher would risk increasing the chances of Apache processes hitting swap if multiple uploads happened on a given box.
* Modern HTML5 browsers are starting to be able to natively split files into chunks and do separate 1 MB XHR POSTs. The Firefogg extension does something similar with extension JavaScript.
* The server-side chunked uploading API support was split out into an extension by Mark Hershberger (cc'ed).
* We should really get the chunked uploading "reviewed" and deployed. Tim expressed some concerns with the chunked uploading protocol, which we addressed client-side, but I don't think he had time to follow up on the proposed changes we made for the server API. At any rate, I think the present protocol is better than a normal HTTP POST for large files: we get lots of manageable 1 MB chunks, a reset connection does not result in re-sending the whole file, and it works with vanilla PHP and Apache (other resumable HTTP upload protocols are more complicated and require PHP or Apache mods).
* The backend storage system will not be able to handle a large influx of large files for an extended period of time. All of Commons is only 10 TB or so and is on a "single" storage system. So an increase in upload size should be accompanied by an effort/plan to re-architect the backend storage system.
--michael
On 07/20/2010 09:32 AM, Daniel Kinzler wrote:
Lars Aronsson wrote:
What are the plans for increasing this limit? Would it be possible to allow 500 MB or 1 GB for these file formats, and maintain the lower limit for other formats?
As far as I know, we are hitting the limits of HTTP here. Increasing the upload limit as such isn't a solution, and a per-file-type setting doesn't help, since the limit kicks in before PHP is even started. It's at the server level.
The solution is "chunked uploads", which people have been working on for a while, but I have no idea what the current status is.
-- daniel
On 7/20/10 9:57 AM, Michael Dale wrote:
- The reason for the 100 MB limit has to do with PHP and Apache and how the uploaded POST is stored in memory; setting the limit higher would risk increasing the chances of Apache processes hitting swap if multiple uploads happened on a given box.
I've heard others say that -- this may have been true before, but I'm pretty sure it's not true anymore in PHP 5.2 or greater.
I've been doing some tests with large uploads (around 50 MB) and I don't observe any Apache process getting that large. Instead, it writes a temporary file. I checked out the source where it handles uploads, and it seems to take care not to slurp the whole thing into memory (lines 1061-1106).
http://svn.php.net/viewvc/php/php-src/trunk/main/rfc1867.c?view=markup
So, there may be other reasons not to upload a very large file, but I don't think this is one of them.
Michael Dale mdale@wikimedia.org writes:
- Modern HTML5 browsers are starting to be able to natively split files into chunks and do separate 1 MB XHR POSTs. The Firefogg extension does something similar with extension JavaScript.
Could you point me to the specs that the HTML5 browsers are using? Would it be possible to just make Firefogg mimic this same protocol for pre-HTML5 Firefox?
- We should really get the chunked uploading "reviewed" and deployed. Tim expressed some concerns with the chunked uploading protocol, which we addressed client-side, but I don't think he had time to follow up on the proposed changes we made for the server API.
If you can point me to Tim's proposed server-side changes, I'll have a look.
Mark.
On Wed, Jul 21, 2010 at 11:19 AM, Mark A. Hershberger mah@everybody.org wrote:
Could you point me to the specs that the HTML5 browsers are using? Would it be possible to just make Firefogg mimic this same protocol for pre-HTML5 Firefox?
The relevant spec is here:
Firefox 3.6 doesn't implement it exactly, since it was changed after Firefox's implementation, but the changes should mostly be compatible (as I understand it). But it's not good enough for large files, since it has to read them into memory.
But anyway, what's the point in telling people to install an extension if we can just tell them to upgrade Firefox? Something like two-thirds of our Firefox users are already on 3.6:
http://stats.wikimedia.org/wikimedia/squids/SquidReportClients.htm
On 21/07/10 00:32, Daniel Kinzler wrote:
Lars Aronsson wrote:
What are the plans for increasing this limit? Would it be possible to allow 500 MB or 1 GB for these file formats, and maintain the lower limit for other formats?
As far as I know, we are hitting the limits of HTTP here. Increasing the upload limit as such isn't a solution, and a per-file-type setting doesn't help, since the limit kicks in before PHP is even started. It's at the server level.
The problem is just that increasing the limits in our main Squid and Apache pool would create DoS vulnerabilities, including the prospect of "accidental DoS". We could offer this service via another domain name, with a specially-configured webserver, and a higher level of access control compared to ordinary upload to avoid DoS, but there is no support for that in MediaWiki.
We could theoretically allow uploads of several gigabytes this way, which is about as large as we want files to be anyway. People with flaky internet connections would hit the problem of the lack of resuming, but it would work for some.
-- Tim Starling
Tim Starling wrote:
The problem is just that increasing the limits in our main Squid and Apache pool would create DoS vulnerabilities, including the prospect of "accidental DoS". We could offer this service via another domain name, with a specially-configured webserver, and a higher level of access control compared to ordinary upload to avoid DoS, but there is no support for that in MediaWiki.
We could theoretically allow uploads of several gigabytes this way, which is about as large as we want files to be anyway. People with flaky internet connections would hit the problem of the lack of resuming, but it would work for some.
-- Tim Starling
I don't think it would be a problem for MediaWiki if we wanted to go this route. There could be, e.g., http://upload.en.wikipedia.org/ which redirected all wiki pages but Special:Upload to http://en.wikipedia.org/
The "normal" Special:Upload would need a redirect there, for accesses not going via $wgUploadNagivationUrl, but that's a couple of lines.
Having the normal Apaches handle uploads instead of a dedicated pool has some issues, including the DoS you mention, filled /tmp partitions, needing write access to storage via NFS...
On 07/20/2010 10:24 PM, Tim Starling wrote:
The problem is just that increasing the limits in our main Squid and Apache pool would create DoS vulnerabilities, including the prospect of "accidental DoS". We could offer this service via another domain name, with a specially-configured webserver, and a higher level of access control compared to ordinary upload to avoid DoS, but there is no support for that in MediaWiki.
We could theoretically allow uploads of several gigabytes this way, which is about as large as we want files to be anyway. People with flaky internet connections would hit the problem of the lack of resuming, but it would work for some.
Yes, in theory we could do that ... or we could support some simple chunked uploading protocol for which there is *already* basic support written, and which will be supported in native JS over time.
The Firefogg protocol is almost identical to the Plupload protocol. The main difference is that Firefogg requests a unique upload parameter/URL back from the server, so that if you uploaded identically named files they would not mangle each other's chunking. From a quick look at Plupload's upload.php, it appears Plupload relies on the filename and an extra "chunk" request parameter (!= 0). The other difference is that Firefogg sends an explicit done=1 request parameter to signify the end of the chunks.
We requested feedback on adding a chunk ID to the Firefogg chunk protocol with each posted chunk, to guard against cases where the outer caches report an error but the backend got the chunk anyway. This way the backend can check the chunk index and not append the same chunk twice, even if there are errors at other levels of the server response that cause the client to resend the same chunk.
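To make that concrete, the request flow we have in mind looks roughly like the sketch below. This is only a sketch: initUrl, chunkUrl, chunkIndex, done, and the JSON response field are approximations of what's described above, not the exact Firefogg wire format.

    // Sketch of the protocol flow described above; all names are approximate.
    function postTo(url, body, onOk) {
        var xhr = new XMLHttpRequest();
        xhr.open('POST', url, true);
        xhr.onload = function () {
            if (xhr.status === 200) { onOk(xhr.responseText); }
        };
        xhr.send(body);
    }

    function firefoggStyleUpload(file, initUrl) {
        // 1. Ask the server for a unique per-upload chunk URL, so two
        //    identically named files can't mangle each other's chunking.
        postTo(initUrl + '?filename=' + encodeURIComponent(file.name), null, function (resp) {
            var chunkUrl = JSON.parse(resp).chunkUrl; // hypothetical response field
            var CHUNK = 1024 * 1024, index = 0, pos = 0;
            (function sendNext() {
                var last = pos + CHUNK >= file.size;
                // 2. Each chunk carries its index so the backend can skip a
                //    duplicate append if an outer cache reported an error but
                //    the chunk actually arrived.
                // 3. done=1 accompanies the final chunk to end the upload.
                postTo(chunkUrl + '?chunkIndex=' + index + (last ? '&done=1' : ''),
                       file.slice(pos, CHUNK),
                       function () {
                           pos += CHUNK;
                           index += 1;
                           if (!last) { sendNext(); }
                       });
            }());
        });
    }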
Either way, if Tim says that the Plupload chunk protocol is "superior", then why discuss it? We can easily shift the chunks API to that and *move forward* with supporting larger file uploads. Is that at all agreeable?
peace, --michael