Message: 5
Date: Wed, 24 Nov 2010 15:46:24 -0800
From: Erik Moeller erik@wikimedia.org
Subject: Re: [Wikitech-l] Commons ZIP file upload for admins
To: Wikimedia developers wikitech-l@lists.wikimedia.org
Message-ID: AANLkTimD7kXngs4azgPanR_84Ok_th9T1DsANc7stkSh@mail.gmail.com
Content-Type: text/plain; charset=ISO-8859-1
[Kicking this thread back to life, full-quoting below only for quick reference.]
I've collected some additional notes on this here: http://commons.wikimedia.org/wiki/Commons:Restricted_uploads
Would appreciate feedback & will circulate further in the Commons community.
Thanks, Erik
Personally I think it would be nicer if you could associate source files with the final files. Something like:
* User uploads a jpeg of a 3D image (or whatever).
* On the image description page for the jpeg, there is an "upload source file" link.
* Users (who have appropriate permissions) can upload the associated source files with this link.
* These source files might appear as a subpage of the primary image/document/media, or they might just appear in list form at the bottom of the image description page of the main image/media. Either way, the source files would be associated with a single "main" file.
Doing it this way would limit the feature to source files of actually uploaded files (so less random cruft lying around, no orphaned source files, less chance of people abusing the feature to get around file type restrictions). I also personally don't like the idea of uploading archives. Instead I think it would be better just to upload all the source files needed (although that might fall apart if you're uploading source files for something very complex which has many source files in a specific directory structure). There could also be a "download all" option where all the source files get tar'ed together on the server side for an easy download.
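A minimal sketch of the server-side "download all" idea, in Python purely for illustration (MediaWiki itself is PHP). The function name `bundle_sources` and the in-memory dict of files are assumptions made for this example and do not come from the thread.

```python
import io
import tarfile


def bundle_sources(files: dict[str, bytes]) -> bytes:
    """Pack {name: content} into an uncompressed tar archive held in memory.

    `files` is a hypothetical stand-in for the source files attached to one
    "main" file; a real implementation would read them from file storage.
    """
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, content in sorted(files.items()):
            info = tarfile.TarInfo(name=name)
            info.size = len(content)  # tar headers need the size up front
            archive.addfile(info, io.BytesIO(content))
    return buffer.getvalue()
```

Because the member names may contain slashes, a directory structure among the source files would survive the round trip.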
-bawolff
2010/11/25 bawolff bawolff+wn@gmail.com:
Personally I think it would be nicer if you could associate source files with the final files.
Yeah, this was discussed a bit earlier in this thread. As far as I can tell, that approach adds a fair degree of complexity (requirement of tracking a whole new class of files in association with other files, including versioning, deletion, etc.). It also seems to presume that you'd never want to reference those same files using standard MediaWiki links. It's not clear to me that such a system has clear advantages over using normal wiki-links to source files from appropriate places.
Stepping back a bit, I did a bit more research over the weekend as to the current state of sourcing in Wikimedia Commons, and which file types would be the most important to support.
Generally speaking, there's an existing (albeit limited) practice of adding sources that can be represented as simple plain-text files, such as POV-Ray, Gnuplot, etc. Sometimes these are formatted using the syntax-highlighting extension, sometimes not. This practice could be made more formal by directly requesting that users add source data when they specify that a file has been created using one of these applications (which is often identified using "Created with" templates). But I don't necessarily see that any additional software support is needed for these formats, save perhaps easier downloadability, which could be added to the syntax-highlighting extension.
For binary formats (and perhaps complex XML-based formats), the following stand out as being of high significance:
* .blend as Blender's native export format and COLLADA as an open interchange format
* .xcf as Gimp's native format (preserving layers and other meta-information for bitmap images)
* .scribus as Scribus' native format (XML, but files can get very large + have dependencies)
* .odt, .odp, .od as OpenDocument formats
* potentially OpenEXR and some other open interchange formats.
As far as I understand the pure security (as opposed to content) concerns, these fall primarily into these categories:
* client-side execution of unsafe formats using designated applications (embedded macros, references to other malicious content etc.)
* exploitation of browser in-line display for purposes of XSS attacks or similar
Let me know if I'm missing a large category. I'm assuming server-side execution is not an issue for Wikimedia given correct server configuration.
Full security for these and other conceivably useful binary formats seems difficult to obtain to me (that is, making sure that nothing bad ever runs on a user's computer if they open a file). The restricted upload (or restricted attachment) approach builds on social trust to complement technical verification methods. We'd still have to invent some additional machinery to implement security warnings before ever exposing such files directly to the user.
Sacrificing easy individual file manageability, I wonder if it wouldn't be most straightforward to write a decent ZIP handler (with directory display, and thumbnailing of included images, for purposes of patrolling), to disallow ZIP files that contain non-whitelisted filetypes, and to use ZIPs as the container for all complex, free-format source uploads. [[File:Bla source.zip]] could then just be referenced as part of the file description pages where relevant. Because some of the aforementioned binary formats are effectively archives, some of this work would likely be necessary anyway.
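The whitelist check described above could look roughly like the following Python sketch. The set `ALLOWED_EXTENSIONS` and the function name `disallowed_members` are hypothetical, chosen only for illustration. Note that the check trusts the archive directory exactly as Python's `zipfile` module reads it, so it is subject to the parser-tolerance concern raised elsewhere in this thread.

```python
import io
import os
import zipfile

# Hypothetical whitelist; the real list would be a policy decision.
ALLOWED_EXTENSIONS = {".blend", ".xcf", ".png", ".txt"}


def disallowed_members(zip_bytes: bytes, allowed=ALLOWED_EXTENSIONS) -> list[str]:
    """Return the names of archive members whose extension is not whitelisted."""
    rejected = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no content
            extension = os.path.splitext(info.filename)[1].lower()
            if extension not in allowed:
                rejected.append(info.filename)
    return rejected
```

An upload would be refused whenever the returned list is non-empty; the same member listing could also drive the directory display used for patrolling.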
That said, I'm not wedded to any particular approach. I hope we can identify reasonably simple steps that we can take to significantly expand our support for source files in the near term, because such files are essential for re-use.
2010/11/29 Erik Moeller erik@wikimedia.org:
As far as I understand the pure security (as opposed to content) concerns, these fall primarily into these categories:
- client-side execution of unsafe formats using designated applications (embedded macros, references to other malicious content etc.)
- exploitation of browser in-line display for purposes of XSS attacks or similar
Let me know if I'm missing a large category. I'm assuming server-side execution is not an issue for Wikimedia given correct server configuration.
Server side execution is not an issue, no.
The client-side issues can all be reduced to a file acting as type A to MediaWiki and as type B to the victim, where A is some harmless file type we'd like to allow users to upload and B is some potentially dangerous file type. This is usually enabled by one or more of the following factors:
* IE second-guesses the server-provided MIME type in favor of its own brain-dead MIME type detection algorithm, which in particular is extremely eager to treat things as HTML (causing any embedded JS to be executed): the presence of certain HTML tags or tag-like strings in the first 255 bytes is sufficient reason for IE to call something HTML
* File formats are often interpreted flexibly, so a file that doesn't conform to the standard completely may be read just fine by most applications. These flexibilities allow for creating a file that looks like an A but also comes close enough to being a B. For example, running an HTML page containing unified diff text in the middle through patch(1) will usually work, because patch(1) discards "garbage" before and after the diff. These flexibilities are usually undocumented and vary between applications, so it can be difficult to predict whether a file qualifies as "almost a B"
* Some file formats are designed in such a way that a file can actually be a completely valid A *and* a completely valid B all at the same time. This is the case for most ZIP and ZIP-like formats
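To make the first factor concrete, here is a rough Python sketch of a pre-upload check that looks for tag-like strings in the first 255 bytes. The marker list `TAG_MARKERS` is an illustrative assumption, not the actual list IE uses, and the function name is hypothetical.

```python
SNIFF_WINDOW = 255  # number of leading bytes examined, per the description above

# Illustrative markers only; the real sniffing list is not given in this thread.
TAG_MARKERS = (b"<html", b"<head", b"<body", b"<script", b"<title", b"<img", b"<table")


def may_be_sniffed_as_html(data: bytes) -> bool:
    """Return True if the first SNIFF_WINDOW bytes contain a tag-like marker."""
    head = data[:SNIFF_WINDOW].lower()
    return any(marker in head for marker in TAG_MARKERS)
```

A file that starts with a valid GIF header but carries `<script>` shortly afterwards would be flagged, while the same string placed beyond the window would not.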
To illustrate the last sentence of the second bullet point, I'll quote Tim's blog post on upload security [1] (which is a fun read for anyone even mildly interested in the topic). It's part of the section on the GIFAR vulnerability, which involves a file that's a valid GIF or ZIP file, but which Java happily executes as a JAR (a ZIP-like format for executable Java bytecode) file because Java's JAR format validation is extremely lax, almost nonexistent. The only validation it does is check for a certain magic number at the end of the file, so rejecting
"An alternative [to rejecting all ZIP files] would be to parse the entire zip directory and to reject any archives that contain a file with a .class extension. I can’t vouch for this method. **If you did this, the zip library you used would have to be exactly as tolerant of zip format errors as the one used by Java.** It would probably be best to actually shell out to Java to do the test."
(emphasis mine)
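As a hedged illustration of the "magic number at the end of the file" point: the ZIP format ends with an end-of-central-directory record whose signature is the bytes `PK\x05\x06`, at least 22 bytes long and followed by a comment of at most 65535 bytes. The Python sketch below (function names are hypothetical) flags a file that opens with a GIF header and also carries that trailer. It only detects the signature; it says nothing about how tolerant any particular JVM is.

```python
EOCD_SIGNATURE = b"PK\x05\x06"  # ZIP end-of-central-directory signature
EOCD_MIN_SIZE = 22              # size of the record without a comment
MAX_COMMENT = 65535             # maximum length of the trailing comment


def has_zip_trailer(data: bytes) -> bool:
    """Return True if the signature occurs where a ZIP trailer could sit."""
    tail = data[-(EOCD_MIN_SIZE + MAX_COMMENT):]
    return tail.rfind(EOCD_SIGNATURE) != -1


def is_gif_zip_polyglot(data: bytes) -> bool:
    """Return True for data that starts like a GIF and ends like a ZIP."""
    return data[:6] in (b"GIF87a", b"GIF89a") and has_zip_trailer(data)
```

Since the signature is only four bytes, ordinary image data could contain it by chance, so a check like this would produce some false positives.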
Roan Kattouw (Catrope)
Roan Kattouw wrote:
"An alternative [to rejecting all ZIP files] would be to parse the entire zip directory and to reject any archives that contain a file with a .class extension. I can’t vouch for this method. **If you did this, the zip library you used would have to be exactly as tolerant of zip format errors as the one used by Java.** It would probably be best to actually shell out to Java to do the test."
(emphasis mine)
If we consider the performance of parsing full zip files acceptable (as opposed to just 512 bytes or the central directory), we can quite easily accept many zip files.
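One way to picture the difference between reading only the central directory and parsing the full file is the Python sketch below. It compares the number of entries listed in the central directory (as Python's `zipfile` reads it) with the number of local file header signatures (`PK\x03\x04`) found by scanning every byte. The function name `entry_counts` is hypothetical, and the raw scan is a heuristic: stored or compressed data may contain the signature by coincidence.

```python
import io
import zipfile

LOCAL_HEADER_SIGNATURE = b"PK\x03\x04"  # marks the start of each stored entry


def entry_counts(data: bytes) -> tuple[int, int]:
    """Return (entries listed in the central directory, local headers found)."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        listed = len(archive.infolist())
    scanned = data.count(LOCAL_HEADER_SIGNATURE)  # full scan of the file
    return listed, scanned
```

If the two numbers differ, the file contains entry-like data that the directory does not mention, for example when one archive has been prepended to another, and a stricter policy could reject it.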
There's also the issue of the jar: protocol, but that seems to have been fixed as of Firefox 2.0.0.10, so it's probably not worth taking into account. http://kb.mozillazine.org/Network.jar.open-unsafe-types
On Mon, Nov 29, 2010 at 9:29 PM, Roan Kattouw roan.kattouw@gmail.com wrote:
"An alternative [to rejecting all ZIP files] would be to parse the entire zip directory and to reject any archives that contain a file with a .class extension. I can’t vouch for this method. **If you did this, the zip library you used would have to be exactly as tolerant of zip format errors as the one used by Java.** It would probably be best to actually shell out to Java to do the test."
I was thinking about this. There appears to be no option to the java command line client to only check a file without executing. An option would be to invoke the java debugger (jdb), which initially breaks at the first instruction and presumably fails if the file is not a valid jar. Still sounds nasty though, plus the fact that jdb is not a generally installed program.
Bryan
Bryan Tong Minh wrote:
On Mon, Nov 29, 2010 at 9:29 PM, Roan Kattouw roan.kattouw@gmail.com wrote:
"An alternative [to rejecting all ZIP files] would be to parse the entire zip directory and to reject any archives that contain a file with a .class extension. I can’t vouch for this method. **If you did this, the zip library you used would have to be exactly as tolerant of zip format errors as the one used by Java.** It would probably be best to actually shell out to Java to do the test."
I was thinking about this. There appears to be no option to the java command line client to only check a file without executing. An option would be to invoke the java debugger (jdb), which initially breaks at the first instruction and presumably fails if the file is not a valid jar. Still sounds nasty though, plus the fact that jdb is not a generally installed program.
Bryan
Note that you can't simply check (or reverse-engineer) that JVM X doesn't treat it as a jar, since it could be detected as one in X-1 or X+1. So you would have to test against a range of JVMs that are still in use.
On Mon, Nov 29, 2010 at 11:10 PM, Platonides Platonides@gmail.com wrote:
Note that you can't simply check (or reverse-engineer) that JVM X doesn't treat it as a jar, since it could be detected in X-1 or X+1. So there should be a range of still in use JVMs to assert.
I run my own IT support company, and I've seen both private and company clients running three-year-old Java and Flash versions. Of course the machines had a load of malware on them (which was the reason I got called). The problem is that a lot of users out there are confused by the update messages, or by Windows UAC launching with every update: they get a LOT of lookalike messages from sites like kino.to and can no longer tell what is real and what is not. Securing against only the "most in use" JVM/PDF/Flash/whatever version is pointless, as you have to cover around three years of version history, if not more. For OpenOffice clients it's even worse, as some companies introduce their own private patch sets. I haven't seen this myself so far, but I've never been at really big companies where it is actually likely to happen.
Marco
On Mon, Nov 29, 2010 at 11:10 PM, Platonides Platonides@gmail.com wrote:
Bryan Tong Minh wrote:
On Mon, Nov 29, 2010 at 9:29 PM, Roan Kattouw roan.kattouw@gmail.com wrote:
"An alternative [to rejecting all ZIP files] would be to parse the entire zip directory and to reject any archives that contain a file with a .class extension. I can’t vouch for this method. **If you did this, the zip library you used would have to be exactly as tolerant of zip format errors as the one used by Java.** It would probably be best to actually shell out to Java to do the test."
I was thinking about this. There appears to be no option to the java command line client to only check a file without executing. An option would be to invoke the java debugger (jdb), which initially breaks at the first instruction and presumably fails if the file is not a valid jar. Still sounds nasty though, plus the fact that jdb is not a generally installed program.
Bryan
Note that you can't simply check (or reverse-engineer) that JVM X doesn't treat it as a jar, since it could be detected in X-1 or X+1. So there should be a range of still in use JVMs to assert.
I think that the most recent version should be sufficient. I don't think Java would break backwards compatibility: users wouldn't be happy if their old jar suddenly stops working on a new JVM.
Bryan
* Bryan Tong Minh bryan.tongminh@gmail.com [Tue, 30 Nov 2010 08:44:43 +0100]:
I think that the most recent version should be sufficient. I don't think Java would break backwards compatibility: users wouldn't be happy if their old jar suddenly stops working on a new JVM.
Why an outdated and inefficient ZIP format, after all? 7zip is incompatible with the JVM; wouldn't it be a better choice for archive uploads? Or is it too hard to parse on the PHP side (I guess console exec is required)? Dmitriy
On Tue, Nov 30, 2010 at 8:48 AM, Dmitriy Sintsov questpc@rambler.ru wrote:
* Bryan Tong Minh bryan.tongminh@gmail.com [Tue, 30 Nov 2010 08:44:43 +0100]:
I think that the most recent version should be sufficient. I don't think Java would break backwards compatibility: users wouldn't be happy if their old jar suddenly stops working on a new JVM.
Why an outdated and inefficient ZIP format, after all? 7zip is incompatible with the JVM; wouldn't it be a better choice for archive uploads? Or is it too hard to parse on the PHP side (I guess console exec is required)?
You can create a zip easily on all major OSes with drag'n'drop. Windows supports it, IIRC, from Win 98 SE and up; a standard Linux supports it through the tools the desktop installs (for KDE, it once was Ark); and MacOS also delivers ZIP out of the box. For ZIP, there are even built-in PHP functions to handle it. 7zip, though open source, requires third-party plugins, both for the OS and for servers, and is not really widespread. RAR and ZIP are the dominant formats in cross-platform data exchange.
Marco
* Marco Schuster marco@harddisk.is-a-geek.org [Tue, 30 Nov 2010 11:05:09 +0100]:
You can create a zip easily on all major OSes with drag'n'drop. Windows supports it, IIRC, from Win 98 SE and up; a standard Linux supports it through the tools the desktop installs (for KDE, it once was Ark); and MacOS also delivers ZIP out of the box. For ZIP, there are even built-in PHP functions to handle it. 7zip, though open source, requires third-party plugins, both for the OS and for servers, and is not really widespread. RAR and ZIP are the dominant formats in cross-platform data exchange.
There is a console version, which could be executed on the server side to list or analyze the contents of an archive: http://sourceforge.net/projects/p7zip/ MediaWiki already relies on running external executables such as convert (ImageMagick) and texvc. I should admit that using ImageMagick for image resampling is faster, takes less RAM and gives better results than PHP's built-in image handling modules (although ImageMagick should also be available as a PHP module, though not everywhere, and it increases the footprint a little). Dmitriy
* Marco Schuster marco@harddisk.is-a-geek.org [Tue, 30 Nov 2010 11:05:09 +0100]:
You can create a zip easily on all major OSes with drag'n'drop. Windows supports it, IIRC, from Win 98 SE and up; a standard Linux supports it through the tools the desktop installs (for KDE, it once was Ark); and MacOS also delivers ZIP out of the box. For ZIP, there are even built-in PHP functions to handle it. 7zip, though open source, requires third-party plugins, both for the OS and for servers, and is not really widespread. RAR and ZIP are the dominant formats in cross-platform data exchange.
Also, I remember seeing 7z streams already implemented somewhere in 1.17 recently (probably with external piping).
2010/11/30 Dmitriy Sintsov questpc@rambler.ru:
* Bryan Tong Minh bryan.tongminh@gmail.com [Tue, 30 Nov 2010 08:44:43 +0100]:
I think that the most recent version should be sufficient. I don't think Java would break backwards compatibility: users wouldn't be happy if their old jar suddenly stops working on a new JVM.
Why an outdated and inefficient ZIP format, after all? 7zip is incompatible with the JVM; wouldn't it be a better choice for archive uploads? Or is it too hard to parse on the PHP side (I guess console exec is required)?
We don't necessarily want ZIP uploads at Wikimedia, but it's not unreasonable to want to upload OpenOffice documents. Since the OO formats are ZIP-like, blocking ZIPs blocks those too.
Roan Kattouw (Catrope)
On Tue, Nov 30, 2010 at 9:40 PM, Roan Kattouw roan.kattouw@gmail.com wrote:
We don't necessarily want ZIP uploads at Wikimedia, but it's not unreasonable to want to upload OpenOffice documents. Since the OO formats are ZIP-like, blocking ZIPs blocks those too.
Roan Kattouw (Catrope)
These features, should they get implemented, would probably be wanted by more than just the WMF, though, so we shouldn't reduce the discussion of features like this to a yes or no based on what the Foundation may or may not want. -Peachey
Bryan Tong Minh wrote:
Note that you can't simply check (or reverse-engineer) that JVM X doesn't treat it as a jar, since it could be detected in X-1 or X+1. So there should be a range of still in use JVMs to assert.
I think that the most recent version should be sufficient. I don't think Java would break backwards compatibility: users wouldn't be happy if their old jar suddenly stops working on a new JVM.
Bryan
Have you seen Conficker's autorun.inf? It's purposefully made to look like garbage. It's full of NULs, contains non-printable characters, keys with mixed case...
That's a perfect example of where a change would cause no backwards compatibility issues. Legitimate autorun.inf files, written as plain ini files, won't break if "icon\01\15=" is no longer recognised as the "icon" key.
Good jars wouldn't be affected if, e.g., Java 3 accepted central directories pointing anywhere and they are now required to point to a zip entry.