https://bugzilla.wikimedia.org/show_bug.cgi?id=73661
Bug ID: 73661
Summary: Uploads don't allow non-ASCII characters in filename
Product: Pywikibot
Version: core-(2.0)
Hardware: All
OS: All
Status: NEW
Severity: normal
Priority: Unprioritized
Component: General
Assignee: Pywikipedia-bugs@lists.wikimedia.org
Reporter: CommodoreFabianus@gmx.de
Web browser: ---
Mobile Platform: ---
Depending on the version used, either the original (local) file name or the target page name on the wiki may not contain non-ASCII characters. This was changed in Ib751ee3f4074a60f3b53b0afe3cc2dfc3e17b2f7 in pwb 2.0, so versions prior to that change don't work with non-ASCII local filenames and versions including it don't work with non-ASCII wiki page names.
The problem is simply the 'filename' value in the header of the file/chunk entry (not to be confused with the 'filename' entry in the MIME request) when it contains non-ASCII characters. For example:
Content-Type: image/jpeg
MIME-Version: 1.0
Content-disposition: form-data; name="file"; filename*=utf-8''%C3%9C.jpg
Content-Transfer-Encoding: binary
[… binary data …]
This would be the RFC 2231 compliant encoding of a non-ASCII character, which Python 3 uses by default. Python 2 instead applies a strange encoding to the complete line (this may not represent exactly the same text as above, but something similar):
Content-disposition: =?utf-8?b?Zm9ybS1kYXRhOyBuYW1lPSJmaWxlIjsgZmlsZW5hbWU9?= =?utf-8?b?IsOcMi5qcGci?=
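To make the two encodings easier to compare, here is a minimal sketch using only the standard library `email` package in Python 3. The helper names `rfc2231_disposition` and `decode_rfc2047` are made up for illustration and are not part of Pywikibot; the second helper just decodes the Python 2 style line quoted above to show what text it carries.

```python
from email.header import decode_header
from email.utils import encode_rfc2231


def rfc2231_disposition(filename):
    """Build a form-data Content-Disposition value with an RFC 2231 filename."""
    # encode_rfc2231 percent-encodes the UTF-8 bytes and prefixes "utf-8''".
    return 'form-data; name="file"; filename*=' + encode_rfc2231(filename, 'utf-8')


def decode_rfc2047(value):
    """Decode a header value made of RFC 2047 encoded words into text."""
    parts = []
    for chunk, charset in decode_header(value):
        # decode_header yields bytes for encoded words and str for plain text.
        if isinstance(chunk, bytes):
            chunk = chunk.decode(charset or 'ascii')
        parts.append(chunk)
    return ''.join(parts)


# The Python 2 style value quoted above (whole line wrapped in encoded words).
python2_value = ('=?utf-8?b?Zm9ybS1kYXRhOyBuYW1lPSJmaWxlIjsgZmlsZW5hbWU9?= '
                 '=?utf-8?b?IsOcMi5qcGci?=')

python3_style = rfc2231_disposition('\u00dc.jpg')
python2_decoded = decode_rfc2047(python2_value)
```

Here `python3_style` reproduces the `filename*=utf-8''%C3%9C.jpg` form from the first example, while `python2_decoded` shows that the second form wraps the whole `form-data; name="file"; filename=...` value, not just the filename.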
Neither the RFC 2231 form nor the Python 2 form is accepted by the MediaWiki server. The RFC 2231 form is answered with:
badupload_file: File upload param file is not a file upload; be sure to use multipart/form-data for your POST and include a filename in the Content-Disposition header.
Or, with Python 2:
missingparam: One of the parameters filekey, file, url, statuskey is required
It is possible to leave it UTF-8 encoded, although that is (afaics) not compliant with the MIME-related RFCs, which say that headers may only contain US-ASCII characters.
Unfortunately I'm not sure what MediaWiki does with this, so I don't know if there is a better way, especially as Python 3 doesn't support 'bytes' in the header and otherwise it's not possible to get the value in there without it being re-encoded.
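For reference, leaving the filename UTF-8 encoded would mean building the part header as bytes by hand instead of going through the `email` package. This is only a rough sketch of that idea: `raw_utf8_disposition` is a hypothetical helper, not an existing Pywikibot function, and it does not escape quotes or backslashes in the filename.

```python
def raw_utf8_disposition(filename):
    """Return the Content-Disposition header line as bytes, filename in raw UTF-8."""
    # Hypothetical helper: the filename bytes are inserted as-is, so the
    # resulting header is not pure US-ASCII.
    return (b'Content-Disposition: form-data; name="file"; filename="'
            + filename.encode('utf-8') + b'"')


header_line = raw_utf8_disposition('\u00dc.jpg')
```

The result would have to be written into the multipart body directly, since (as noted above) Python 3 header objects don't take 'bytes'.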