Just curious, are formats json, jsonfm, and rawfm not supposed to pass UTF-8 unescaped, and format wddx supposed to turn it into some other format?
for f in json jsonfm php phpfm wddx wddxfm xml xmlfm yaml yamlfm rawfm txt txtfm dbg dbgfm
do
    printf "$f\t"
    GET -P "http://radioscanningtw.jidanni.org/api.php?action=query&list=allpages&am... perl -pwe 's/[[:ascii:]]//g'|wc -c
done|sort -k 2
json    0   #no UTF-8, all got escaped
jsonfm  0
rawfm   0
dbg     24  #good
dbgfm   24
php     24
phpfm   24
txt     24
txtfm   24
wddxfm  24
xml     24
xmlfm   24
yaml    24
yamlfm  24
wddx    48  #weird, mangled?
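The perl filter in the loop deletes every ASCII character, so wc -c counts only the bytes that are left. A rough Python equivalent of that measurement, as a sketch only: the function name count_non_ascii_bytes and the sample title are illustrative and are not taken from the API output.

```python
import json


def count_non_ascii_bytes(payload: bytes) -> int:
    """Count bytes outside the 7-bit ASCII range (0x00-0x7F)."""
    return sum(1 for b in payload if b >= 0x80)


# Hypothetical response body: one title made of three CJK characters.
data = {"title": "\u7121\u7dda\u96fb"}

# Passed through as raw UTF-8, each of these characters takes 3 bytes.
raw = json.dumps(data, ensure_ascii=False).encode("utf-8")
# With \uXXXX escapes, the body is pure ASCII.
escaped = json.dumps(data, ensure_ascii=True).encode("ascii")

print(count_non_ascii_bytes(raw))      # 9
print(count_non_ascii_bytes(escaped))  # 0
```

A count of 0 therefore means "everything was escaped", not "the data is missing".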
On Mon, Mar 23, 2009 at 09:31:47AM +0800, jidanni@jidanni.org wrote:
> Just curious, are formats json, jsonfm, and rawfm not supposed to pass UTF-8 unescaped, and format wddx supposed to turn it into some other format?
The PHP json_encode function and the fallback code (in case PHP's json_encode is unavailable or buggy) both automatically escape non-ASCII unicode characters. This does make the response slightly larger, but doesn't hurt anything and can help clients with b0rken utf8 support.
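To illustrate why the escaping is harmless, here is a small sketch using Python's json module as a stand-in for PHP's json_encode (an assumption for illustration; the two encoders are separate implementations, but both default to \uXXXX escapes for non-ASCII characters).

```python
import json

title = "caf\u00e9"  # one non-ASCII character, U+00E9

escaped = json.dumps(title)                        # ensure_ascii=True is the default
unescaped = json.dumps(title, ensure_ascii=False)  # raw UTF-8 passes through

print(escaped)    # "caf\u00e9"
print(unescaped)  # "café"

# Both forms decode to the same string, so a conforming client sees no difference.
assert json.loads(escaped) == json.loads(unescaped) == title

# The escaped form is pure ASCII but longer on the wire.
print(len(escaped.encode("utf-8")), len(unescaped.encode("utf-8")))  # 11 7
```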
PHP's wddx_serialize_value seems to have a bug in some 5.2.x and 5.3.x versions of PHP: it treats the input as iso-8859-1 and converts it to utf8, so actual utf8 input gets "double"-utf8-encoded.[1] See http://bugs.php.net/bug.php?id=45314 for more info. I guess a bugcheck (like the one we added for JSON and PHP bug 46944) would be in order for the WDDX formatter.
Speaking of which, I just noticed that our JSON bugcheck is wrong: the correct output is '"\ud840\udc00"', not '\ud840\udc00'.
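The difference between the two strings is only the surrounding double quotes that any JSON encoder puts around a string value. A sketch in Python, assuming the check encodes the single character U+20000 (the code point that the escapes \ud840\udc00 correspond to); Python's json module stands in for PHP's encoder here.

```python
import json

ch = "\U00020000"  # a code point outside the Basic Multilingual Plane

# Split the code point into a UTF-16 surrogate pair by hand.
offset = ord(ch) - 0x10000
high = 0xD800 + (offset >> 10)
low = 0xDC00 + (offset & 0x3FF)
print(f"\\u{high:04x}\\u{low:04x}")  # \ud840\udc00

# A JSON encoder emits the same escapes, wrapped in double quotes.
print(json.dumps(ch))                # "\ud840\udc00"
```

A check that compares against the unquoted form can never match a correct encoder's output.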
[1] For example, the character U+00A0 is represented by the two bytes c2 a0 in utf8. The buggy wddx_serialize_value interprets that as two iso-8859-1 characters, and converts each to utf8: c3 82 c2 a0
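The byte sequence in the footnote can be reproduced with a few lines of Python. This is a sketch that mimics the described behaviour (misreading UTF-8 as iso-8859-1, then encoding to UTF-8); the helper name double_encode is made up for illustration and is not how wddx_serialize_value is implemented.

```python
def double_encode(utf8_bytes: bytes) -> bytes:
    """Mimic the bug: misread UTF-8 bytes as ISO-8859-1, then encode to UTF-8."""
    return utf8_bytes.decode("iso-8859-1").encode("utf-8")


nbsp = "\u00a0".encode("utf-8")
print(nbsp.hex(" "))                 # c2 a0
print(double_encode(nbsp).hex(" "))  # c3 82 c2 a0

# Every byte >= 0x80 turns into two bytes, so the non-ASCII byte count doubles.
sample = "\u4e2d\u6587".encode("utf-8")  # two 3-byte characters, 6 bytes
print(len(sample), len(double_encode(sample)))  # 6 12

# The damage is reversible as long as nothing else touched the bytes.
assert double_encode(nbsp).decode("utf-8").encode("iso-8859-1") == nbsp
```

That doubling would be consistent with wddx reporting 48 non-ASCII bytes where the other formats report 24.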
Brad Jorsch wrote:
> On Mon, Mar 23, 2009 at 09:31:47AM +0800, jidanni@jidanni.org wrote:
>> Just curious, are formats json, jsonfm, and rawfm not supposed to pass UTF-8 unescaped, and format wddx supposed to turn it into some other format?
> The PHP json_encode function and the fallback code (in case PHP's json_encode is unavailable or buggy) both automatically escape non-ASCII unicode characters. This does make the response slightly larger, but doesn't hurt anything and can help clients with b0rken utf8 support.
> PHP's wddx_serialize_value seems to have a bug in some 5.2.x and 5.3.x versions of PHP: it treats the input as iso-8859-1 and converts it to utf8, so actual utf8 input gets "double"-utf8-encoded.[1] See http://bugs.php.net/bug.php?id=45314 for more info. I guess a bugcheck (like the one we added for JSON and PHP bug 46944) would be in order for the WDDX formatter.
Added one in r48713.
> Speaking of which, I just noticed that our JSON bugcheck is wrong: the correct output is '"\ud840\udc00"', not '\ud840\udc00'.
Yeah, that's stupid, good catch. Fortunately, this typo didn't cause any misformatting: the check never passed, so the code never used PHP's formatter.
> [1] For example, the character U+00A0 is represented by the two bytes c2 a0 in utf8. The buggy wddx_serialize_value interprets that as two iso-8859-1 characters, and converts each to utf8: c3 82 c2 a0
I used this exact example to test the WDDX formatter.
Roan Kattouw (Catrope)
mediawiki-api@lists.wikimedia.org