On 03/10/2013 06:27 PM, Victor Vasiliev wrote:
> On 03/10/2013 06:30 AM, Kevin Israel wrote:
>> On 03/10/2013 12:19 AM, Victor Vasiliev wrote:
>> One thing you should consider is whether to escape non-ASCII
>> characters (characters above U+007F) or to encode them using UTF-8.
> "Whatever the JSON encoder we use does".
>> Python's json.dumps() escapes these characters by default
>> (ensure_ascii = True). If you don't want them escaped (as hex-encoded
>> UTF-16 code units), it's best to decide now, before clients with
>> broken UTF-8 support come into use.
> As long as it does not add newlines, this is perfectly fine protocol-wise.
If "Whatever the JSON encoder we use does" means that the daemon might
one day start sending raw UTF-8-encoded characters, it is quite possible
that existing clients will break because of previously unnoticed
encoding bugs. So I would like to see some formal documentation of the
protocol.
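For reference, here is a minimal sketch of the json.dumps() behavior in
question, using only the standard library:

```python
import json

s = "caf\u00e9 \U0001F600"  # "café" plus an astral-plane character

# Default (ensure_ascii=True): everything above U+007F is escaped as
# \uXXXX hex sequences; astral characters become UTF-16 surrogate pairs.
escaped = json.dumps(s)
print(escaped)  # "caf\u00e9 \ud83d\ude00"

# ensure_ascii=False: the characters are emitted as-is, so the bytes on
# the wire depend on the chosen output encoding (UTF-8 here).
raw = json.dumps(s, ensure_ascii=False).encode("utf-8")
print(raw)
```

A client with broken UTF-8 handling may work fine against the first form
and fail against the second, which is why the choice should be pinned
down in the protocol documentation.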
>> I recently made a [patch][1] (not yet merged) that would add an
>> opt-in "UTF8_OK" feature to FormatJson::encode(). The new option
>> would unescape everything above U+007F (except for U+2028 and
>> U+2029, for compatibility with JavaScript eval() based parsing).
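For context on the U+2028/U+2029 exception: both characters are legal
unescaped inside JSON strings, but they are line terminators in
JavaScript, so eval()-based parsers choke on them. The escaping step can
be illustrated like this (my own sketch, not the patch itself):

```python
import json

def js_safe(value):
    # Encode without ASCII-escaping, then restore the two escapes that
    # eval()-based JavaScript parsers still require.
    return (json.dumps(value, ensure_ascii=False)
            .replace("\u2028", "\\u2028")
            .replace("\u2029", "\\u2029"))

out = js_safe("a\u2028b")
# The output contains no literal U+2028, yet still parses back to the
# same string with any conforming JSON decoder.
```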
> The part between MediaWiki and the daemon does not matter that much
> (except for hitting the size limit on packets, and even then we are on
> WMF's internal network, so we should not expect any packet loss or
> fragmentation problems). The daemon extracts the wiki name from the
> JSON it receives, so it re-encodes the change anyway in the middle.
It's good to know that the format of the internal UDP packets can be
changed quite easily without breaking existing clients -- that it's
possible to start using UTF-8 on the UDP side if necessary.
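The decode/re-encode step that makes this possible can be sketched as
follows (the field names here are hypothetical, not the actual packet
schema):

```python
import json

# Hypothetical internal UDP packet; the real schema may differ.
packet = '{"wiki": "enwiki", "title": "Caf\\u00e9"}'

# The daemon parses the packet to extract the wiki name...
change = json.loads(packet)
wiki = change["wiki"]

# ...and re-serializes the change for clients, so the escaping style of
# the internal packets never reaches clients directly.
rebroadcast = json.dumps(change, ensure_ascii=False)
```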
--
Wikipedia user PleaseStand
http://en.wikipedia.org/wiki/User:PleaseStand