On 03/10/2013 12:19 AM, Victor Vasiliev wrote:
> During the recent discussion on this list I realized that this has been under discussion for as long as four years, so I went WTF and decided to Just Go Ahead and Fix It. As a result, I made a patch to MediaWiki which allows it to output the recent changes feed in JSON: https://gerrit.wikimedia.org/r/#/c/52922/
> Also, I wrote a daemon which captures this feed and serves it through WebSockets and a simple text-oriented protocol [...]: https://github.com/wikimedia/mediawiki-rcsub
> This daemon is written in Python using Twisted and Autobahn, and it takes ~200 lines of code (the initial version took ~80).
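As a rough illustration of what such a daemon looks like, here is a minimal sketch of the WebSocket broadcast pattern using Twisted and Autobahn's modern import path. The class names are mine, and this is not the actual mediawiki-rcsub code:

```python
# Minimal sketch: every connected WebSocket client receives each
# recent-changes event as a JSON text frame. Illustrative only;
# not the actual mediawiki-rcsub code.
import json

from autobahn.twisted.websocket import (
    WebSocketServerFactory,
    WebSocketServerProtocol,
)
from twisted.internet import reactor


class RCFeedProtocol(WebSocketServerProtocol):
    def onOpen(self):
        self.factory.clients.append(self)

    def onClose(self, wasClean, code, reason):
        if self in self.factory.clients:
            self.factory.clients.remove(self)


class RCFeedFactory(WebSocketServerFactory):
    protocol = RCFeedProtocol

    def __init__(self, *args, **kwargs):
        WebSocketServerFactory.__init__(self, *args, **kwargs)
        self.clients = []

    def broadcast(self, change):
        # In the real daemon, the feed arriving from MediaWiki would
        # drive calls to this method with one dict per change event.
        payload = json.dumps(change, ensure_ascii=False).encode("utf-8")
        for client in self.clients:
            client.sendMessage(payload, isBinary=False)


factory = RCFeedFactory("ws://127.0.0.1:9000")
reactor.listenTCP(9000, factory)
reactor.run()
```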
One thing you should consider is whether to escape non-ASCII characters (those above U+007F) or to emit them as raw UTF-8.
Python's json.dumps() escapes them by default (ensure_ascii=True), turning each one into hex-encoded UTF-16 code units. If you don't want them escaped, it's best to decide now, before clients with broken UTF-8 support come into use.
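A quick illustration of the difference, using nothing but the standard library:

```python
import json

summary = "撤销"  # two CJK characters, 6 bytes in UTF-8

print(json.dumps(summary))                      # "\u64a4\u9500"
print(json.dumps(summary, ensure_ascii=False))  # "撤销"
```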
I recently made a [patch][1] (not yet merged) that would add an opt-in "UTF8_OK" option to FormatJson::encode(). The new option would unescape everything above U+007F (except for U+2028 and U+2029, for compatibility with JavaScript eval()-based parsing).
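The patch itself is PHP, but the intended behavior is easy to state in Python terms. A sketch under that assumption (the function name is mine, not FormatJson's):

```python
import json


def encode_utf8_ok(value):
    # Emit raw UTF-8 for everything above U+007F, but keep U+2028
    # (LINE SEPARATOR) and U+2029 (PARAGRAPH SEPARATOR) escaped:
    # both are legal in JSON strings but were illegal in JavaScript
    # string literals until ES2019, so they break eval()-based parsers.
    text = json.dumps(value, ensure_ascii=False)
    return text.replace("\u2028", "\\u2028").replace("\u2029", "\\u2029")
```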
> I hope that getting recent changes in a reasonable format is now just a matter of code review and deployment, and that we will finally get something reasonable to work with (with access from web browsers!).
I don't consider encoding "撤销由158.64.77.102于2013年1月22日 (二) 16:46的版本24659468中的繁简破坏" (a Chinese edit summary, roughly "Undo traditional/simplified vandalism in revision 24659468 by 158.64.77.102 at 16:46, 22 January 2013 (Tue)"; 90 bytes as raw UTF-8) as
"\u64a4\u9500\u7531158.64.77.102\u4e8e2013\u5e741\u670822\u65e5 (\u4e8c) 16:46\u7684\u7248\u672c24659468\u4e2d\u7684\u7e41\u7b80\u7834\u574f" (141 bytes)
to be reasonable at all for a brand-new protocol running over an 8-bit clean channel.
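The arithmetic is easy to check, assuming json.dumps() as the encoder (both counts include the surrounding JSON quotes):

```python
import json

summary = ("撤销由158.64.77.102于2013年1月22日 (二) 16:46"
           "的版本24659468中的繁简破坏")

raw = json.dumps(summary, ensure_ascii=False).encode("utf-8")
escaped = json.dumps(summary).encode("utf-8")

print(len(raw))      # 90  -- raw UTF-8
print(len(escaped))  # 141 -- each CJK character inflates to \uXXXX
```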
[1]: https://gerrit.wikimedia.org/r/#/c/50140/