Hi everybody,
For a long time it has been acknowledged that our current way of serving the recent changes feed to users (IRC, with formatting using funny control codes) is one of the worst-suited for this purpose. It made life miserable both for the users who had to parse it (since nobody actually reads it from IRC) and for the developers who had to fit that thing into the IRC line length limit. Time passed, and many ways were suggested to fix this (including https://meta.wikimedia.org/wiki/Recentchanges_via_XMPP and https://www.mediawiki.org/wiki/Requests_for_comment/Structured_data_push_notification_support_for_recent_changes), but nobody actually went ahead and made it work.
After a recent discussion on this list, I realized that this has been under discussion for as long as four years, went "WTF", and decided to Just Go Ahead and Fix It. As a result, I made a patch to MediaWiki which allows it to output the recent changes feed in JSON: https://gerrit.wikimedia.org/r/#/c/52922/
Also, I wrote a daemon which captures this feed and serves it through WebSockets, as well as through a simple text-oriented protocol which serves the same JSON without the WebSocket wrapping (for the poor souls writing in languages without proper WebSocket support): https://github.com/wikimedia/mediawiki-rcsub
The daemon is written in Python using Twisted and Autobahn, and it takes ~200 lines of code (the initial version took ~80).
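To give a taste of the text-oriented protocol, a client can be as simple as this (a minimal sketch: the host, port, subscribe command, and JSON field names below are made up for illustration, not the daemon's documented interface; all it assumes is one JSON object per line):

import json
import socket

HOST, PORT = "rcsub.example.org", 9742  # placeholder endpoint

with socket.create_connection((HOST, PORT)) as conn:
    # Hypothetical subscribe command; see the mediawiki-rcsub README
    # for the real syntax.
    conn.sendall(b'{"action": "subscribe", "wiki": "enwiki"}\n')
    buf = b""
    while True:
        chunk = conn.recv(4096)
        if not chunk:
            break  # server closed the connection
        buf += chunk
        while b"\n" in buf:
            line, buf = buf.split(b"\n", 1)
            change = json.loads(line.decode("utf-8"))
            print(change.get("title"), change.get("user"))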
As a bonus, this involves no XML streaming in any form (unlike XMPP or PubSubHubbub), so the unicorns are happy and unharmed, and the minds of the programmers implementing this will remain unfried.
I hope that getting recent changes in a reasonable format is now just a matter of code review and deployment, and that we will finally have something reasonable to work with (with access from web browsers!).
-- Victor.
On Sun, Mar 10, 2013 at 1:19 PM, Victor Vasiliev vasilvv@gmail.com wrote:
For a long time it has been acknowledged that our current way of serving the recent changes feed to users (IRC, with formatting using funny control codes) is one of the worst-suited for this purpose. It made life miserable both for the users who had to parse it (since nobody actually reads it from IRC) [...]
Note that some people *are* actually reading it from IRC. For example, I read it when I'm starting a bot run, if the bot was made quickly and doesn't output much info.
-Liangent
On 03/10/2013 12:19 AM, Victor Vasiliev wrote:
After a recent discussion on this list, I realized that this has been under discussion for as long as four years, went "WTF", and decided to Just Go Ahead and Fix It. As a result, I made a patch to MediaWiki which allows it to output the recent changes feed in JSON: https://gerrit.wikimedia.org/r/#/c/52922/
Also, I wrote a daemon which captures this feed and serves it through WebSockets, as well as through a simple text-oriented protocol [...]: https://github.com/wikimedia/mediawiki-rcsub
The daemon is written in Python using Twisted and Autobahn, and it takes ~200 lines of code (the initial version took ~80).
One thing you should consider is whether to escape non-ASCII characters (characters above U+007F) or to encode them using UTF-8.
Python's json.dumps() escapes these characters by default (ensure_ascii=True). If you don't want them escaped (as hex-encoded UTF-16 code units), it's best to decide now, before clients with broken UTF-8 support come into use.
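To make the difference concrete, here is a quick illustration with Python's json module (the sample string is the edit summary I use as an example below):

import json

# "Undo the trad./simp. vandalism in revision 24659468 by
# 158.64.77.102 of 2013-01-22 (Tue) 16:46"
summary = "撤销由158.64.77.102于2013年1月22日 (二) 16:46的版本24659468中的繁简破坏"

escaped = json.dumps(summary)                  # default: ensure_ascii=True
raw = json.dumps(summary, ensure_ascii=False)  # emit the characters as-is

print(len(escaped.encode("utf-8")))  # every CJK character costs 6 bytes (\uXXXX)
print(len(raw.encode("utf-8")))      # every CJK character costs 3 bytes in UTF-8
print(json.loads(escaped) == json.loads(raw))  # True: both decode to the same string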
I recently made a [patch][1] (not yet merged) that would add an opt-in "UTF8_OK" feature to FormatJson::encode(). The new option would unescape everything above U+007F (except for U+2028 and U+2029, for compatibility with JavaScript eval() based parsing).
I hope that getting recent changes in a reasonable format is now just a matter of code review and deployment, and that we will finally have something reasonable to work with (with access from web browsers!).
I don't consider encoding "撤销由158.64.77.102于2013年1月22日 (二) 16:46的版本24659468中的繁简破坏" (90 bytes using UTF-8) as
"\u64a4\u9500\u7531158.64.77.102\u4e8e2013\u5e741\u670822\u65e5 (\u4e8c) 16:46\u7684\u7248\u672c24659468\u4e2d\u7684\u7e41\u7b80\u7834\u574f" (141 bytes)
to be reasonable at all for a brand-new protocol running over an 8-bit clean channel.
[1]: https://gerrit.wikimedia.org/r/#/c/50140/
On 03/10/2013 06:30 AM, Kevin Israel wrote:
One thing you should consider is whether to escape non-ASCII characters (characters above U+007F) or to encode them using UTF-8.
"Whatever the JSON encoder we use does".
Python's json.dumps() escapes these characters by default (ensure_ascii=True). If you don't want them escaped (as hex-encoded UTF-16 code units), it's best to decide now, before clients with broken UTF-8 support come into use.
As long as it does not add newlines, this is perfectly fine protocol-wise.
I recently made a [patch][1] (not yet merged) that would add an opt-in "UTF8_OK" feature to FormatJson::encode(). The new option would unescape everything above U+007F (except for U+2028 and U+2029, for compatibility with JavaScript eval() based parsing).
The part between MediaWiki and the daemon does not matter that much (except for hitting the size limit on packets, and even then we are on WMF's internal network, so we should not expect packet loss or fragmentation problems). The daemon extracts the wiki name from the JSON it receives, so it re-encodes the change in the middle anyway.
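For the curious, the receiving side amounts to something like this (a sketch only, not the actual rcsub code; the port number and the "wiki" field name are assumptions):

import json

from twisted.internet import reactor
from twisted.internet.protocol import DatagramProtocol

class ChangeReceiver(DatagramProtocol):
    def datagramReceived(self, data, addr):
        change = json.loads(data.decode("utf-8"))  # whatever encoding MediaWiki sent
        channel = change["wiki"]                   # route by wiki name
        # Re-encoding here is why the MediaWiki-side wire format
        # does not leak through to subscribers.
        payload = json.dumps(change, ensure_ascii=False)
        print(channel, payload)  # the real daemon fans this out to subscribers

reactor.listenUDP(9739, ChangeReceiver())  # placeholder port
reactor.run()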
I hope that getting recent changes in a reasonable format is now just a matter of code review and deployment, and that we will finally have something reasonable to work with (with access from web browsers!).
I don't consider encoding "撤销由158.64.77.102于2013年1月22日 (二) 16:46的版本24659468中的繁简破坏" (90 bytes using UTF-8) as
"\u64a4\u9500\u7531158.64.77.102\u4e8e2013\u5e741\u670822\u65e5 (\u4e8c) 16:46\u7684\u7248\u672c24659468\u4e2d\u7684\u7e41\u7b80\u7834\u574f" (141 bytes)
to be reasonable at all for a brand-new protocol running over an 8-bit clean channel.
That's your bikeshed, not mine.
-- Victor.
On 03/10/2013 06:27 PM, Victor Vasiliev wrote:
On 03/10/2013 06:30 AM, Kevin Israel wrote:
One thing you should consider is whether to escape non-ASCII characters (characters above U+007F) or to encode them using UTF-8.
"Whatever the JSON encoder we use does".
Python's json.dumps() escapes these characters by default (ensure_ascii=True). If you don't want them escaped (as hex-encoded UTF-16 code units), it's best to decide now, before clients with broken UTF-8 support come into use.
As long as it does not add newlines, this is perfectly fine protocol-wise.
If "Whatever the JSON encoder we use does" means that one day, the daemon starts sending UTF-8 encoded characters, it is quite possible that existing clients will break because of previously unnoticed encoding bugs. So I would like to see some formal documentation of the protocol.
I recently made a [patch][1] (not yet merged) that would add an opt-in "UTF8_OK" feature to FormatJson::encode(). The new option would unescape everything above U+007F (except for U+2028 and U+2029, for compatibility with JavaScript eval() based parsing).
The part between MediaWiki and the daemon does not matter that much (except for hitting the size limit on packets, and even then we are on WMF's internal network, so we should not expect packet loss or fragmentation problems). The daemon extracts the wiki name from the JSON it receives, so it re-encodes the change in the middle anyway.
It's good to know that it's quite easy to change the format of the internal UDP packets without breaking existing clients -- that it's possible to start using UTF-8 on the UDP side if necessary.
On Mon, 11 Mar 2013 00:11:59 +0100, Kevin Israel pleasestand@live.com wrote:
If "Whatever the JSON encoder we use does" means that one day, the daemon starts sending UTF-8 encoded characters, it is quite possible that existing clients will break because of previously unnoticed encoding bugs. So I would like to see some formal documentation of the protocol.
It's 2013. If something still doesn't support receiving UTF-8 data and sending it back without corrupting the text, it should be chucked out of the window, like, now.
And I don't mean things like properly determining the length of a string etc., as these are not UTF-8 specific, and *are* hard to get right; I mean not breaking binary data.
I appreciate that someone has actually done something, but this should have been discussed more. I would like to highlight that our goal should NOT be to do this in the way that is simplest for MediaWiki developers to implement and simplest for devops to set up and maintain. Our goal should be to make this feed as simple as possible to implement in the target application (bot, tool) for the developers of that tool. The ideal feed should be parseable by something as trivial as a shell script with netcat or telnet on a remote server (with absolutely no need for third-party libraries). I am fine with JSON as one option, but if it's the only option this new feed provides, it will be very hard to implement in some tools. Basically, anything that requires extra libraries makes it harder than it needs to be, even if it is more flexible and faster.
On Sun, Mar 10, 2013 at 7:34 PM, Petr Bena benapetr@gmail.com wrote:
I appreciate that someone has actually done something, but this should have been discussed more. [...] I am fine with JSON as one option, but if it's the only option this new feed provides, it will be very hard to implement in some tools.
I agree that discussion is needed, but I think this is a good starting point.
You can expect some patches from me soon. ;)
-- Tyler Romeo, Stevens Institute of Technology, Class of 2015, Major in Computer Science, www.whizkidztech.com | tylerromeo@gmail.com
On 03/10/2013 07:34 PM, Petr Bena wrote:
I appreciate that someone has actually done something, but this should have been discussed more.
Well, we can discuss this now. I don't like the discussions which end up with "here's a design of our superpony which should have those 9000+ features", and then we make no progress. I think it's sort of nice that we have a starting point now.
I would like to highlight that our goal should NOT be to do this in the way that is simplest for MediaWiki developers to implement and simplest for devops to set up and maintain. Our goal should be to make this feed as simple as possible to implement in the target application (bot, tool) for the developers of that tool.
There are different types of clients, and they all have their own "easiest format". For web browsers, WebSockets is pretty much the only serious option, and hence it is something we really want to support. The WebSocket protocol is also very complex, so for non-browser apps I added the second protocol (I am still not sure how good an idea that is).
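For comparison, here is what a non-browser WebSocket subscriber might look like (a sketch using the third-party "websockets" asyncio library purely as an illustration; the endpoint URL and the subscribe message are made up):

import asyncio
import json

import websockets

async def follow_changes():
    async with websockets.connect("ws://rcsub.example.org:9000/") as ws:
        # Hypothetical subscription message.
        await ws.send(json.dumps({"action": "subscribe", "wiki": "enwiki"}))
        async for message in ws:
            change = json.loads(message)
            print(change.get("title"), change.get("comment"))

asyncio.run(follow_changes())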
The ideal feed should be parseable by something as trivial as a shell script with netcat or telnet on a remote server (with absolutely no need for third-party libraries). I am fine with JSON as one option, but if it's the only option this new feed provides, it will be very hard to implement in some tools.
We want this to be a machine-readable feed. The two most widespread universal formats for transferring machine-readable structured data are XML and JSON. JSON parsing is almost always easier than XML parsing (and if it's not, it's probably the fault of the JSON library in use).
The text-based protocol works with netcat (that's how I tested it). However, it turns out that awk and sed are not well-suited for parsing structured data (have you ever tried to parse XML from a shell script?).
Basically, anything that requires extra libraries makes it harder than it needs to be, even if it is more flexible and faster.
I don't think I buy the claim that the lack of a built-in JSON parser is a problem with JSON rather than with that language's library situation.
-- Victor.
My main concern with the program in its current state is the lack of sufficient design. I mean, both the Configuration and MessageRouter objects are glorified dictionaries (or defaultdicts), and global variables are used for the router and config.
Also, the control protocol is almost definitely a bad idea. Since it's unauthenticated, the only way to guarantee security is to use a Unix socket (or some other only-locally-accessible method), at which point you already have the means to stop the server and read the config. Meanwhile, the stats should be fine to make publicly available. In other words, the only useful thing the control protocol could be used for is reloading the configuration.
Other than that, there are minor quirks, such as handleJSONCommand, a protocol function, being put in the Subscriber class.
And, of course, there's the issue of performance. Python doesn't handle threads well, and since Twisted isn't multiprocess AFAIK, this might not be able to handle that many connections.
Finally, other than the WebSocket and socket interfaces, the one other subscription method we should have is some sort of HTTP hook call, i.e., the daemon sends an HTTP request to the subscriber. This allows event-driven clients without keeping a socket constantly open.
-- Tyler Romeo, Stevens Institute of Technology, Class of 2015, Major in Computer Science, www.whizkidztech.com | tylerromeo@gmail.com
On 03/10/2013 08:59 PM, Tyler Romeo wrote:
My main concern with the program in its current state is the lack of sufficient design. I mean, both the Configuration and MessageRouter objects are glorified dictionaries (or defaultdicts), and global variables are used for the router and config.
That's quite intentional. The intent was to make it as simple as possible.
Anyways, this thing is currently 200 lines of code, so refactoring it is really simple. I guess at some point every simple 200-line Python script has to turn into a beautiful, elegant 600-line Python script.
(or into a 50-line Haskell script where nobody really understands what's going on, but that's the other story)
Also, the control protocol is almost definitely a bad idea. Since it's unauthenticated, the only way to guarantee security is to use a Unix socket (or some other only-locally-accessible method), at which point you already have the means to stop the server and read the config. Meanwhile, the stats should be fine to make publicly available. In other words, the only useful thing the control protocol could be used for is reloading the configuration.
Eh... it is a Unix socket. The only actual reason I added it was to support configuration reloading, because doing that through SIGUSR1 would take us into the signal minefield.
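The control channel boils down to something like this (a sketch, not the actual rcsub code; the socket path and the "reload" command name are assumptions):

from twisted.internet import reactor
from twisted.internet.protocol import Factory
from twisted.protocols.basic import LineOnlyReceiver

def reload_config():
    print("reloading configuration...")  # stand-in for the real reload logic

class ControlProtocol(LineOnlyReceiver):
    delimiter = b"\n"

    def lineReceived(self, line):
        if line.strip() == b"reload":
            reload_config()
            self.sendLine(b"ok")
        else:
            self.sendLine(b"unknown command")

# Only reachable locally, so no authentication beyond filesystem permissions.
reactor.listenUNIX("/var/run/rcsub/control.sock", Factory.forProtocol(ControlProtocol))
reactor.run()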
Other than that, there are minor quirks, such as handleJSONCommand, a protocol function, being put in the Subscriber class.
Well, there are two classes doing almost the same thing, but with incompatible interfaces due to being derived from different protocol classes. I wanted to fix that first, but it would involve multiple inheritance, so I decided to just offload the feature to Subscriber. Of course, I could have created another class called JSONSubscriber for that.
As usual, patches welcome.
And, of course, there's the issue of performance. Python doesn't handle threads well, and since Twisted isn't multiprocess AFAIK, this might not be able to handle that many connections.
Well, the issue here is that you essentially have a simple program which takes a message from one port and then resends it to many others. Even if threads would be of help here, Python handles I/O-bound multithreading better than other kinds.
Finally, other than the WebSocket and socket interfaces, the one other subscription method we should have is some sort of HTTP hook call, i.e., the daemon sends an HTTP request to the subscriber. This allows event-driven clients without keeping a socket constantly open.
I am not sure what exactly you mean by that.
Thank you for your feedback.
-- Victor.
On Sun, Mar 10, 2013 at 10:38 PM, Victor Vasiliev vasilvv@gmail.com wrote:
Finally, other than the WebSocket and socket interfaces, the one other subscription method we should have is some sort of HTTP hook call, i.e., the daemon sends an HTTP request to the subscriber. This allows event-driven clients without keeping a socket constantly open.
I am not sure what exactly you mean by that.
When a message is sent, it is delivered by the daemon submitting an HTTP POST request to a registered client URI. This is a commonly used scheme for push notification delivery, for example in Amazon's notification service.
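Roughly like this on the daemon side (a stdlib-only sketch; the subscriber URIs are made up, and dropping a subscriber on any error is just one possible policy):

import json
import urllib.request

subscribers = ["https://bot.example.org/rc-hook"]  # registered client URIs

def deliver(change):
    body = json.dumps(change, ensure_ascii=False).encode("utf-8")
    dead = []
    for uri in subscribers:
        req = urllib.request.Request(
            uri,
            data=body,  # data= makes this a POST
            headers={"Content-Type": "application/json"},
        )
        try:
            urllib.request.urlopen(req, timeout=5)
        except OSError:
            dead.append(uri)  # unreachable or erroring subscribers get dropped
    for uri in dead:
        subscribers.remove(uri)

deliver({"wiki": "enwiki", "title": "Example", "user": "Example user"})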
-- Tyler Romeo, Stevens Institute of Technology, Class of 2015, Major in Computer Science, www.whizkidztech.com | tylerromeo@gmail.com
On 2013-03-11 12:26 AM, "Tyler Romeo" tylerromeo@gmail.com wrote:
Wait, so it just sends HTTP POST requests to some address until explicitly told to stop? That sounds like an incredibly bad idea (if I understand it correctly):
* If you forget to unsubscribe, we send you POST requests until the end of eternity.
* DoS vector: register the URL of someone you don't like. Register 1000000 variants from the same domain. Push enwiki's RC feed there.
In any case, I don't see the need to have every form of push API imaginable implemented -- especially not initially, but even in general.
-bawolff
On Sun, Mar 10, 2013 at 11:53 PM, Brian Wolff bawolff@gmail.com wrote:
* If you forget to unsubscribe, we send you POST requests until the end of eternity.
Have it cut off if it receives an invalid HTTP response.
* DoS vector: register the URL of someone you don't like. Register 1000000 variants from the same domain. Push enwiki's RC feed there.
Or you roll out some EC2 instances and open 1000000 sockets. (And before you say rate-limit based on IP address, the same can be done for the HTTP idea.)
In any case, I don't see the need to have every form of push API imaginable implemented -- especially not initially, but even in general.
Agreed, but this is a pretty basic one. In fact, if you use HTTP keep-alive, it's almost identical to the TCP push method anyway, except that you can use a web server rather than rolling your own socket client.
-- Tyler Romeo, Stevens Institute of Technology, Class of 2015, Major in Computer Science, www.whizkidztech.com | tylerromeo@gmail.com
On 2013-03-11 1:11 AM, "Tyler Romeo" tylerromeo@gmail.com wrote:
* DoS vector: register the URL of someone you don't like. Register 1000000 variants from the same domain. Push enwiki's RC feed there.
Or you roll out some EC2 instances and open 1000000 sockets. (And before you say rate-limit based on IP address, the same can be done for the HTTP idea.)
I mean you could use such a service to DoS somebody else. If you can open sockets yourself, then it's your own server.
Sure, you could add some mechanism to prove you own the domain where you want the RC updates to be sent, but things can get rather complex.
--bawolff
Hey,
Sure, you could add some mechanism to prove you own the domain where you want the RC updates to be sent, but things can get rather complex.
Google uses, or at least used to use, the following to do exactly that:
On request, provide an auth file to the user which includes some unique identifier. Require this file to be made available via the domain in question. Have the user point to the location where it is made available, and check whether it is actually there. If so, the domain is authenticated.
That seems rather simple to create.
Cheers
-- Jeroen De Dauw http://www.bn2vs.com Don't panic. Don't be evil. --
On 2013-03-11 3:46 PM, "Jeroen De Dauw" jeroendedauw@gmail.com wrote:
I think that proves my point -- what you describe is not what Google does. Google tells the user the path for the file (I believe the usual place is the root of the domain); the user does not pick the path. Otherwise I could prove I own wikipedia.org (assuming MIME types weren't checked) by using action=raw.
Things that finicky to make secure should be avoided, IMO.
-bawolff
Hey,
what you describe is not what Google does.
Google tells the user the path for the file (I believe the usual place is the root of the domain); the user does not pick the path. Otherwise I could prove I own wikipedia.org (assuming MIME types weren't checked) by using action=raw.
Good point, I remembered that wrong then.
I think that proves my point -- what you describe is not what Google does.
https://en.wikipedia.org/wiki/Fallacy_fallacy
The approach still seems simple. In fact, it seems simpler. So why would we not want to use it?
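Something along these lines (a sketch under the corrected scheme, where the service rather than the user picks the path; the path and token format are assumptions):

import secrets
import urllib.request

def issue_challenge():
    # Hand this token to the user along with the exact path to put it at.
    return secrets.token_hex(16)

def verify_domain(domain, token):
    # The service dictates the path, which defeats tricks like action=raw.
    url = f"http://{domain}/.well-known/rcsub-verification.txt"
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.read().strip() == token.encode("ascii")
    except OSError:
        return False

token = issue_challenge()
print(verify_domain("bot.example.org", token))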
Cheers
-- Jeroen De Dauw http://www.bn2vs.com Don't panic. Don't be evil. --
Honestly, the solution could be as simple as requiring that the HTTP response have a certain header or something.
-- Tyler Romeo, Stevens Institute of Technology, Class of 2015, Major in Computer Science, www.whizkidztech.com | tylerromeo@gmail.com
On 2013-03-11 4:32 PM, "Tyler Romeo" tylerromeo@gmail.com wrote:
OK, I withdraw my security-related objections :). Some sort of header-based checking to make sure the POSTs are wanted sounds sane (provided that, at the very start, a GET request is used to verify this; POST requests to arbitrary unverified URLs can be dangerous).
-bawolff
On Mon, Mar 11, 2013 at 3:53 PM, Brian Wolff bawolff@gmail.com wrote:
(provided that, at the very start, a GET request is used to verify this; POST requests to arbitrary unverified URLs can be dangerous)
Totally agreed. If anything, the GET request could be used to obtain initial information about the client, such as which channels to subscribe to.
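So the handshake could look like this (a sketch; the opt-in header and the response format are hypothetical):

import json
import urllib.request

def verify_subscriber(uri):
    # GET the subscriber URI before sending any POSTs.
    try:
        with urllib.request.urlopen(uri, timeout=5) as resp:
            if resp.headers.get("X-RCSub-Accept") != "yes":  # hypothetical opt-in header
                return None
            # e.g. the body could be {"channels": ["enwiki", "dewiki"]}
            return json.loads(resp.read().decode("utf-8")).get("channels", [])
    except (OSError, ValueError):
        return None  # unreachable or malformed: refuse the subscription

print(verify_subscriber("https://bot.example.org/rc-hook"))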
-- Tyler Romeo, Stevens Institute of Technology, Class of 2015, Major in Computer Science, www.whizkidztech.com | tylerromeo@gmail.com
If "Whatever the JSON encoder we use does" means that one day, the daemon starts sending UTF-8 encoded characters, it is quite possible that existing clients will break because of previously unnoticed encoding bugs. So I would like to see some formal documentation of the protocol.
The JSON standard is pretty clear that any character can be escaped using a \u UTF-16 code point, or you can just have things be UTF-8. If clients break because they can't handle that, that is the client's fault. It's not a hard requirement.
I see no reason why we couldn't change this later if need be. Furthermore, I see no reason why we would care which way we go on that issue. The raw JSON isn't meant for human eyes.
-bawolff
On 03/10/2013 07:53 PM, Brian Wolff wrote:
The JSON standard is pretty clear that any character can be escaped using a \u UTF-16 code point, or you can just have things be UTF-8. If clients break because they can't handle that, that is the client's fault. It's not a hard requirement.
Just a note: the JSON RFC (https://www.ietf.org/rfc/rfc4627.txt, section 3) explicitly allows any of the main Unicode encodings (UTF-32, UTF-16, UTF-8), in both endiannesses (except, of course, UTF-8, which has no byte-order variants).
UTF-8 *is* the default encoding, and it's the best choice, but not the only one.
Matt Flaschen
On 2013-03-10 1:20 AM, "Victor Vasiliev" vasilvv@gmail.com wrote:
Good work. It's wonderful to see people just go and fix things that need fixing, instead of the usual bikeshedding that often takes place.
-bawolff
P.S. I too used to read the IRC RC feed (back when I was an active editor at enwikinews). It can be useful on smaller projects.