Thanks for the feedback so far, this is great.
> If I'm not using whatever language happens to have a library already written, there's no spec so I have to reverse-engineer it from an implementation.

Brad, sorry, that is just an example for the nodejs Kasocki library. We will need more non-language-specific docs about how to interact with this service, whatever it ends up being. In either the socket.io or SSE/EventSource (HTTP) case, you will need a client library, so there will be language-specific code and documentation needed. But the interface itself will be documented in a non-language-specific way. Keep on me about this. When this thing is 'done', if the documentation is not what you are looking for, yell at me and I will make it better! :)
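Just to make "non-language-specific" concrete: in the SSE/EventSource case the interface is basically "GET a URL with Accept: text/event-stream and parse the event stream", and the language-specific bit is a thin client on top of that. Something like this sketch in Python (the URL and field names here are made up, not the actual interface):

    import json
    from sseclient import SSEClient  # pip install sseclient

    # Hypothetical endpoint -- the real URL/path is not decided yet.
    url = 'https://stream.example.org/v1/streams/recentchange'

    for event in SSEClient(url):
        if not event.data:
            continue  # skip keep-alives
        change = json.loads(event.data)
        print(change.get('wiki'), change.get('title'))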
BTW, given that halfak wrote the original proposal[1] for this project, and that he maintains a Python abstraction for MediaWiki Events[2] based on recent changes, I wouldn't be surprised if he (or someone) built a Python abstraction on top of EventStreams, whatever transport we end up choosing.
Antoine, you are right that we don't really have a plan for phasing out the older systems. RCFeed especially, since there are so many tools built on it. RCStream will be similar to the EventStreams / Kasocki stuff, and we know that people at least want it to use an up to date socket.io version, so that might be easier to phase out. I don't even know who maintains RCFeed. I'll reach out and see if I can understand this and make a phase-out plan as a subtask of the larger Public Event Streams project.
> First of all, why does it need to be a replacement, rather than something that builds on existing infrastructure?

We want a larger feature set than the existing infrastructure provides. RCStream is built only for the recent changes events, and has no historical addressing. Clients should be able to reconnect and start the stream from where they last left off, or even from wherever they choose. In a dream world, I'd love to see this thing support timestamp-based consumption for any point in time. That is, if you wanted to start consuming a stream of edits starting in March 2013, you could do it.
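Very roughly, I'm imagining the client side looking something like this (pure sketch: the URL, the 'since' parameter, and the SSE transport are all assumptions here, not decisions):

    import json
    from sseclient import SSEClient  # assuming an SSE-style transport for this sketch

    # Hypothetical: ask the server to replay the stream from a point in time.
    url = 'https://stream.example.org/v1/streams/edits?since=2013-03-01T00:00:00Z'

    last_id = None
    while True:
        try:
            for event in SSEClient(url, last_id=last_id):
                if not event.data:
                    continue
                last_id = event.id  # remember our position in the stream
                edit = json.loads(event.data)
                print(edit.get('timestamp'), edit.get('title'))
        except Exception:
            # On disconnect, reconnect and resume from last_id
            # rather than jumping back to the tip of the stream.
            continue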
> You could consider not implementing streaming /at all/, and just ask clients to poll an http endpoint, which is much easier to implement client-side than anything streaming (especially when it comes to handling disconnects).

True, but I think this would change the way people interact with this data. Maybe that is ok? I'm not sure. I'm not a browser developer, so I don't know a lot about what is easy or hard in browsers (which is why I started this thread :) ). But keeping the stream model intact would be powerful. A public, resumable stream of Wikimedia events would allow folks outside of WMF networks to build realtime stream processing tooling on top of our data. Folks with their own Spark or Flink or Storm clusters (in Amazon or labs or wherever) could consume this and perform complex stream processing (e.g. machine learning algorithms (like ORES), windowed trending aggregations, etc.).
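For comparison, the polling model would look something like this (again just a sketch; requests is real, but the endpoint and its 'since' semantics are hypothetical):

    import time
    import requests

    # Hypothetical poll endpoint that returns events newer than 'since'.
    url = 'https://stream.example.org/v1/events/recentchange'
    since = '2016-09-26T00:00:00Z'

    while True:
        resp = requests.get(url, params={'since': since}, timeout=10)
        resp.raise_for_status()
        for event in resp.json():
            print(event.get('title'))
            since = event.get('timestamp', since)  # advance our cursor
        time.sleep(5)  # poll interval: clients trade latency for simplicity

That's certainly easier for clients to get right, but every consumer ends up managing its own cursor and latency, and it's a less natural fit for hooking stream processors directly onto the data.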
> Why not expose the websockets as a standard websocket server so that it can be consumed by any language/platform that has a standard websocket implementation?

This is a good question, and not something I had considered. I started with socket.io because that was what RCStream used, and it seemed to have a lot of really nice abstractions and to solve problems that I'd otherwise have to deal with myself if I used raw websockets. I had assumed that socket.io was generally preferred over working with websockets directly, but maybe this is not the case?
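For reference, consuming a plain websocket (no socket.io layer) from Python would look roughly like this (sketch only; the endpoint is hypothetical, using the 'websockets' library):

    import asyncio
    import json
    import websockets  # pip install websockets

    # Hypothetical plain-websocket endpoint, no socket.io framing.
    URL = 'wss://stream.example.org/v1/streams/recentchange'

    async def consume():
        async with websockets.connect(URL) as ws:
            while True:
                message = await ws.recv()
                change = json.loads(message)
                print(change.get('wiki'), change.get('title'))

    asyncio.get_event_loop().run_until_complete(consume())

If we went this way, the niceties socket.io gives us (subscriptions/rooms, fallback transports, reconnection handling) would have to be reimplemented or dropped, which is the work I was hoping to avoid.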
[1] https://meta.wikimedia.org/wiki/Research:MediaWiki_events:_a_generalized_pub...
[2] https://github.com/mediawiki-utilities/python-mwevents
On Mon, Sep 26, 2016 at 12:28 PM, Joaquin Oltra Hernandez <jhernandez@wikimedia.org> wrote:
Why not expose the websockets as a standard websocket server so that it can be consumed by any language/platform that has a standard websocket implementation?
https://www.npmjs.com/package/ws
Pinning to socket.io versions or other abstractions leads to what happened before: you can get stuck on an old version, and the protocol is specific to the library and the platforms where that library has been implemented.
By using a standard websocket server, you can provide a minimal standards compliant service that can be consumed across other languages/clients, and if there are services that need the socket.io features you can provide a different service that proxies the original one but puts socket.io on top of it.
For the RCStream use case, server-sent events (SSE) are a great fit too (given you don't need bidirectional communication), so that would also make a lot of sense instead of websockets (it'd probably be easier to scale).
Whatever it is, I'd vote for sticking to standard implementations, either pure websockets or HTTP server-sent events, and letting shim layers that provide other features (like socket.io) be implemented in separate proxy servers.
On Sun, Sep 25, 2016 at 4:02 PM, Merlijn van Deen (valhallasw) <valhallasw@arctus.nl> wrote:
Hi Andrew,
On 23 September 2016 at 23:15, Andrew Otto <otto@wikimedia.org> wrote:
> We've been busy working on building a replacement for RCStream. This new service would expose recentchanges as a stream as usual, but also other types of event streams that we can make public.
First of all, why does it need to be a replacement, rather than something that builds on existing infrastructure? Re-using the existing infrastructure provides a much more convenient path for consumers to upgrade.
> But we're having a bit of an existential crisis! We had originally chosen to implement this using an up to date socket.io server, as RCStream also uses socket.io. We're mostly finished with this, but now we are taking a step back and wondering if socket.io/websockets are the best technology to use to expose stream data these days.
For what it's worth, I'm on the fence about socket.io. My biggest argument for socket.io is the fact that rcstream already uses it, but my experience with implementing the pywikibot consumer for rcstream is that the Python libraries are lacking, especially when it comes to stuff like reconnecting. In addition, debugging issues requires knowledge of both socket.io and the underlying websockets layer, which are both very different from regular http.
From the task description, I understand that the goal is to allow easy resumption by passing information about the last received message. You could consider not implementing streaming /at all/, and just ask clients to poll an http endpoint, which is much easier to implement client-side than anything streaming (especially when it comes to handling disconnects).
So: my preference would be extending the existing rcstream framework, but if that's not possible, my preference would be not streaming at all.
Merlijn