Thanks for the feedback so far, this is great.
> If I'm not using whatever language happens to have a library already written, there's no spec so I have to reverse-engineer it from an implementation.

Brad, sorry, that is just an example for the nodejs Kasocki library. We will need more non-language-specific docs about how to interact with this service, whatever it ends up being. In either the socket.io or SSE/EventSource (HTTP) case, you will need a client library, so there will be language-specific code and documentation needed. But the interface itself will be documented in a non-language-specific way. Keep on me about this. When this thing is 'done', if the documentation is not what you are looking for, yell at me and I will make it better! :)
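Just to make "non-language-specific" concrete: in the SSE/EventSource case the interface is basically "GET a URL with Accept: text/event-stream and parse the event stream", and the language-specific bit is a thin client on top of that. Something like this sketch in Python (the URL and field names here are made up, not the actual interface):

    import json
    from sseclient import SSEClient  # pip install sseclient

    # Hypothetical endpoint -- the real URL/path is not decided yet.
    url = 'https://stream.example.org/v1/streams/recentchange'

    for event in SSEClient(url):
        if not event.data:
            continue  # skip keep-alives
        change = json.loads(event.data)
        print(change.get('wiki'), change.get('title'))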
BTW, given that halfak wrote the original proposal[1] for this project, and that he maintains a Python abstraction for MediaWiki Events[2] based on recent changes, I wouldn't be surprised if he (or someone) built a Python abstraction on top of EventStreams, whatever transport we end up choosing.
Antoine, you are right that we don't really have a plan for phasing out the older systems. RCFeed especially, since there are so many tools built on it. RCStream will be similar to the EventStreams / Kasocki stuff, and we know that people at least want it to use an up to date socket.io version, so that might be easier to phase out. I don't even know who maintains RCFeed. I'll reach out and see if I can understand this and make a phase-out plan as a subtask of the larger Public Event Streams project.
> First of all, why does it need to be a replacement, rather than something that builds on existing infrastructure?

We want a larger feature set than the existing infrastructure provides. RCStream is built only for the recent changes events, and has no historical addressing. Clients should be able to reconnect and start the stream from where they last left off, or even from wherever they choose. In a dream world, I'd love to see this thing support timestamp-based consumption for any point in time. That is, if you wanted to start consuming a stream of edits starting in March 2013, you could do it.
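Very roughly, I'm imagining the client side looking something like this (pure sketch: the URL, the 'since' parameter, and the SSE transport are all assumptions here, not decisions):

    import json
    from sseclient import SSEClient  # assuming an SSE-style transport for this sketch

    # Hypothetical: ask the server to replay the stream from a point in time.
    url = 'https://stream.example.org/v1/streams/edits?since=2013-03-01T00:00:00Z'

    last_id = None
    while True:
        try:
            for event in SSEClient(url, last_id=last_id):
                if not event.data:
                    continue
                last_id = event.id  # remember our position in the stream
                edit = json.loads(event.data)
                print(edit.get('timestamp'), edit.get('title'))
        except Exception:
            # On disconnect, reconnect and resume from last_id
            # rather than jumping back to the tip of the stream.
            continue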
> You could consider not implementing streaming /at all/, and just ask clients to poll an http endpoint, which is much easier to implement client-side than anything streaming (especially when it comes to handling disconnects).

True, but I think this would change the way people interact with this data. Maybe that is ok? I'm not sure. I'm not a browser developer, so I don't know a lot about what is easy or hard in browsers (which is why I started this thread :) ). But keeping the stream model intact would be powerful. A public, resumable stream of Wikimedia events would allow folks outside of WMF networks to build realtime stream processing tooling on top of our data. Folks with their own Spark or Flink or Storm clusters (in Amazon or labs or wherever) could consume this and perform complex stream processing (e.g. machine learning algorithms (like ORES), windowed trending aggregations, etc.).
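For comparison, the polling model would look something like this (again just a sketch; requests is real, but the endpoint and its 'since' semantics are hypothetical):

    import time
    import requests

    # Hypothetical poll endpoint that returns events newer than 'since'.
    url = 'https://stream.example.org/v1/events/recentchange'
    since = '2016-09-26T00:00:00Z'

    while True:
        resp = requests.get(url, params={'since': since}, timeout=10)
        resp.raise_for_status()
        for event in resp.json():
            print(event.get('title'))
            since = event.get('timestamp', since)  # advance our cursor
        time.sleep(5)  # poll interval: clients trade latency for simplicity

That's certainly easier for clients to get right, but every consumer ends up managing its own cursor and latency, and it's a less natural fit for hooking stream processors directly onto the data.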
> Why not expose the websockets as a standard websocket server so that it can be consumed by any language/platform that has a standard websocket implementation?

This is a good question, and not something I had considered. I started with socket.io because that was what RCStream used, and it seemed to have a lot of really nice abstractions and to solve problems that I'd otherwise have to deal with myself if I used raw websockets. I had assumed that socket.io was generally preferred over working with websockets directly, but maybe this is not the case?
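For reference, consuming a plain websocket (no socket.io layer) from Python would look roughly like this (sketch only; the endpoint is hypothetical, using the 'websockets' library):

    import asyncio
    import json
    import websockets  # pip install websockets

    # Hypothetical plain-websocket endpoint, no socket.io framing.
    URL = 'wss://stream.example.org/v1/streams/recentchange'

    async def consume():
        async with websockets.connect(URL) as ws:
            while True:
                message = await ws.recv()
                change = json.loads(message)
                print(change.get('wiki'), change.get('title'))

    asyncio.get_event_loop().run_until_complete(consume())

If we went this way, the niceties socket.io gives us (subscriptions/rooms, fallback transports, reconnection handling) would have to be reimplemented or dropped, which is the work I was hoping to avoid.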
[1] https://meta.wikimedia.org/wiki/Research:MediaWiki_events:_a_generalized_pub...
[2] https://github.com/mediawiki-utilities/python-mwevents
On Mon, Sep 26, 2016 at 12:28 PM, Joaquin Oltra Hernandez <jhernandez@wikimedia.org> wrote:
Why not expose the websockets as a standard websocket server so that it can be consumed by any language/platform that has a standard websocket implementation?
https://www.npmjs.com/package/ws
Pinning to socket.io versions or other abstractions leads to what happened before: you can get stuck on an old version, and the protocol is specific to the library and the platforms where that library has been implemented.
By using a standard websocket server, you can provide a minimal standards compliant service that can be consumed across other languages/clients, and if there are services that need the socket.io features you can provide a different service that proxies the original one but puts socket.io on top of it.
For the RCStream use case, server-sent events (SSE) are a great fit too (given you don't need bidirectional communication), so that would also make a lot of sense instead of websockets (it'd probably be easier to scale).
Whatever it is, I'd vote for sticking to standard implementations, either pure websockets or HTTP server-sent events, and letting shim layers that provide other features (like socket.io) be implemented in separate proxy servers.
On Sun, Sep 25, 2016 at 4:02 PM, Merlijn van Deen (valhallasw) <valhallasw@arctus.nl> wrote:
Hi Andrew,
On 23 September 2016 at 23:15, Andrew Otto <otto@wikimedia.org> wrote:
> We've been busy working on building a replacement for RCStream. This new service would expose recentchanges as a stream as usual, but also other types of event streams that we can make public.
First of all, why does it need to be a replacement, rather than something that builds on existing infrastructure? Re-using the existing infrastructure provides a much more convenient path for consumers to upgrade.
> But we're having a bit of an existential crisis! We had originally chosen to implement this using an up to date socket.io server, as RCStream also uses socket.io. We're mostly finished with this, but now we are taking a step back and wondering if socket.io/websockets are the best technology to use to expose stream data these days.
For what it's worth, I'm on the fence about socket.io. My biggest argument for socket.io is the fact that rcstream already uses it, but my experience with implementing the pywikibot consumer for rcstream is that the Python libraries are lacking, especially when it comes to stuff like reconnecting. In addition, debugging issues requires knowledge of both socket.io and the underlying websockets layer, which are both very different from regular http.
From the task description, I understand that the goal is to allow easy resumption by passing information about the last received message. You could consider not implementing streaming /at all/, and just ask clients to poll an http endpoint, which is much easier to implement client-side than anything streaming (especially when it comes to handling disconnects).
So: my preference would be extending the existing rcstream framework, but if that's not possible, my preference would be not streaming at all.
Merlijn