Hi all,
We’ve been busy working on building a replacement for RCStream. This new service would expose recentchanges as a stream as usual, but also other types of event streams that we can make public.
But we’re having a bit of an existential crisis! We had originally chosen to implement this using an up to date socket.io server, as RCStream also uses socket.io. We’re mostly finished with this, but now we are taking a step back and wondering if socket.io/websockets are the best technology to use to expose stream data these days.
The alternative is to just use ‘streaming’ HTTP chunked transfer encoding. That is, the client makes a HTTP request for a stream, and the server declares that it will be sending back data indefinitely in the response body. Clients just read (and parse) events out of the HTTP response body. There is some event tooling built on top of this (namely SSE / EventSource), but the basic idea is a never ending streamed HTTP response body.
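For illustration, a minimal Python consumer of such a stream might look roughly like this (the URL, the SSE-style 'data:' framing and the field names are all made up for the example):

    import json
    import requests

    # Hypothetical URL -- the real endpoint does not exist yet.
    STREAM_URL = 'https://stream.example.org/v1/recentchanges'

    def consume(url):
        # stream=True keeps the response body open; the server never finishes it.
        with requests.get(url, stream=True, timeout=(5, None)) as resp:
            resp.raise_for_status()
            for line in resp.iter_lines():
                # SSE frames each event as a "data: <json>" line followed by a blank line.
                if line.startswith(b'data: '):
                    event = json.loads(line[len(b'data: '):])
                    print(event.get('title'), event.get('user'))

    consume(STREAM_URL)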
So, I’m reaching out to gather some input to help inform a decision. What will be easier for you users of RCStream in the future? Would you prefer to keep using socket.io (newer version), or would you prefer to work directly with HTTP? There seem to be good clients for socket.io and for SSE/EventSource in many languages.
https://phabricator.wikimedia.org/T130651 has more context, but don’t worry about reading it; it is getting a little long. Feel free to chime in there or on this thread.
Thanks! -Andrew Otto
The few times I've tried to look at the existing rcstream service, I've quickly been stymied by not finding any documentation of the actual protocol involved.
Whatever solution is chosen, it would be very nice if there were easy-to-find documentation that a skilled developer could use to consume the service, starting with the ability to make an SSL connection to a server, instead of starting from "use python or nodejs, then require 'socket.io'".
So, since most of the dev work for a socket.io implementation is already done, you can see what the protocol would look like here: https://github.com/wikimedia/kasocki#socketio-client-set-up
Kasocki is just a library, the actual WMF deployment and documentation would be more specific about MediaWiki type events, but the interface would be the same. (Likely there would be client libraries to abstract the actual socket.io interaction.)
For HTTP, instead of an RPC-style protocol where you configure the stream you want via several socket.emit calls, you’d construct a URI that specifies the event streams (and partitions and offsets, if necessary) and filters you want, and then request it. Perhaps something like ‘http:// .../stream/mediawiki.revision-create?database=plwiki;rev_len:gt100' (I totally just made this URL up, no idea if it would work this way.).
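Constructing such a request in Python could look like the following; the stream name, filter keys and the ':gt' comparison syntax all mirror the made-up URL above and are not a real interface:

    from urllib.parse import urlencode

    # Hypothetical base URL and filter syntax, echoing the example above.
    base = 'https://stream.example.org/stream/mediawiki.revision-create'
    filters = {'database': 'plwiki', 'rev_len:gt': 100}

    url = base + '?' + urlencode(filters)
    print(url)
    # https://stream.example.org/stream/mediawiki.revision-create?database=plwiki&rev_len%3Agt=100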
On Sat, Sep 24, 2016 at 11:41 AM, Andrew Otto otto@wikimedia.org wrote:
So, since most of the dev work for a socket.io implementation is already done, you can see what the protocol would look like here: https://github.com/wikimedia/kasocki#socketio-client-set-up
Kasocki is just a library, the actual WMF deployment and documentation would be more specific about MediaWiki type events, but the interface would be the same. (Likely there would be client libraries to abstract the actual socket.io interaction.)
See, that's the sort of thing I was complaining about. If I'm not using whatever language happens to have a library already written, there's no spec so I have to reverse-engineer it from an implementation. And in this case that seems like socket.io on top of engine.io on top of who knows what else.
On 24/09/2016 at 22:51, Brad Jorsch (Anomie) wrote:
See, that's the sort of thing I was complaining about. If I'm not using whatever language happens to have a library already written, there's no spec so I have to reverse-engineer it from an implementation. And in this case that seems like socket.io on top of engine.io on top of who knows what else.
socket.io has libraries in several languages. The RCStream documentation shows examples for JavaScript and Python: https://wikitech.wikimedia.org/wiki/RCStream#Client
It is true though that a lib has to be written on top of that to be aware of the MediaWiki events dialect.
On 23/09/2016 at 23:15, Andrew Otto wrote:
Hi all,
We’ve been busy working on building a replacement for RCStream. This new service would expose recentchanges as a stream as usual, but also other types of event streams that we can make public.
But we’re having a bit of an existential crisis! We had originally chosen to implement this using an up to date socket.io server, as RCStream also uses socket.io. We’re mostly finished with this, but now we are taking a step back and wondering if socket.io/websockets are the best technology to use to expose stream data these days.
The alternative is to just use ‘streaming’ HTTP chunked transfer encoding.
<snip>
Hello,
As I understand it we have a legacy system we want to replace. It uses an old socket.io with a set of events A.
Since you "are mostly finished with" a replacement that has the latest socket.io I would ship that now and drop/replace the legacy system. With no new events.
From there, survey people about changing the transport layer. Which leads me to a few questions:
- is RCStream actually used?
- how many clients?
- typology of clients (big corp like Yahoo, Google, volunteers, WMF internal use) ...
Then survey about the change of transport. The red herring is that if you get mostly volunteers, it is going to be long and tedious to have them change to the new system. AFAIK WMF still maintains an IRC server to stream events, which was supposed to be replaced by RCStream. There are still tools and bots relying on the IRC protocol with no developers able to do the migration.
You will face the exact same problem by changing to HTTP chunks, and we would end up with:
- IRC (legacy)
- socket.io (on a legacy / outdated infra)
- HTTP chunks
My recommendations are:
- upgrade the current socket.io, since it is apparently already done;
- find out who the consumers of the IRC feed and RCStream are, run a survey and figure out what would fit their needs best;
- come up with a plan to DROP the old systems.
And hopefully we end up with a single system from which people can build upon and on which we can introduce new type of events.
My 0.02 €
Hi Andrew,
On 23 September 2016 at 23:15, Andrew Otto otto@wikimedia.org wrote:
We’ve been busy working on building a replacement for RCStream. This new service would expose recentchanges as a stream as usual, but also other types of event streams that we can make public.
First of all, why does it need to be a replacement, rather than something that builds on existing infrastructure? Re-using the existing infrastructure provides a much more convenient path for consumers to upgrade.
But we’re having a bit of an existential crisis! We had originally chosen to implement this using an up to date socket.io server, as RCStream also uses socket.io. We’re mostly finished with this, but now we are taking a step back and wondering if socket.io/websockets are the best technology to use to expose stream data these days.
For what it's worth, I'm on the fence about socket.io. My biggest argument for socket.io is the fact that rcstream already uses it, but my experience with implementing the pywikibot consumer for rcstream is that the Python libraries are lacking, especially when it comes to stuff like reconnecting. In addition, debugging issues requires knowledge of both socket.io and the underlying websockets layer, which are both very different from regular http.
From the task description, I understand that the goal is to allow easy resumption by passing information about the last received message. You could consider not implementing streaming /at all/, and just ask clients to poll an HTTP endpoint, which is much easier to implement client-side than anything streaming (especially when it comes to handling disconnects).
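To sketch what that could look like for a client (the endpoint and the 'since' parameter are hypothetical, just standing in for whatever "last received message" marker the service would expose):

    import time
    import requests

    # Hypothetical polling endpoint.
    ENDPOINT = 'https://stream.example.org/v1/events'

    def handle(event):
        print(event)

    def poll_forever(interval=5):
        since = None
        while True:
            params = {'since': since} if since is not None else {}
            resp = requests.get(ENDPOINT, params=params, timeout=10)
            resp.raise_for_status()
            for event in resp.json():   # assume a plain JSON array of new events
                handle(event)
                since = event.get('id', since)
            time.sleep(interval)

    poll_forever()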
So: My preference would be extending the existing rcstream framework, but if that's not possible, my preference would be to not stream at all.
Merlijn
Why not expose the websockets as a standard websocket server so that it can be consumed by any language/platform that has a standard websocket implementation?
https://www.npmjs.com/package/ws
Pinning to socket.io versions or other abstractions leads to what happened before: you can get stuck on an old version, and the protocol is specific to the library and platforms where that library has been implemented.
By using a standard websocket server, you can provide a minimal standards compliant service that can be consumed across other languages/clients, and if there are services that need the socket.io features you can provide a different service that proxies the original one but puts socket.io on top of it.
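For instance, a consumer using a plain websocket client could be as simple as the following (using the third-party Python websocket-client package; the URL is hypothetical):

    # Needs the third-party "websocket-client" package.
    import json
    from websocket import create_connection

    ws = create_connection('wss://stream.example.org/recentchanges')
    try:
        while True:
            # Each message is assumed to be a JSON-encoded event.
            event = json.loads(ws.recv())
            print(event)
    finally:
        ws.close()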
---
For the RCStream use case, server-sent events (SSE) are a great fit too (given you don't need bidirectional communication), so that would make a lot of sense instead of websockets (it'd probably be easier to scale).
Whatever it is I'd vote for sticking to standard implementations, either pure websockets or http server sent events, and let shim layers that provide other features like socket.io be implemented in other proxy servers.
Thanks for feedback so far, this is great.
> If I’m not using whatever language happens to have a library already written, there’s no spec so I have to reverse-engineer it from an implementation.

Brad, sorry, that is just an example for the nodejs Kasocki library. We will need more non-language-specific docs about how to interact with this service, no matter what it might be. In either the socket.io or SSE/EventSource (HTTP) case, you will need a client library, so there will be language-specific code and documentation needed. But the interface will be documented in a non-language-specific way. Keep on me about this. When this thing is ‘done’, if the documentation is not what you are looking for, yell at me and I will make it better! :)
BTW, given that halfak wrote the original proposal[1] for this project, and that he maintains a Python abstraction for MediaWiki Events[2] based on recent changes, I wouldn’t be surprised if he (or someone) incorporated a Python abstraction on top of EventStreams, whatever transport we end up choosing.
Antoine, you are right that we don’t really have a plan for phasing out the older systems. RCFeed especially, since there are so many tools built on it. RCStream will be similar to the EventStreams / Kasocki stuff, and we know that people at least want it to use an up-to-date socket.io version, so that might be easier to phase out. I don’t even know who maintains RCFeed. I’ll reach out and see if I can understand this and make a phase-out plan as a subtask of the larger Public Event Streams project.
> First of all, why does it need to be a replacement, rather than something that builds on existing infrastructure?

We want a larger feature set than the existing infrastructure provides. RCStream is built for only the Recent Changes events, and has no historical addressing. Clients should be able to reconnect and start the stream from where they last left off, or even wherever they choose. In a dream world, I’d love to see this thing support timestamp-based consumption for any point in time. That is, if you wanted to start consuming a stream of edits starting in March 2013, you could do it.
> You could consider not implementing streaming /at all/, and just ask clients to poll an http endpoint, which is much easier to implement client-side than anything streaming (especially when it comes to handling disconnects).

True, but I think this would change the way people interact with this data. But maybe that is ok? I’m not sure. I’m not a browser developer, so I don’t know a lot about what is easy or hard in browsers (which is why I started this thread :) ). But keeping the stream model intact will be powerful. A public resumable stream of Wikimedia events would allow folks outside of WMF networks to build realtime stream processing tooling on top of our data. Folks with their own Spark or Flink or Storm clusters (in Amazon or labs or wherever) could consume this and perform complex stream processing (e.g. machine learning algorithms (like ORES), windowed trending aggregations, etc.).
> Why not expose the websockets as a standard web socket server so that it can be consumed by any language/platform that has a standard web socket implementation?

This is a good question, and not something I had considered. I started with socket.io because that was what RCStream used, and it seemed to have a lot of really nice abstractions and solved problems that I’d have to deal with myself if I used websockets. I had assumed that socket.io was generally preferred to working with websockets, but maybe this is not the case?
[1] https://meta.wikimedia.org/wiki/Research:MediaWiki_events:_a_generalized_pub... [2] https://github.com/mediawiki-utilities/python-mwevents
On Mon, Sep 26, 2016 at 5:57 AM, Andrew Otto otto@wikimedia.org wrote:
A public resumable stream of Wikimedia events would allow folks outside of WMF networks to build realtime stream processing tooling on top of our data. Folks with their own Spark or Flink or Storm clusters (in Amazon or labs or wherever) could consume this and perform complex stream processing (e.g. machine learning algorithms (like ORES), windowed trending aggregations, etc.).
I recall WMDE trying something similar a year ago (via PubSubHubbub) and getting vetoed by ops. If they are not aware yet, might be worth contacting them and asking if the new streaming service would cover their use cases (it was about Wikidata change invalidation on third-party wikis, I think).
Hey Gergo, thanks for the heads up!
The big question here is: how does it scale? Sending events to 100 clients may work, but does it work for 100 thousand?
And then there are several more important details to sort out: What's the granularity of subscription - a wiki? A page? Where does filtering by namespace etc happen? How big is the latency? How does recovery/re-sync work after disconnect/downtime?
I have not read the entire conversation, so the answers might already be there - my apologies if they are, just point me there.
Anyway, if anyone has a good solution for sending wiki-events to a large number of subscribers, yes, please let us (WMDE/Wikidata) know about it!
> The big question here is: how does it scale?
This new service is stateless and is backed by Kafka. So, theoretically at least, it should be horizontally scalable. (Add more Kafka brokers, add more service workers.)
> And then there are several more important details to sort out: What's the granularity of subscription

A topic, which is generically defined, and does not need to be tied to anything MediaWiki specific. If you are interested in recentchanges events, the granularity will be the same as RCStream.
(Well ok, technically the granularity is topic-partition. But for streams with low enough volume, topics will only have a single partition, so in practice the granularity is topic.)
> Where does filtering by namespace etc happen?

Filtering is not yet totally hammered out. We aren’t sure what kind of server-side filtering we want to actually support in production. Ideally we’d get real fancy and allow complex filtering, but there are likely performance and security concerns here. Even so, filtering will be configured by the client, and at the least you will be able to do glob filtering on any number of keys, and maybe an array of possible values. E.g. if you wanted to filter recentchanges events for plwiki and namespace == 0, the filters might look like: { "database": "plwiki", "page_namespace": 0 }
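As a purely illustrative sketch of those matching semantics (the final filter syntax, and how it is sent to the server, are still undecided):

    # Mirrors the example filter above; shown here as the equivalent
    # client-side check, since the server-side syntax is not final.
    filters = {'database': 'plwiki', 'page_namespace': 0}

    def matches(event, filters):
        # An event passes if every filtered key equals the requested value.
        return all(event.get(key) == value for key, value in filters.items())

    event = {'database': 'plwiki', 'page_namespace': 0, 'title': 'Przyklad'}
    print(matches(event, filters))   # True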
> How big is the latency?
For MediaWiki origin streams, in normal operation, probably around a few seconds. This highly depends on how many Kafka clusters we have to go through before the event gets to the one from which this service is backed. This isn’t productionized yet, so we aren’t totally sure which Kafka cluster these events will be served from.
> How does recovery/re-sync work after disconnect/downtime?
Events will be given to the client with their offsets in the stream. During connection, a client can configure the offset that it wants to start consuming at. This is kind of like seeking to a particular location in a file, but instead of a byte offset, you are starting at a certain event offset in the stream. In the future (when Kafka supports it), we will support timestamp-based subscription as well. E.g. ‘subscribe to recentchanges events starting at time T.’ This will only work as long as events at offset N or time T still exist in Kafka. Kafka is usually used as a rolling buffer from which old events are removed. We will at least keep events for 7 days, but at this time I don’t see a technical reason we couldn’t keep events for much longer.
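A rough sketch of how a client could resume after a disconnect, assuming an 'offset' request parameter, an 'offset' field on each event, and (for the sake of the example only) an SSE-over-HTTP transport; all of those names are assumptions, not a decided interface:

    import json
    import time
    import requests

    # Hypothetical stream URL.
    STREAM_URL = 'https://stream.example.org/v1/recentchanges'

    def consume_from(offset=None):
        params = {'offset': offset} if offset is not None else {}
        last = offset
        with requests.get(STREAM_URL, params=params, stream=True) as resp:
            for line in resp.iter_lines():
                if line.startswith(b'data: '):
                    event = json.loads(line[len(b'data: '):])
                    last = event['offset']       # remember how far we got
                    print(event)
        return last

    offset = None
    while True:
        try:
            offset = consume_from(offset)        # resume from the last seen offset
        except requests.RequestException:
            time.sleep(1)                        # back off briefly, then reconnect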
> Anyway, if anyone has a good solution for sending wiki-events to a large number of subscribers, yes, please let us (WMDE/Wikidata) know about it!

The first use case is not something at this scale. The upcoming production deployment will likely not be large enough to support many thousands of connections. BUT! There is no technical reason we couldn’t. If all goes well, and WMF can be convinced to buy enough hardware, this may be possible! :)
Hello,
Regarding Wikidata, it is important to make the distinction here between the WMF internal use and public-facing facilities. The underlying sub-system that the public event streams will be relying on is called EventBus [1], which currently comprises:
(i) The producer HTTP proxy service. It allows (internal) users to produce events using a REST HTTP interface. It also validates events against the currently-supported set of JSON event schemas [2].
(ii) The Kafka cluster, which is in charge of queuing the produced events and delivering them to consumer clients. The event streams are separated into topics, e.g. a revision-create topic, a page-move topic, etc.
(iii) The Change Propagation service [3]. It is the main Kafka consumer at this point. In its most basic form, it executes HTTP requests triggered by user-defined rules for certain topics. The aim of the service is to be able to update dependent entities starting from a resource/event. One example is recreating the needed data for a page when it is edited. When a user edits a page, ChangeProp receives an event in the revision-create topic and sends a no-cache request to RESTBase to render it. After RB has completed the request, another request is sent to the mobile content service to do the same, because the output of the mobile content service for a given page relies on the latest RB/Parsoid HTML.
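As a rough sketch of (i), producing an event through the proxy might look something like the following; the endpoint path, port and field names here are assumptions for illustration only (the authoritative interface is documented in [1] and the schemas in [2]):

    import requests

    # Assumed endpoint for the producer HTTP proxy (i); not the documented one.
    EVENTBUS_URL = 'http://eventbus.internal.example:8085/v1/events'

    event = {
        'meta': {
            'topic': 'mediawiki.revision-create',   # (ii) the Kafka topic it lands in
            'domain': 'pl.wikipedia.org',
            'dt': '2016-09-29T10:00:00Z',
        },
        'database': 'plwiki',
        'rev_id': 12345,
    }

    # The proxy validates each event against the JSON schema for its topic
    # before producing it to Kafka; invalid events are rejected.
    resp = requests.post(EVENTBUS_URL, json=[event])
    resp.raise_for_status()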
Currently, the biggest producer of events is MediaWiki itself. The aim of this e-mail thread is to add a fourth component to the system - public event stream consumption. However, for the Wikidata case, we think the Change Propagation service should be used (i.e. we need to keep it internal). If you recall, Daniel, we did kind of start talking about putting WD updates onto EventBus in Esino Lario.
In-lined the responses to your questions.
On 27 September 2016 at 14:50, Daniel Kinzler daniel.kinzler@wikimedia.de wrote:
> Hey Gergo, thanks for the heads up!
>
> The big question here is: how does it scale? Sending events to 100 clients may work, but does it work for 100 thousand?
Yes, it does. Albeit, not instantly. We limit the concurrency of execution to mitigate huge spikes and overloading the system. For example, Change Propagation handles template transclusions: when a template is edited, all of the pages it is transcluded in need to be re-rendered, i.e. their HTML has to be recreated. For important templates, that might mean re-rendering millions of pages. The queue is populated with the relevant pages and the backlog is "slowly" processed. "Slowly" here refers to the fact that at most X pages are re-rendered at the same time, where X is governed by the concurrency factor. In the concrete example of important templates, it usually takes a couple of days to go through the backlog of re-renders.
> And then there are several more important details to sort out: What's the granularity of subscription - a wiki? A page? Where does filtering by namespace etc happen?
As Andrew noted, the basic granularity is the topic, i.e. the type/schema of the events that are to be received. Roughly, that means that a consumer can obtain either all page edits, or page renames (for all WMF wikis) without performing any kind of filtering. Change Propagation, however, allows one to filter events out based on any of the fields contained in the events themselves, which means you are able to receive only events for a specific wiki, a specific page or namespace. For example, Change Propagation already handles situations where a Wikidata item is edited: it re-renders the page summaries for all pages that the given item is transcluded in, but does so only for the www.wikidata.org domain and namespace 0 [4].
> How big is the latency?
For MediaWiki events, the observed latency of acting on an event has been at most a couple of hundred milliseconds on average, but it is usually below that threshold. There are some events, though, which lag behind up to a couple of days, most notably big template updates / transclusions. This graph [5] plots Change Propagation's delay in processing the events for each defined rule. The "backlog per rule" metric measures the delay between event production and event consumption. Here, event production refers to the timestamp at which MediaWiki observed the event, while event consumption refers to the time that Change Propagation dequeues it from Kafka and starts executing it.
> How does recovery/re-sync work after disconnect/downtime?
Because relying on EventBus and, specifically, Change Propagation means consuming events via pushed HTTP requests, the receiving entity does not have to worry about this in this context (public event streams are a different matter, though). EventBus handles offsets internally, so even if Change Propagation stops working for some time or cannot connect to Kafka, it will resume processing events from where it left off once the pipeline is accessible again. If, on the other hand, the service receiving the HTTP requests is down or unreachable, Change Propagation has a built-in retry mechanism that is triggered to resend requests whenever an erroneous response is received from the service.
I hope this helps. I would be happy to talk about this specific topic some more.
Cheers, Marko
[1] https://www.mediawiki.org/wiki/EventBus
[2] https://github.com/wikimedia/mediawiki-event-schemas/tree/master/jsonschema
[3] https://www.mediawiki.org/wiki/Change_propagation
[4] https://github.com/wikimedia/mediawiki-services-change-propagation-deploy/bl...
[5] https://grafana.wikimedia.org/dashboard/db/eventbus?panelId=10&fullscree...
Thanks for the feedback everyone!
Due to the simplicity of the HTTP stream model, we are moving forward with that instead of websockets/socket.io. We hope to have an initial version of this serving existing EventBus events this quarter. Next we will focus on more features (filtering), and also work towards deprecating both RCStream and RCFeed.
You can follow the progress of this effort on Phabricator: https://phabricator.wikimedia.org/T130651
On Sun, Sep 25, 2016 at 10:02 AM, Merlijn van Deen (valhallasw) < valhallasw@arctus.nl> wrote:
You could consider not implementing streaming /at all/, and just ask clients to poll an http endpoint, which is much easier to implement client-side than anything streaming (especially when it comes to handling disconnects).
On the other hand, polling requires repeated TCP handshakes, repeated HTTP headers sent and received, all that work being done even when there aren't any new events, non-real-time reception of events (i.e. you only get events when you poll), and deciding on acceptable minimum values for the polling interval.
And chances are that clients that want to do polling are already doing it with the action API. ;) Although I don't know what events are planned to be made available from this new service to be able to say whether they're all already available via the action API.