> The big question here is: how does it scale?
This new service is stateless and is backed by Kafka. So, theoretically at least, it should be horizontally scalable. (Add more Kafka brokers, add more service workers.)
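To make the horizontal-scaling claim concrete, here is a minimal sketch (using the kafka-python client; the topic, group, and broker names are hypothetical) of how service workers could share a topic via a Kafka consumer group. Starting more workers with the same group_id automatically rebalances the topic's partitions across them:

    from kafka import KafkaConsumer

    def handle_event(raw_value: bytes) -> None:
        # Placeholder: push the event out to this worker's connected clients.
        print(raw_value)

    # Each service worker joins the same consumer group. Kafka assigns each
    # worker a disjoint subset of the topic's partitions, so adding workers
    # (up to the partition count) spreads the load automatically.
    consumer = KafkaConsumer(
        'recentchanges',
        group_id='stream-service-workers',
        bootstrap_servers=['kafka-broker:9092'],
    )

    for message in consumer:
        handle_event(message.value)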
> And then there are several more important details to sort out: What's the granularity of subscription - a wiki? A page?

A topic, which is generically defined and does not need to be tied to anything MediaWiki specific. If you are interested in recentchanges events, the granularity will be the same as RCStream.
(Well ok, technically the granularity is topic-partition. But for streams with low enough volume, topics will only have a single partition, so in practice the granularity is topic.)
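For illustration, here is what the two granularities look like in kafka-python (topic and broker names are again hypothetical): subscribing at topic granularity versus pinning a single topic-partition.

    from kafka import KafkaConsumer, TopicPartition

    # Topic granularity: consume all partitions of the topic.
    topic_consumer = KafkaConsumer('recentchanges',
                                   bootstrap_servers=['kafka-broker:9092'])

    # Topic-partition granularity: explicitly assign one partition.
    # (Only relevant for high-volume streams with multiple partitions.)
    partition_consumer = KafkaConsumer(bootstrap_servers=['kafka-broker:9092'])
    partition_consumer.assign([TopicPartition('recentchanges', 0)])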
> Where does filtering by namespace etc happen?

Filtering is not yet totally hammered out. We aren't sure what kind of server-side filtering we actually want to support in production. Ideally we'd get real fancy and allow complex filtering, but there are likely performance and security concerns there. Even so, filtering will be configured by the client, and at the least you will be able to do glob filtering on any number of keys, and maybe on an array of possible values. E.g. if you wanted to filter recentchanges events for plwiki and namespace == 0, the filters might look like: { "database": "plwiki", "page_namespace": 0 }
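Since the server-side semantics aren't settled, here is a rough sketch of the kind of glob matching on event keys that the example filter implies; the filter format and matching rules here are assumptions, not a spec.

    from fnmatch import fnmatch

    def event_matches(event: dict, filters: dict) -> bool:
        """Return True if every filter key glob-matches the event's value.

        A filter value may also be a list of acceptable patterns
        (one of the possibilities mentioned above).
        """
        for key, pattern in filters.items():
            value = str(event.get(key))
            patterns = pattern if isinstance(pattern, list) else [pattern]
            if not any(fnmatch(value, str(p)) for p in patterns):
                return False
        return True

    # The example filter from above: plwiki recentchanges in the main namespace.
    filters = {'database': 'plwiki', 'page_namespace': 0}
    event = {'database': 'plwiki', 'page_namespace': 0, 'title': 'Przykład'}
    assert event_matches(event, filters)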
> How big is the latency?
For MediaWiki-origin streams, in normal operation, probably a few seconds. This depends heavily on how many Kafka clusters an event has to pass through before it reaches the cluster that backs this service. This isn't productionized yet, so we aren't totally sure which Kafka cluster these events will be served from.
> How does recovery/re-sync work after disconnect/downtime?
Events will be given to the client with their offsets in the stream. During connection, a client can configure the offset at which it wants to start consuming. This is kind of like seeking to a particular location in a file, except that instead of a byte offset, you are starting at a certain event offset in the stream. In the future (once Kafka supports it), we will support timestamp-based subscription as well, e.g. 'subscribe to recentchanges events starting at time T'. This will only work as long as the events at offset N or time T still exist in Kafka. Kafka is usually used as a rolling buffer from which old events are removed. We will keep events for at least 7 days, and at this time I don't see a technical reason we couldn't keep them for much longer.
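The public subscription API isn't finalized, but the underlying Kafka mechanics look roughly like this in kafka-python (topic name and offset are hypothetical): a client that remembers the last offset it saw can seek back to it after a disconnect.

    from kafka import KafkaConsumer, TopicPartition

    consumer = KafkaConsumer(bootstrap_servers=['kafka-broker:9092'])
    partition = TopicPartition('recentchanges', 0)
    consumer.assign([partition])

    # Resume from the last offset this client saw before disconnecting.
    # (Works only while that offset is still within Kafka's retention window.)
    last_seen_offset = 123456  # hypothetical, remembered by the client
    consumer.seek(partition, last_seen_offset + 1)

    for message in consumer:
        print(message.offset, message.value)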
> Anyway, if anyone has a good solution for sending wiki-events to a large number of subscribers, yes, please let us (WMDE/Wikidata) know about it!

The first use case is not at that scale: the upcoming production deployment will likely not be large enough to support many thousands of connections. BUT! There is no technical reason we couldn't. If all goes well, and WMF can be convinced to buy enough hardware, this may be possible! :)
On Tue, Sep 27, 2016 at 3:50 PM, Daniel Kinzler <daniel.kinzler@wikimedia.de> wrote:
Hey Gergo, thanks for the heads up!
The big question here is: how does it scale? Sending events to 100 clients may work, but does it work for 100 thousand?
And then there are several more important details to sort out: What's the granularity of subscription - a wiki? A page? Where does filtering by namespace etc happen? How big is the latency? How does recovery/re-sync work after disconnect/downtime?
I have not read the entire conversation, so the answers might already be there - my apologies if they are, just point me there.
Anyway, if anyone has a good solution for sending wiki-events to a large number of subscribers, yes, please let us (WMDE/Wikidata) know about it!
On 26.09.2016 at 22:07, Gergo Tisza wrote:
On Mon, Sep 26, 2016 at 5:57 AM, Andrew Otto <otto@wikimedia.org> wrote:
A public resumable stream of Wikimedia events would allow folks outside of WMF networks to build realtime stream processing tooling on top of our data. Folks with their own Spark or Flink or Storm clusters (in Amazon or labs or wherever) could consume this and perform complex stream processing (e.g. machine learning algorithms (like ORES), windowed trending aggregations, etc.).
I recall WMDE trying something similar a year ago (via PubSubHubbub) and getting vetoed by ops. If they are not aware yet, might be worth contacting them and asking if the new streaming service would cover their use cases (it was about Wikidata change invalidation on third-party wikis, I think).
--
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.