> The big question here is: how does it scale?
This new service is stateless and backed by Kafka. So, theoretically at
least, it should be horizontally scalable: add more Kafka brokers, add
more service workers.
> And then there are several more important details to sort out: What's
> the granularity of subscription?
A topic, which is generically defined and does not need to be tied to
anything MediaWiki-specific. If you are interested in recentchanges
events, the granularity will be the same as RCStream.
(Well, ok, technically the granularity is topic-partition. But for streams
with low enough volume, topics will have only a single partition, so in
practice the granularity is topic.)
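To illustrate what topic-partition granularity means (this is a toy
in-memory model, not the real service's API; the Event structure and
the exact topic names are assumptions):

```python
# Toy model of topic-partition granularity (illustrative only; the
# Event structure and topic names here are assumptions, not the real
# service's API).
from dataclasses import dataclass

@dataclass
class Event:
    topic: str       # the subscription granularity
    partition: int   # low-volume topics have only partition 0
    offset: int      # position within the partition
    data: dict

stream = [
    Event("mediawiki.recentchange", 0, 0, {"database": "plwiki"}),
    Event("mediawiki.revision-create", 0, 0, {"database": "enwiki"}),
    Event("mediawiki.recentchange", 0, 1, {"database": "enwiki"}),
]

# Subscribing to a topic means receiving every event in that topic,
# regardless of which wiki or page the event is about.
subscribed = [e for e in stream if e.topic == "mediawiki.recentchange"]
print([e.offset for e in subscribed])  # [0, 1]
```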
> Where does filtering by namespace etc. happen?
Filtering is not yet totally hammered out. We aren't sure what kind of
server-side filtering we want to actually support in production. Ideally
we'd get real fancy and allow complex filtering, but there are likely
performance and security concerns there. Even so, filtering will be
configured by the client, and at the least you will be able to do glob
filtering on any number of keys, and maybe on an array of possible values.
E.g., if you wanted to filter recentchanges events for plwiki and namespace
== 0, the filters might look like:
{
  "database": "plwiki",
  "page_namespace": 0
}
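Since the server-side semantics aren't settled, here's a client-side
sketch of what that matching could do (the exact semantics are an
assumption: glob matching on string values, plus an optional list of
acceptable values):

```python
# Sketch of client-configured filtering (assumed semantics, not the
# final server-side implementation).
from fnmatch import fnmatch

def matches(event: dict, filters: dict) -> bool:
    """True if every filter key matches the event.

    A filter value may be a glob string, a non-string literal, or a
    list of acceptable values.
    """
    for key, wanted in filters.items():
        value = event.get(key)
        if isinstance(wanted, list):
            if value not in wanted:
                return False
        elif isinstance(wanted, str):
            if not fnmatch(str(value), wanted):
                return False
        elif value != wanted:
            return False
    return True

filters = {"database": "plwiki", "page_namespace": 0}
assert matches({"database": "plwiki", "page_namespace": 0}, filters)
assert not matches({"database": "enwiki", "page_namespace": 0}, filters)
# A glob could match a whole family of wikis, e.g. every "*wiki":
assert matches({"database": "plwiki"}, {"database": "*wiki"})
```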
> How big is the latency?
For MediaWiki-origin streams, in normal operation, probably around a few
seconds. This highly depends on how many Kafka clusters an event has to go
through before it gets to the one this service is backed by. This isn't
productionized yet, so we aren't totally sure which Kafka cluster these
events will be served from.
> How does recovery/re-sync work after disconnect/downtime?
Events will be given to the client with their offsets in the stream.
During connection, a client can configure the offset at which it wants to
start consuming. This is kind of like seeking to a particular location in
a file, but instead of a byte offset, you are starting at a certain event
offset in the stream. In the future (when Kafka supports it), we will
support timestamp-based subscription as well, e.g. 'subscribe to
recentchanges events starting at time T'. This will only work as long as
the events at offset N or time T still exist in Kafka. Kafka is usually
used as a rolling buffer from which old events are removed. We will keep
events for at least 7 days, but at this time I don't see a technical
reason we couldn't keep events for much longer.
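To sketch that resume behavior with a toy in-memory rolling buffer (the
real service would do this against Kafka; the class and method names
here are made up for illustration):

```python
# Toy rolling buffer demonstrating offset-based resume after a
# disconnect (illustrative only; not the real service's API).
from collections import deque

class RollingBuffer:
    def __init__(self, capacity: int):
        self.events = deque(maxlen=capacity)  # old events fall off

    def append(self, offset: int, data: str):
        self.events.append((offset, data))

    def consume_from(self, offset: int):
        """Resume at a given event offset, like seeking in a file."""
        return [(o, d) for o, d in self.events if o >= offset]

buf = RollingBuffer(capacity=5)
for i in range(8):          # offsets 0..7; 0..2 age out of the buffer
    buf.append(i, f"event-{i}")

# The client saw up through offset 4 before disconnecting, so it
# resumes at offset 5.
resumed = buf.consume_from(5)
print(resumed)  # [(5, 'event-5'), (6, 'event-6'), (7, 'event-7')]

# Resuming at an offset that has already aged out of the buffer
# yields only whatever still remains (offsets 3..7 here).
print(buf.consume_from(0))
```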
> Anyway, if anyone has a good solution for sending wiki-events to a large
> number of subscribers, yes, please let us (WMDE/Wikidata) know about it!
The first use case is not at that scale. The upcoming production
deployment will likely not be large enough to support many thousands of
connections. BUT! There is no technical reason we couldn't. If all goes
well, and WMF can be convinced to buy enough hardware, this may be
possible! :)
On Tue, Sep 27, 2016 at 3:50 PM, Daniel Kinzler
<daniel.kinzler(a)wikimedia.de> wrote:
> Hey Gergo, thanks for the heads up!
>
> The big questions here is: how does it scale? Sending events to 100
> clients may
> work, but does it work for 100 thousand?
>
> And then there's several more important details to sort out: What's the
> granularity of subscription - a wiki? A page? Where does filtering by
> namespace
> etc happen? How big is the latency? How does recovery/re-sync work after
> disconnect/downtime?
>
> I have not read the entire conversation, so the answers might already be
> there -
> my appologies if they are, just point me there.
>
> Anyway, if anyone has a good solution for sending wiki-events to a large
> number of subscribers, yes, please let us (WMDE/Wikidata) know about it!
>
> Am 26.09.2016 um 22:07 schrieb Gergo Tisza:
> > On Mon, Sep 26, 2016 at 5:57 AM, Andrew Otto <otto(a)wikimedia.org> wrote:
> >
> >> A public resumable stream of Wikimedia events would allow folks
> >> outside of WMF networks to build realtime stream processing tooling on
> top
> >> of our data. Folks with their own Spark or Flink or Storm clusters (in
> >> Amazon or labs or wherever) could consume this and perform complex
> stream
> >> processing (e.g. machine learning algorithms (like ORES), windowed
> trending
> >> aggregations, etc.).
> >>
> >
> > I recall WMDE trying something similar a year ago (via PubSubHubbub) and
> > getting vetoed by ops. If they are not aware yet, might be worth
> contacting
> > them and asking if the new streaming service would cover their use cases
> > (it was about Wikidata change invalidation on third-party wikis, I
> think).
> > _______________________________________________
> > Wikitech-l mailing list
> > Wikitech-l(a)lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
> >
>
>
> --
> Daniel Kinzler
> Senior Software Developer
>
> Wikimedia Deutschland
> Gesellschaft zur Förderung Freien Wissens e.V.
>