Hello,
Regarding Wikidata, it is important to make the distinction here between the WMF-internal use and the public-facing facilities. The underlying sub-system that the public event streams will be relying on is called EventBus~[1], which (currently) consists of:
(i) The producer HTTP proxy service. It allows (internal) users to produce events using a REST HTTP interface. It also validates events against the currently-supported set of JSON event schemas~[2].

(ii) The Kafka cluster, which is in charge of queuing the produced events and delivering them to consumer clients. The event streams are separated into topics, e.g. a revision-create topic, a page-move topic, etc.

(iii) The Change Propagation service~[3]. It is the main Kafka consumer at this point. In its most basic form, it executes HTTP requests triggered by user-defined rules for certain topics. The aim of the service is to be able to update dependent entities starting from a resource/event. One example is recreating the needed data for a page when it is edited. When a user edits a page, ChangeProp receives an event in the revision-create topic and sends a no-cache request to RESTBase to render it. After RB has completed the request, another request is sent to the mobile content service to do the same, because the output of the mobile content service for a given page relies on the latest RB/Parsoid HTML.
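As a rough illustration, producing an event through the HTTP proxy amounts to POSTing a schema-conforming JSON document. A minimal sketch follows; the endpoint URL, field names, and topic name here are illustrative assumptions, not the actual EventBus API, and real events must conform to the schemas in [2]:

```python
import json
import urllib.request

def build_revision_create_event(domain, title, rev_id):
    """Assemble a minimal revision-create-style event.

    Field names are illustrative stand-ins for the real JSON schemas.
    """
    return {
        "meta": {
            "topic": "mediawiki.revision-create",  # Kafka topic for the event
            "domain": domain,
        },
        "page_title": title,
        "rev_id": rev_id,
    }

def send_event(event, proxy_url="http://eventbus.example:8085/v1/events"):
    """POST the event to the producer HTTP proxy (hypothetical URL).

    The proxy validates the payload against the schema before it is
    queued in Kafka.
    """
    req = urllib.request.Request(
        proxy_url,
        data=json.dumps([event]).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)

event = build_revision_create_event("en.wikipedia.org", "Main_Page", 12345)
```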
Currently, the biggest producer of events is MediaWiki itself. The aim of this e-mail thread is to add a fourth component to the system - public event stream consumption. However, for the Wikidata case, we think the Change Propagation service should be used (i.e. we need to keep it internal). If you recall, Daniel, we did kind of start talking about putting WD updates onto EventBus in Esino Lario.
In-lined the responses to your questions.
On 27 September 2016 at 14:50, Daniel Kinzler daniel.kinzler@wikimedia.de wrote:
Hey Gergo, thanks for the heads up!
The big questions here is: how does it scale? Sending events to 100 clients may work, but does it work for 100 thousand?
Yes, it does. Albeit not instantly. We limit the concurrency of execution to mitigate huge spikes and overloading the system. For example, Change Propagation handles template transclusions: when a template is edited, all of the pages it is transcluded in need to be re-rendered, i.e. their HTMLs have to be recreated. For important templates, that might mean re-rendering millions of pages. The queue is populated with the relevant pages and the backlog is "slowly" processed. "Slowly" here refers to the fact that at most X pages are re-rendered at the same time, where X is governed by the concurrency factor. In the concrete example of important templates, it usually takes a couple of days to go through the backlog of re-renders.
And then there's several more important details to sort out: What's the granularity of subscription - a wiki? A page? Where does filtering by namespace etc happen?
As Andrew noted, the basic granularity is the topic, i.e. the type/schema of the events that are to be received. Roughly, that means that a consumer can obtain either all page edits, or page renames (for all WMF wikis) without performing any kind of filtering. Change Propagation, however, allows one to filter events out based on any of the fields contained in the events themselves, which means you are able to receive only events for a specific wiki, a specific page or namespace. For example, Change Propagation already handles situations where a Wikidata item is edited: it re-renders the page summaries for all pages that the given item is transcluded in, but does so only for the www.wikidata.org domain and namespace 0~[4].
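That kind of filtering boils down to matching event fields against a rule. A schematic sketch; the dict-based rule syntax here is invented for illustration and is not ChangeProp's actual configuration format:

```python
def matches(rule, event):
    """True if every field constraint in the rule is satisfied by the event."""
    return all(event.get(field) == wanted for field, wanted in rule.items())

# Only act on main-namespace edits on www.wikidata.org, as in [4].
wikidata_rule = {"domain": "www.wikidata.org", "page_namespace": 0}

events = [
    {"domain": "www.wikidata.org", "page_namespace": 0, "page_title": "Q42"},
    {"domain": "en.wikipedia.org", "page_namespace": 0, "page_title": "Earth"},
    {"domain": "www.wikidata.org", "page_namespace": 120, "page_title": "P31"},
]
selected = [e for e in events if matches(wikidata_rule, e)]
```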
How big is the latency?
For MediaWiki events, the observed latency of acting on an event has been at most a couple of hundred milliseconds on average, but it is usually below that threshold. There are some events, though, which lag behind by up to a couple of days, most notably big template updates / transclusions. This graph~[5] plots Change Propagation's delay in processing the events for each defined rule. The "backlog per rule" metric measures the delay between event production and event consumption. Here, event production refers to the timestamp at which MediaWiki recorded the event, while event consumption refers to the time at which Change Propagation dequeues it from Kafka and starts executing it.
How does recovery/re-sync work after disconnect/downtime?
Because relying on EventBus and, specifically, Change Propagation means consuming events via push HTTP requests, the receiving entity does not have to worry about this in this context (public event streams are a different matter, though). EventBus handles offsets internally, so even if Change Propagation stops working for some time or cannot connect to Kafka, it will resume processing events from where it left off once the pipeline is accessible again. If, on the other hand, the service receiving the HTTP requests is down or unreachable, Change Propagation has a built-in retry mechanism that is triggered to resend requests whenever an erroneous response is received from the service.
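The retry mechanism can be pictured as re-issuing the request until a successful response arrives, with a delay between attempts. A toy sketch; the backoff policy and attempt limit here are illustrative assumptions, not ChangeProp's actual configuration:

```python
import time

def deliver_with_retries(send, max_attempts=5, base_delay=0.01):
    """Call `send()` until it reports success (2xx) or attempts run out.

    `send` stands in for the HTTP request ChangeProp makes to the
    consuming service; it returns an HTTP-like status code.
    """
    for attempt in range(max_attempts):
        status = send()
        if 200 <= status < 300:
            return attempt + 1  # number of attempts used
        time.sleep(base_delay * 2 ** attempt)  # exponential backoff
    raise RuntimeError("giving up after %d attempts" % max_attempts)

# Simulate a service that is down for two requests, then recovers.
responses = iter([503, 503, 200])
attempts = deliver_with_retries(lambda: next(responses))
```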
I hope this helps. I'd be happy to talk about this specific topic some more.
Cheers, Marko
I have not read the entire conversation, so the answers might already be there - my apologies if they are, just point me there.
Anyway, if anyone has a good solution for sending wiki-events to a large number of subscribers, yes, please let us (WMDE/Wikidata) know about it!
Am 26.09.2016 um 22:07 schrieb Gergo Tisza:
On Mon, Sep 26, 2016 at 5:57 AM, Andrew Otto otto@wikimedia.org wrote:
A public resumable stream of Wikimedia events would allow folks outside of WMF networks to build realtime stream processing tooling on top of our data. Folks with their own Spark or Flink or Storm clusters (in Amazon or labs or wherever) could consume this and perform complex stream processing (e.g. machine learning algorithms (like ORES), windowed trending aggregations, etc.).
I recall WMDE trying something similar a year ago (via PubSubHubbub) and getting vetoed by ops. If they are not aware yet, might be worth contacting them and asking if the new streaming service would cover their use cases (it was about Wikidata change invalidation on third-party wikis, I think).
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
-- Daniel Kinzler Senior Software Developer
Wikimedia Deutschland Gesellschaft zur Förderung Freien Wissens e.V.
[1] https://www.mediawiki.org/wiki/EventBus
[2] https://github.com/wikimedia/mediawiki-event-schemas/tree/master/jsonschema
[3] https://www.mediawiki.org/wiki/Change_propagation
[4] https://github.com/wikimedia/mediawiki-services-change-propagation-deploy/bl...
[5] https://grafana.wikimedia.org/dashboard/db/eventbus?panelId=10&fullscree...