Hi!
I'd like to raise the topic of handling change notifications and keeping Wiki page data up to date with respect to page props.
First, a little background on how I arrived at the issue. I maintain the Wikidata Query Service, which updates from Wikidata using the recent changes API and the RDF export format for Wikidata pages. Recently, we started using certain page properties, such as link and statement counts. This is when I discovered the issue: the page properties are not updated when the page (Wikidata item) is edited, but later, as I understand it, by a job.
Now, this leads to a situation where, when I have a recent changes entry and I look at the RDF export page (which now contains data derived from page props), I cannot know whether the page props data is up to date. Moreover, if the job updates the page props at some unknown and undefined later time, I get no notification, since the modification is not reflected in recent changes. This makes information derived from page props very hard to use: you never know if the data is stale or whether the data in page props matches the data in the page. The problem is described in more detail in https://phabricator.wikimedia.org/T145712
I'd like to find a solution for it, but I am not sure how to proceed. The data specific to this case can be easily generated from the data already present in memory during the page update, but I assume there were some reasons why it was deferred. We could send some kind of notification when page props are updated, though that would probably seriously increase the number of notifications and thus slow the updates. Also, in some cases the second notification may not be necessary, since the page props were updated before I processed the first one, but I have no way of knowing that now.
Any advice on how to solve this issue?
Could we emit a page/properties-change event to EventBus when page props are updated? Similar to how we emit an event for revision visibility changes:
https://github.com/wikimedia/mediawiki-event-schemas/blob/master/jsonschema/...
These events would be available to you as a stream from Kafka, or (soon) as a publicly consumable stream.
I like the idea of having a unique event for this sort of thing. There's a large class of annotations like this that happen on a shifted timescale. For example, abuse filter tags are applied after an edit is saved. If we are to build a queue of edits for review, we'd like to have up-to-date abuse filter tags too.
On Fri, Sep 23, 2016 at 8:25 AM, Andrew Otto otto@wikimedia.org wrote:
Could we emit a page/properties-change event to EventBus when page props are updated? Similar to how we emit an event for revision visibility changes:
https://github.com/wikimedia/mediawiki-event-schemas/blob/master/jsonschema/mediawiki/revision/visibility-change/1.yaml
These events would be available to you as a stream from Kafka, or (soon) as a publicly consumable stream.
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
On Fri, Sep 23, 2016 at 9:25 AM, Andrew Otto otto@wikimedia.org wrote:
Could we emit a page/properties-change event to EventBus when page props are updated?
I don't know how the event stuff works, but if you can do it by hooking hooks then 'LinksUpdateComplete' would likely be the hook to use.
Note that hook signals not just the page properties being updated, but also pagelinks, imagelinks, externallinks, langlinks, iwlinks (interwikis), templatelinks, and categorylinks.
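Since that hook fires for every kind of link-table update, whatever emits the event would presumably want to check that the page props themselves changed, which would also keep the number of extra notifications down. A minimal sketch of that check, written in Python for illustration rather than as a MediaWiki hook handler; the function names and the example prop names are assumptions, not an existing API:

```python
from typing import Dict, Optional


def page_props_delta(
    old: Dict[str, str], new: Dict[str, str]
) -> Dict[str, Dict[str, Optional[str]]]:
    """Return only the props that were added, removed, or changed."""
    delta: Dict[str, Dict[str, Optional[str]]] = {}
    for name in old.keys() | new.keys():
        before, after = old.get(name), new.get(name)
        if before != after:
            delta[name] = {"old": before, "new": after}
    return delta


def should_emit_event(old: Dict[str, str], new: Dict[str, str]) -> bool:
    """Emit a (hypothetical) properties-change event only on a real change."""
    return bool(page_props_delta(old, new))


# Example prop names are illustrative only.
before = {"wb-claims": "3"}
after = {"wb-claims": "4", "wb-sitelinks": "1"}
print(page_props_delta(before, after))
print(should_emit_event(before, before))
```

With a check like this, an update that only touches pagelinks or categorylinks would not produce a properties-change event.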
Hi!
Could we emit a page/properties-change event to EventBus when page props are updated? Similar to how we emit an event for revision visibility changes:
This, however, is still missing a part because, as I understand it, EventBus is not seekable. That is, if I have data up to date as of timepoint T, and I am now at timepoint N, I can scan the recent changes list from T to N and know whether a certain item X has changed or not. However, since the recent changes list has no entries for page props, and events that went through EventBus before N are lost to me, I have no idea if page props for X changed between T and N. To know that, I need a permanent, seekable record of changes. Or at least some flag that says when the props were last updated.
Unless of course I'm missing the part where you can seek back on EventBus events; in that case, please point me to the API that allows doing so.
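For reference, the T-to-N scan against recent changes can be sketched roughly as below. This only builds the request URL; paging through results with the continuation parameter is left out. The helper name is made up for illustration, and the endpoint is assumed to be the public Wikidata API:

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

# Assumed endpoint: the public Wikidata API.
API_URL = "https://www.wikidata.org/w/api.php"


def recent_changes_url(start: datetime, end: datetime, limit: int = 500) -> str:
    """Build a list=recentchanges query for changes from start (T) to end (N).

    Both datetimes must be timezone-aware. With rcdir=newer the API lists
    changes oldest first, so rcstart is the earlier timestamp.
    """
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcdir": "newer",
        "rcstart": start.astimezone(timezone.utc).strftime(fmt),
        "rcend": end.astimezone(timezone.utc).strftime(fmt),
        "rcprop": "title|ids|timestamp",
        "rclimit": str(limit),
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


t = datetime(2016, 9, 9, tzinfo=timezone.utc)
n = datetime(2016, 9, 23, 12, 30, tzinfo=timezone.utc)
print(recent_changes_url(t, n))
```

Nothing in the result of such a query says whether the page props of a listed page were refreshed later, which is the gap described above.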
You can seek back on EventBus events, but not permanently (by default, only up to 1 week). If you want to respond to changes in an event stream, you should consume the full event stream realtime and react to the events as they come in. A proper Stream Processing system (like Flink or Spark Streaming) could help with this, but we don’t have that right now. But, I think for your use case, you don’t need a big stream processing system, as this stream will be relatively small, and you don’t need fancy features like time based windowing. You just need to update something based on an event, right?
The change-propagation service that the Services team is building can help you with this. It allows you to consume events, and specify matching rules and actions to take based on those rules.
https://www.mediawiki.org/wiki/Change_propagation
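To make "seekable, but not permanently" concrete, here is a toy in-memory model of a log with a retention window. It is not the Kafka or EventBus API; the class and method names are invented for illustration, and timestamps are plain seconds:

```python
import bisect
from typing import Any, List

DAY = 86400.0


class RetentionLog:
    """Toy append-only log that forgets events older than a retention window."""

    def __init__(self, retention_seconds: float) -> None:
        self.retention_seconds = retention_seconds
        self._timestamps: List[float] = []
        self._events: List[Any] = []
        # Anything before this point may already have been dropped.
        self._horizon = float("-inf")

    def append(self, timestamp: float, event: Any) -> None:
        """Add an event; timestamps are assumed to be non-decreasing."""
        self._timestamps.append(timestamp)
        self._events.append(event)
        cutoff = timestamp - self.retention_seconds
        self._horizon = max(self._horizon, cutoff)
        drop = bisect.bisect_left(self._timestamps, cutoff)
        del self._timestamps[:drop]
        del self._events[:drop]

    def events_since(self, since: float) -> List[Any]:
        """Return all events at or after `since`, or fail if that is too old."""
        if since < self._horizon:
            raise LookupError("requested time is older than the retention window")
        start = bisect.bisect_left(self._timestamps, since)
        return self._events[start:]


log = RetentionLog(retention_seconds=7 * DAY)
log.append(0 * DAY, "a")
log.append(3 * DAY, "b")
log.append(8 * DAY, "c")
print(log.events_since(2 * DAY))  # events "b" and "c" are still retained
```

A consumer that was offline for longer than the window gets an error instead of a silently incomplete answer, so it would have to fall back to some other source, such as a fresh dump.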
On Fri, Sep 23, 2016 at 2:55 PM, Stas Malyshev smalyshev@wikimedia.org wrote:
Hi!
Could we emit a page/properties-change event to EventBus when page props are updated? Similar to how we emit an event for revision visibility changes:
This, however, is still missing a part because, as I understand it, EventBus is not seekable. That is, if I have data up to date as of timepoint T, and I am now at timepoint N, I can scan the recent changes list from T to N and know whether a certain item X has changed or not. However, since the recent changes list has no entries for page props, and events that went through EventBus before N are lost to me, I have no idea if page props for X changed between T and N. To know that, I need a permanent, seekable record of changes. Or at least some flag that says when the props were last updated.
Unless of course I'm missing the part where you can seek back on EventBus events; in that case, please point me to the API that allows doing so.
-- Stas Malyshev smalyshev@wikimedia.org
Hi!
You can seek back on EventBus events, but not permanently (by default, only up to 1 week). If you want to respond to changes in an event stream, you
1 week is not enough for this use case, but if it could be extended to, say, 1 month, that could be workable.
The reason is that the starting point for a WDQS server install is a Wikidata dump, which is made weekly. The server then catches up on the data that changed between the dump point and the current moment. However, there could be dump failures or other conditions that make the most recent dump unusable. It also takes time to load the dump itself. So the delta between the current moment and the data in a freshly deployed WDQS server could be 2 weeks or even more. We need to be able to catch up on the changes since then. We probably will never need the full month, but it's the conservative limit we're using now for how far back we can ask for data. 2 weeks would probably work too, even if it could make some scenarios more complicated to handle.
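As a rough back-of-the-envelope check of those numbers: the weekly dump interval and the 1 week, 2 week and 1 month windows are from the discussion above, while the 2-day load time and the single failed dump are made-up example values:

```python
from datetime import timedelta


def worst_case_lag(
    dump_interval: timedelta, missed_dumps: int, load_time: timedelta
) -> timedelta:
    """Oldest data a freshly loaded server may hold.

    The newest usable dump can be up to one interval old, each failed dump
    adds another interval, and loading the dump adds its own delay.
    """
    return dump_interval * (missed_dumps + 1) + load_time


lag = worst_case_lag(
    dump_interval=timedelta(weeks=1),
    missed_dumps=1,  # assumption: the most recent dump was unusable
    load_time=timedelta(days=2),  # assumption: example load time
)
for name, window in [
    ("1 week", timedelta(weeks=1)),
    ("2 weeks", timedelta(weeks=2)),
    ("1 month", timedelta(days=30)),
]:
    print(name, "is enough" if lag <= window else "is too short", "for a lag of", lag)
```

Under these example values the lag is 16 days, which exceeds both the 1 week default and a 2 week window but fits comfortably within a month.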
should consume the full event stream realtime and react to the events as they come in. A proper Stream Processing system (like Flink or Spark
This is not possible for the WDQS Updater. Since a WDQS server is completely independent of Wikidata, it can be started and stopped at any time. There's no way to ensure that, whenever something changes in Wikidata, all WDQS instances that are interested in this change are up and running. There needs to be an intermediary system that keeps the data. So far the recent changes API has served as this system, but since it does not know about secondary data, it's no longer enough.
this stream will be relatively small, and you don’t need fancy features like time based windowing. You just need to update something based on an event, right?
Well, I need something event-based that I can ask: "give me all events that happened since time point T", with T being anywhere from a second ago to 2 weeks ago.
The change-propagation service that the Services team is building can help you with this. It allows you to consume events, and specify matching rules and actions to take based on those rules.
I see no mention of the ability to consume past events. Is that possible?