Hi,
I wonder if there is any guidance about how to poll the recent changes
feed of a MediaWiki instance (in particular of a Wikibase one) to keep
up with its stream of edits? In particular, how to do this responsibly
(without hammering the server) and how to ensure that all changes are
seen by the consumer?
EditGroups (https://tools.wmflabs.org/editgroups/) currently uses the
WMF Event Stream to do this. That works well, but it has two downsides:
it is not available for non-WMF wikis, and it lacks server-side
filtering support. So I have been looking into implementing recent
changes polling in EditGroups, so that it can be run on other wikis.
So far it looks like my RC polling strategy misses some edits that the
WMF Event Stream includes, so I need to improve this. RC polling is
implemented in the WDQS updater here:
https://github.com/wikimedia/wikidata-query-rdf/blob/master/tools/src/main/…
Is this the best implementation to look at?
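To make the question concrete, here is a minimal sketch of the kind of
polling step I mean. It is not the EditGroups or WDQS code. It assumes
the standard list=recentchanges API module, re-reads a short overlap
window on every poll and deduplicates by rcid, so that late-arriving
entries are not skipped. The 60-second overlap, the User-Agent string
and the function names are my own assumptions, and timestamps are
assumed to be timezone-aware UTC.

```python
import json
import urllib.parse
import urllib.request
from datetime import datetime, timedelta, timezone

TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"
OVERLAP = timedelta(seconds=60)  # assumed safety margin, not a documented value


def http_fetch(api_url, params):
    """Fetch one page of results from api_url (e.g. a wiki's api.php)."""
    url = api_url + "?" + urllib.parse.urlencode(params)
    req = urllib.request.Request(url, headers={"User-Agent": "rc-poll-sketch/0.1"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def poll_once(fetch, last_seen, seen_rcids):
    """Run one polling step and return (new_changes, new_last_seen).

    fetch is any callable taking a dict of API parameters and returning
    the decoded JSON response. seen_rcids is a set that is updated in
    place; a real consumer would have to prune it periodically.
    """
    params = {
        "action": "query",
        "list": "recentchanges",
        "format": "json",
        "rcdir": "newer",  # oldest first, starting at rcstart
        "rcstart": (last_seen - OVERLAP).strftime(TS_FORMAT),
        "rcprop": "ids|title|timestamp|user",
        "rclimit": "500",
    }
    new_changes = []
    while True:
        data = fetch(params)
        for rc in data["query"]["recentchanges"]:
            if rc["rcid"] in seen_rcids:
                continue  # already delivered in an earlier, overlapping poll
            seen_rcids.add(rc["rcid"])
            new_changes.append(rc)
            ts = datetime.strptime(rc["timestamp"], TS_FORMAT)
            last_seen = max(last_seen, ts.replace(tzinfo=timezone.utc))
        if "continue" not in data:
            break
        # Follow API continuation until the whole window has been read.
        params = {**params, **data["continue"]}
    return new_changes, last_seen
```

A consumer would call poll_once in a loop with a sleep between calls,
passing something like
lambda p: http_fetch("https://example.org/w/api.php", p) as fetch
(that URL is a placeholder). Whether a fixed overlap is enough to
catch every edit is exactly what I am unsure about.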
And actually - is this really worth doing? Perhaps I should instead
require that the target Wikibase runs the EventLogging extension
(https://www.mediawiki.org/wiki/Extension:EventLogging), which exposes
the edit stream in a Kafka instance, and then implement a Kafka topic
consumer in EditGroups. It does add requirements on the Wikibase
instance, but if RC polling is brittle, it would be wrong to promise
that EditGroups can be run off a stock MediaWiki instance anyway.
(Note that I still think EditGroups is not a long-term solution. We need
a MediaWiki extension to replace it:
https://phabricator.wikimedia.org/T203557. I am just looking into this
to help our OpenRefine GSoC intern Lu Liu, who will be working on
Wikibase support in OpenRefine this summer.)
Cheers,
Antonin