Hi,
Is it possible to briefly summarize the added value (or supported use cases) of switching to PubSubHubbub? The edit stream in Wikidata is so huge that I can hardly think of anyone wanting to be in *real-time* sync with Wikidata. With 20 p/s, their infrastructure would need to be pretty scalable to not break.
Maybe I am biased by DBpedia, but in some experiments on English Wikipedia we found that the ideal update interval with OAI-PMH was ~5 minutes. OAI aggregates multiple revisions of a page into a single edit, so when we ask "get me the items that changed in the last 5 minutes", we skip the processing of many minor edits.
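To make the aggregation point concrete, here is a minimal sketch of what such a polling window buys us. It is not the OAI-PMH extension's code: the `Revision` type, the `changed_pages` function, and the sample item IDs and timestamps are all hypothetical, made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Revision:
    page: str       # page title or item ID (hypothetical sample values below)
    rev_id: int     # revision ID, assumed to increase with every edit
    timestamp: int  # seconds since some reference point


def changed_pages(revisions, since, until):
    """Return the newest revision per page with since < timestamp <= until."""
    latest = {}
    for rev in revisions:
        if since < rev.timestamp <= until:
            current = latest.get(rev.page)
            if current is None or rev.rev_id > current.rev_id:
                latest[rev.page] = rev
    return latest


# Five edits inside one 5-minute (300 s) window, touching only two pages.
edits = [
    Revision("Q42", 101, 10),
    Revision("Q42", 102, 40),
    Revision("Q64", 103, 90),
    Revision("Q42", 104, 200),
    Revision("Q64", 105, 290),
]

window = changed_pages(edits, since=0, until=300)
print(sorted((r.page, r.rev_id) for r in window.values()))
# prints [('Q42', 104), ('Q64', 105)]
```

A consumer polling every 5 minutes processes two pages here instead of five edits; a push-per-edit model would deliver all five.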
It looks like we lose this option with PubSubHubbub, right? As we already asked before, does PubSubHubbub support mirroring a Wikidata clone? The OAI-PMH extension has this option.
Best, Dimitris
On Tue, Jul 8, 2014 at 11:31 AM, Daniel Kinzler <daniel.kinzler@wikimedia.de> wrote:
Replying to myself because I forgot to mention an important detail:
On 08.07.2014 10:22, Daniel Kinzler wrote:
On 08.07.2014 01:46, Rob Lanphier wrote:
On Fri, Jul 4, 2014 at 7:16 AM, Lydia Pintscher <lydia.pintscher@wikimedia.de> wrote:
...
Hi Lydia,
Thanks for providing the basic overview of this. Could you (or someone on the team) provide an explanation about how you would like this to be configured on the Wikimedia cluster?
We'd like to enable it just on Wikidata at first, but I see no reason not to enable it for all projects if that goes well.
The PubSubHubbub (PuSH) extension would be configured to push notifications to the Google hub (two per edit). The hub then notifies any subscribers via their callback URLs.
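For anyone unfamiliar with the subscriber side, here is a rough sketch of the callback's verification step, as I read the PubSubHubbub 0.4 spec: the hub sends a GET with `hub.mode`, `hub.topic` and `hub.challenge`, and the subscriber confirms by echoing the challenge with a 2xx status (or answers 404 to refuse). The function name `verify_intent` and the topic URL are hypothetical, and this is not taken from the extension itself.

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical topic URL, used only for this illustration.
TOPIC = "https://example.org/wiki/Special:Export/Q42"


def verify_intent(query_string, expected_topic):
    """Return (status, body) for a hub verification GET on the callback URL."""
    params = parse_qs(query_string)
    mode = params.get("hub.mode", [""])[0]
    topic = params.get("hub.topic", [""])[0]
    challenge = params.get("hub.challenge", [""])[0]
    known_mode = mode in ("subscribe", "unsubscribe")
    if known_mode and topic == expected_topic and challenge:
        # Echoing the challenge confirms that we asked for this subscription.
        return 200, challenge
    # A 404 tells the hub that we do not agree with the request.
    return 404, ""


query = urlencode({
    "hub.mode": "subscribe",
    "hub.topic": TOPIC,
    "hub.challenge": "abc123",
})
print(verify_intent(query, TOPIC))
```

Once verified, the hub delivers change notifications as POST requests to the same callback URL; handling those is left out of this sketch.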
We need a proxy to be set up to allow the app servers to talk to the Google hub. If this is deployed at full scale, we expect in excess of 20 POST requests per second (two per edit), plus up to the same number (but probably fewer) of GET requests coming back from the hub, asking for the full page content of every page changed, as XML export, from a special page interface similar to Special:Export. This would probably bypass the web cache.
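As a back-of-the-envelope check of those numbers, the sketch below works through the arithmetic. The only inputs are the figures quoted above (20 POSTs per second, two POSTs per edit, at most as many GETs as POSTs); the variable names and the 5-minute window are assumptions for illustration, and since the POST rate is stated as "in excess of" 20, the results are lower bounds.

```python
POSTS_PER_EDIT = 2       # two notifications pushed to the hub per edit
post_rate = 20           # POST requests per second (stated lower bound)

# Edits per second implied by the POST rate.
edit_rate = post_rate / POSTS_PER_EDIT

# The hub may fetch page content for up to as many requests as were posted.
max_get_rate = post_rate

# Worst case: outbound POSTs plus inbound GETs, per second.
peak_request_rate = post_rate + max_get_rate

# Edits accumulated over a 5-minute polling window, for comparison.
edits_per_window = edit_rate * 5 * 60

print(edit_rate, peak_request_rate, edits_per_window)
# prints 10.0 40 3000.0
```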
PubSubHubbub is nice and simple, but it's really designed for news feeds, not for versioned content of massive collaborative sites. It works, but it's not as efficient as we could wish.
-- Daniel Kinzler Senior Software Developer
Wikimedia Deutschland Gesellschaft zur Förderung Freien Wissens e.V.
Wikidata-tech mailing list Wikidata-tech@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-tech