Hey folks :)
Wikimedia Germany has been working with a team of students over the past months. Among other things, they have developed a PubSubHubbub extension. The idea is to let third parties easily subscribe to changes made on Wikidata or any other wiki: Wikidata's changes get sent to a hub, which then notifies all subscribers who are interested in the change. We'd like to get this extension deployed for Wikidata and possibly later for other Wikimedia projects. For this it needs some more eyes to review it. RobLa suggested I send an email to wikitech-l about it. The review bugs for this are https://bugzilla.wikimedia.org/show_bug.cgi?id=67117 and https://bugzilla.wikimedia.org/show_bug.cgi?id=67118. Thanks for your help getting this ready for deployment.
Cheers Lydia
Hi Lydia,
I was wondering whether you are aware of the ResourceSync framework [1] for web-based resource synchronization that was specified in 2012-2013 by NISO and the Open Archives Initiative, which previously specified the Protocol for Metadata Harvesting (OAI-PMH) [2] and Object Reuse and Exchange (OAI-ORE) [3].
ResourceSync contains pull-oriented mechanisms for resource synchronization, specified in ANSI/NISO Z39.99-2014 [4]. But it also contains push-oriented notification mechanisms that are based on PubSubHubbub. While fully compatible with PubSubHubbub, ResourceSync's notifications do not require a feed on the side of the party that notifies about changes. Rather, notifications are pushed directly to a hub that relays them to subscribers. The notification spec [5] is currently in beta. Compliant software is at [6].
ResourceSync also includes provisions to convey metadata and links pertaining to a resource that is subject to synchronization, as well as mechanisms that allow third parties to discover which synchronization capabilities a server supports.
For a quick intro to ResourceSync, see the 16-minute video [7] or the slide deck [8].
I felt like bringing ResourceSync to your attention as it seems relevant to the work you are doing.
Greetings
Herbert Van de Sompel
Digital Library Research & Prototyping
Los Alamos National Laboratory, Research Library
http://public.lanl.gov/herbertv/
@hvdsomp
==
[1] http://www.openarchives.org/rs/toc
[2] http://www.openarchives.org/OAI/openarchivesprotocol.html
[3] http://www.openarchives.org/ore/1.0/
[4] http://www.openarchives.org/rs/1.0/resourcesync
[5] http://www.openarchives.org/rs/notification/0.9/notification
[6] https://github.com/resync/resourcesync_push
[7] https://www.youtube.com/watch?v=ASQ4jMYytsA
[8] http://www.slideshare.net/hvdsomp/resource-sync-overview-33045191
On Fri, Jul 4, 2014 at 7:16 AM, Lydia Pintscher <lydia.pintscher@wikimedia.de> wrote:
...
Hi Lydia,
Thanks for providing the basic overview of this. Could you (or someone on the team) provide an explanation of how you would like this to be configured on the Wikimedia cluster? Is this something that you see anyone being able to subscribe to, or would this be something that would only be available to a limited list of third parties?
Also, based on our last conversation, it sounds like we're up against some time constraints here with respect to the students' time; could you clarify?
Thanks Rob
On 08.07.2014 01:46, Rob Lanphier wrote:
On Fri, Jul 4, 2014 at 7:16 AM, Lydia Pintscher <lydia.pintscher@wikimedia.de> wrote:
...
Hi Lydia,
Thanks for providing the basic overview of this. Could you (or someone on the team) provide an explanation of how you would like this to be configured on the Wikimedia cluster?
We'd like to enable it just on Wikidata at first, but I see no reason not to enable it for all projects if that goes well.
The PubSubHubbub (PuSH) extension would be configured to push notifications to the Google hub (two per edit). The hub then notifies any subscribers via their callback URLs.
Is this something that you see anyone being able to subscribe to, or would this be something that would only be available to a limited list of third parties?
As far as I know, this is up to the hub to decide, but in our case, anyone could subscribe.
Note that users would subscribe to the hub, meaning users would expose their IP address to Google. However, subscribing means registering your domain and callback URL with Google; the hub is supposed to push to the subscriber, so it needs the subscriber's IP in some form or other.
The subscription URL (at the Google hub) is advertised in the HTML head of every wiki page (please correct me if I got the details wrong).
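To illustrate, a subscriber would discover the hub from those link tags and send a standard PuSH 0.4 subscription request; something like the following Python sketch (the page URL, callback URL, and lease time are made up, and the regex is just a stand-in for a real HTML parser):

    import re
    import requests

    PAGE_URL = "https://www.wikidata.org/wiki/Q42"   # hypothetical topic page
    CALLBACK = "https://example.org/push-callback"   # hypothetical; must be reachable by the hub

    html = requests.get(PAGE_URL).text

    def find_link(rel, html):
        # crude stand-in for an HTML parser; assumes rel appears before href
        m = re.search(r'<link[^>]*rel="%s"[^>]*href="([^"]+)"' % rel, html)
        return m.group(1) if m else None

    hub_url = find_link("hub", html)                 # the hub URL advertised in the head
    topic_url = find_link("self", html) or PAGE_URL  # canonical topic URL

    # Subscription request per the PuSH 0.4 spec. The hub answers
    # 202 Accepted and then verifies intent with a GET against the
    # callback carrying a hub.challenge the callback must echo back.
    resp = requests.post(hub_url, data={
        "hub.callback": CALLBACK,
        "hub.mode": "subscribe",
        "hub.topic": topic_url,
        "hub.lease_seconds": "864000",
    })
    print(resp.status_code)  # 202 means the hub accepted the request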
Also, based on our last conversation, it sounds like we're up against some time constraints here with respect to the students' time; could you clarify?
The project ends by the end of next week. After that, it becomes less and less likely that the students will still be around. I of course hope they will be maintaining their projects for years to come, but if we want to be sure to get a quick response, it would be good to get this reviewed ASAP.
Replying to myself because I forgot to mention an important detail:
On 08.07.2014 10:22, Daniel Kinzler wrote:
...
The PubSubHubbub (PuSH) extension would be configured to push notifications to the Google hub (two per edit). The hub then notifies any subscribers via their callback URLs.
We need a proxy to be set up to allow the app servers to talk to the Google hub. If this is deployed at full scale, we expect in excess of 20 POST requests per second (two per edit), plus up to the same number (but probably fewer) of GET requests coming back from the hub, asking for the full page content of every changed page, as XML export, from a special page interface similar to Special:Export. This would probably bypass the web cache.
PubSubHubbub is nice and simple, but it's really designed for news feeds, not for versioned content of massive collaborative sites. It works, but it's not as efficient as we could wish.
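For reference, the per-edit ping itself is trivial; the cost is all in the volume. A rough Python sketch of what such a notification to the Google reference hub would look like (the topic URLs are hypothetical, and I'm assuming the two POSTs per edit correspond to two advertised topics):

    import requests

    HUB = "https://pubsubhubbub.appspot.com/"  # the Google reference hub

    def notify_hub(topic_url):
        # The publisher only tells the hub that the topic changed. The hub
        # then fetches the new content itself (the GET against the
        # Special:Export-like page mentioned above) and pushes it out
        # to subscribers.
        resp = requests.post(HUB, data={
            "hub.mode": "publish",
            "hub.url": topic_url,
        })
        resp.raise_for_status()  # the hub answers 204 No Content on success

    # hypothetical topics; perhaps one for the page and one for its feed
    notify_hub("https://www.wikidata.org/wiki/Q42")
    notify_hub("https://www.wikidata.org/wiki/Special:Export/Q42")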
Hi,
Could you briefly describe the added value (or supported use cases) of switching to PubSubHubbub? The edit stream in Wikidata is so huge that I can hardly think of anyone wanting to be in *real-time* sync with Wikidata. With 20 requests/s, their infrastructure would have to be pretty scalable not to break.
Maybe I am biased by DBpedia, but in some experiments on English Wikipedia we found that the ideal OAI-PMH update interval was every ~5 minutes. OAI aggregates multiple revisions of a page into a single change, so when we ask "get me the items that changed in the last 5 minutes" we skip the processing of many minor edits.
It looks like we lose this option with PubSubHubbub, right? As we already asked before, does PubSubHubbub support mirroring a Wikidata clone? The OAI-PMH extension has this option.
Best, Dimitris
On 09.07.2014 08:14, Dimitris Kontokostas wrote:
Hi,
Could you briefly describe the added value (or supported use cases) of switching to PubSubHubbub?
* It's easier to handle than OAI, because it uses the standard dump format.
* It's also push-based, avoiding constant polling on small wikis.
* The OAI extension has been deprecated for a long time now.
The edit stream in Wikidata is so huge that I can hardly think of anyone wanting to be in *real-time* sync with Wikidata. With 20 requests/s, their infrastructure would have to be pretty scalable not to break.
The "push" aspect is probably most useful for small wikis. It's true, for large wikis, you could just poll, since you would hardly ever poll in vain.
It would be very nice if the sync could be filtered by namespace, category, etc. But PubSubHubbub (I'll use "PuSH" from now on) doesn't really support this, sadly.
Maybe I am biased by DBpedia, but in some experiments on English Wikipedia we found that the ideal OAI-PMH update interval was every ~5 minutes. OAI aggregates multiple revisions of a page into a single change, so when we ask "get me the items that changed in the last 5 minutes" we skip the processing of many minor edits. It looks like we lose this option with PubSubHubbub, right?
I'm not quite positive on this point, but I think with PuSH, this is done by the hub. If the hub gets 20 notifications for the same resource in one minute, it will only grab and distribute the latest version, not all 20.
But perhaps someone from the PuSH development team could confirm this.
As we already asked before, does PubSubHubbub support mirroring a Wikidata clone? The OAI-PMH extension has this option.
Yes, there is a client extension for PuSH, allowing for seamless replication of one wiki into another, including creation and deletion (I don't know about moves/renames).
On Wed, Jul 9, 2014 at 6:13 PM, Daniel Kinzler <daniel.kinzler@wikimedia.de> wrote:
...
It looks like we lose this option with PubSubHubbub, right?
I'm not quite positive on this point, but I think with PuSH, this is done by the hub. If the hub gets 20 notifications for the same resource in one minute, it will only grab and distribute the latest version, not all 20.
But perhaps someone from the PuSH development team could confirm this.
It'd be great if the dev team could confirm this. Besides push notifications, is polling an option in PuSH? I skimmed through the spec but couldn't find this.
On 09.07.2014 19:39, Dimitris Kontokostas wrote:
...
It'd be great if the dev team could confirm this. Besides push notifications, is polling an option in PuSH? I skimmed through the spec but couldn't find this.
Yes. You can just poll the interface that the hub uses to fetch new data.
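Something along these lines, using conditional GETs so polls that find nothing new stay cheap (the topic URL and handler are made up for illustration):

    import time
    import requests

    TOPIC = "https://www.wikidata.org/wiki/Special:Export/Q42"  # hypothetical topic URL

    def process(xml_text):
        # hypothetical handler for the exported XML
        print("got %d bytes of export XML" % len(xml_text))

    etag = None
    while True:
        headers = {"If-None-Match": etag} if etag else {}
        resp = requests.get(TOPIC, headers=headers)
        if resp.status_code == 200:          # changed since the last poll
            etag = resp.headers.get("ETag")
            process(resp.text)
        # on 304 Not Modified there is nothing new to do
        time.sleep(300)                      # e.g. the ~5 minute interval mentioned earlier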
-- daniel
On Thu, Jul 10, 2014 at 3:50 PM, Daniel Kinzler <daniel.kinzler@wikimedia.de> wrote:
...
Yes. You can just poll the interface that the hub uses to fetch new data.
Thanks for the info, Daniel.
I'm still waiting for the dev team to confirm the revision merging, and I have one last question / use case.
Since you'll sync to an external server (at Google, right?), did you set any requirements on the durability of the changesets? I mean, are the changes stored *forever*, or did you set a TTL? E.g. my application breaks for a week and I want to resume, or I download a one-month-old dump and want to get in sync, etc.
In OAI-PMH I could, for instance, set the date to 15/01/2001 and get all pages by modification date. In PuSH this would require some sort of importing and is probably out of the question, right? :)
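To be concrete, the kind of catch-up request I mean looks roughly like this (the endpoint and metadataPrefix are what I'd expect from the OAI extension, so treat them as assumptions):

    import requests

    # endpoint exposed by the (now deprecated) OAI extension; illustrative
    OAI_ENDPOINT = "https://en.wikipedia.org/w/index.php?title=Special:OAIRepository"

    resp = requests.get(OAI_ENDPOINT, params={
        "verb": "ListRecords",
        "metadataPrefix": "mediawiki",   # assumed name of the full-page XML format
        "from": "2001-01-15T00:00:00Z",  # resume harvesting from an arbitrary date
    })
    print(resp.text[:500])  # XML with one record per changed page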
Cheers, Dimitris