2013/4/26 Yuri Astrakhan <yastrakhan@wikimedia.org>

Recently I spoke with Wikia, and being able to subscribe to the recent changes feed is a very important feature to them. Apparently, polling the API's recent changes puts much more load on the system than subscribing.

Now, we don't need to publish all the data from the start, just the fact that certain items have changed; those items can later be requested by the usual means. But it would be good to implement this system for all of the API, not just Wikidata.

On Fri, Apr 26, 2013 at 3:13 AM, Dimitris Kontokostas <jimkont@gmail.com> wrote:

Dear Jeremy, all,

In addition to what Sebastian said, in DBpedia Live we use the OAI-PMH protocol to get update feeds for the English, German & Dutch Wikipedia.

This OAI-PMH implementation [1] is very convenient for what we need (and I guess for most people) because it uses the latest modification date for update publishing.

So when we ask for updates after time X it returns a list of articles with modification date after X, no matter how many times they were edited in between.

This is very easy for you to support (no need for extra hardware, just an extra table / index) and best suited for most use cases.
What most people need in the end is to know which pages changed since time X. Fine-grained details are for special types of clients.
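
To illustrate the "changed since X" semantics described above, here is a minimal Python sketch. It models the extra table as an in-memory dict keyed by page title; the function name `changed_since` and the sample edits are made up for illustration and are not taken from the OAIRepository extension.

```python
from datetime import datetime

def changed_since(edits, since):
    """Return {title: latest modification time} for pages modified after `since`.

    `edits` is an iterable of (title, timestamp) pairs; a page edited many
    times appears only once, with its most recent timestamp.
    """
    latest = {}
    for title, ts in edits:
        # Keep only the newest timestamp per page (the "extra table / index").
        if title not in latest or ts > latest[title]:
            latest[title] = ts
    return {title: ts for title, ts in latest.items() if ts > since}

# Hypothetical edit log: "Berlin" is edited three times after X.
edits = [
    ("Berlin", datetime(2013, 4, 26, 8, 0)),
    ("Leipzig", datetime(2013, 4, 25, 12, 0)),
    ("Berlin", datetime(2013, 4, 26, 9, 30)),
    ("Berlin", datetime(2013, 4, 26, 9, 45)),
]
result = changed_since(edits, datetime(2013, 4, 26, 0, 0))
print(result)
```

However many times "Berlin" was edited in between, the client sees it once and fetches the page once.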

Best,
Dimitris

[1] http://www.mediawiki.org/wiki/Extension:OAIRepository

On Fri, Apr 26, 2013 at 9:40 AM, Sebastian Hellmann <hellmann@informatik.uni-leipzig.de> wrote:

Dear Jeremy,
please read the email from Daniel Kinzler on this list from 26.03.2013 18:26:

* A dispatcher needs about 3 seconds to dispatch 1000 changes to a client wiki.
* Considering we have ~300 client wikis, this means one dispatcher can handle
about 4000 changes per hour.
* We currently have two dispatchers running in parallel (on a single box, hume),
that makes a capacity of 8000 changes/hour.
* We are seeing roughly 17000 changes per hour on wikidata.org - more than twice
our dispatch capacity.
* I want to try running 6 dispatcher processes; that would give us the capacity
to handle 24000 changes per hour (assuming linear scaling).
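
The arithmetic behind these figures can be checked with a short Python sketch. The constants come from the list above; the function names are made up for illustration, and linear scaling across dispatchers is assumed, as in the original estimate.

```python
import math

SECONDS_PER_BATCH = 3      # time to dispatch one batch to one client wiki
BATCH_SIZE = 1000          # changes per batch
CLIENT_WIKIS = 300         # approximate number of client wikis
OBSERVED_PER_HOUR = 17000  # rough change rate on wikidata.org

def capacity_per_hour(dispatchers):
    """Changes per hour that `dispatchers` processes can push to all wikis."""
    # One batch must reach every client wiki: 3 s * 300 = 900 s per 1000 changes.
    seconds_per_batch_all_wikis = SECONDS_PER_BATCH * CLIENT_WIKIS
    per_dispatcher = 3600 * BATCH_SIZE // seconds_per_batch_all_wikis
    return dispatchers * per_dispatcher  # assumes linear scaling

def dispatchers_needed(changes_per_hour):
    """Smallest number of dispatchers whose capacity covers the given rate."""
    return math.ceil(changes_per_hour / capacity_per_hour(1))

for n in (1, 2, 6):
    print(n, "dispatcher(s):", capacity_per_hour(n), "changes/hour")
print("needed for observed load:", dispatchers_needed(OBSERVED_PER_HOUR))
```

Under these assumptions, five dispatchers would just cover the observed 17000 changes per hour, so six leaves some headroom.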

1. Somebody needs to run the Hub and it needs to scale. It looks like the protocol was intended to save some traffic, not to dispatch a massive number of messages per day to a large number of clients. Again, I am not familiar with how efficient PubSubHubbub is. What kind of hardware is needed to run this effectively? Do you have experience with this?

2. Somebody will still need to run and maintain the Hub and feed all clients. I was offering to host one of the hubs for DBpedia users, but I am not sure whether we have that capacity.

So should we use the IRC RC feed plus an HTTP request to the changed page as a fallback?

Sebastian

On 26.04.2013 08:06, Jeremy Baron wrote:

Hi,

On Fri, Apr 26, 2013 at 5:29 AM, Sebastian Hellmann
<hellmann@informatik.uni-leipzig.de> wrote:

Well, PubSubHubbub is a nice idea. However it clearly depends on two factors:
1. whether Wikidata sets up such an infrastructure (I need to check whether we have capacities, I am not sure atm)

Capacity for what? The infrastructure should not be a problem.
(Famous last words; I can look more closely tomorrow, but I'm really not
worried about it.) And you don't need any infrastructure at all for
development; just use one of Google's public instances.

2. whether performance is good enough to handle high-volume publishers

Again, how do you mean?

Basically, polling recent changes [1] and then doing an HTTP request to the individual pages should be fine for a start. So I guess this is what we will implement if there aren't any better suggestions.
The whole issue is problematic, and the DBpedia project would be happy if this were discussed and decided right now, so we can plan development.
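
A rough Python sketch of the polling approach just described, assuming the standard MediaWiki `list=recentchanges` query module. The endpoint URL, the helper names and the sample response are assumptions for illustration; no network request is made here, and a real client would also need to handle continuation and errors.

```python
from urllib.parse import urlencode

API_URL = "https://www.wikidata.org/w/api.php"  # assumed endpoint

def recent_changes_url(since, limit=500):
    """Build a recent-changes query for everything changed at or after `since`."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcstart": since,          # ISO 8601 timestamp to start from
        "rcdir": "newer",          # enumerate from `since` towards now
        "rcprop": "title|timestamp",
        "rclimit": limit,
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)

def distinct_titles(response):
    """Collapse repeated edits so each changed page is fetched only once."""
    changes = response["query"]["recentchanges"]
    return list(dict.fromkeys(change["title"] for change in changes))

# Hypothetical, already-parsed JSON response with a page edited twice.
sample = {"query": {"recentchanges": [
    {"title": "Q64", "timestamp": "2013-04-26T08:00:00Z"},
    {"title": "Q2079", "timestamp": "2013-04-26T08:01:00Z"},
    {"title": "Q64", "timestamp": "2013-04-26T08:02:00Z"},
]}}
print(recent_changes_url("2013-04-26T00:00:00Z"))
print(distinct_titles(sample))
```

Each title returned by `distinct_titles` would then be requested individually, which is the second HTTP step mentioned above.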

What is the best practice to get updates from Wikipedia at the moment?

I believe just about everyone uses the IRC feed from
irc.wikimedia.org.
https://meta.wikimedia.org/wiki/IRC/Channels#Raw_feeds

I imagine wikidata will or maybe already does propagate changes to a
channel on that server but I can imagine IRC would not be a good
method for many Instant data repo users. Some will not be able to
sustain a single TCP connection for extended periods, some will not be
able to use IRC ports at all, and some may go offline periodically.
e.g. a server on a laptop. AIUI, PubSubHubbub has none of those
problems and is better than the current IRC solution in just about
every way.
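
For readers unfamiliar with the protocol, here is a minimal Python sketch of the subscriber side, based on my reading of the PubSubHubbub spec: the subscriber POSTs a form-encoded request to the hub, and the hub then verifies intent by sending a GET to the callback, which must echo `hub.challenge`. The topic and callback URLs and the function names are hypothetical, and nothing here talks to a real hub.

```python
from urllib.parse import urlencode, parse_qs

TOPIC = "https://example.org/feeds/recentchanges"          # hypothetical topic
CALLBACK = "https://subscriber.example.org/push-callback"  # hypothetical callback

def subscribe_body(callback, topic, verify=None):
    """Form-encoded body of the subscription request sent to the hub."""
    fields = {
        "hub.mode": "subscribe",
        "hub.callback": callback,
        "hub.topic": topic,
    }
    if verify is not None:
        # "sync" or "async"; required by the 0.3 spec, dropped in 0.4.
        fields["hub.verify"] = verify
    return urlencode(fields)

def verify_intent(query_string, expected_topic):
    """Answer the hub's verification GET: (status code, response body)."""
    params = parse_qs(query_string)
    mode = params.get("hub.mode", [""])[0]
    topic = params.get("hub.topic", [""])[0]
    challenge = params.get("hub.challenge", [""])[0]
    if mode == "subscribe" and topic == expected_topic and challenge:
        return 200, challenge  # echo the challenge to confirm the subscription
    return 404, ""             # refuse subscriptions we did not ask for

print(subscribe_body(CALLBACK, TOPIC, verify="async"))
```

After verification the hub pushes new content to the callback over plain HTTP, so the subscriber needs neither a long-lived TCP connection nor access to IRC ports.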

We could potentially even replace the current cross-DB job queue
insert craziness with PubSubHubbub for use on the cluster internally.

-Jeremy

_______________________________________________
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l

--
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Projects: http://nlp2rdf.org , http://linguistics.okfn.org , http://dbpedia.org/Wiktionary , http://dbpedia.org
Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
Research Group: http://aksw.org

--
Kontokostas Dimitris

--
Project director Wikidata
Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
Tel. +49-30-219 158 26-0 | http://wikimedia.de

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V. Registered in the register of associations of the Amtsgericht Berlin-Charlottenburg under number 23855 B. Recognized as a charitable organization by the Finanzamt für Körperschaften I Berlin, tax number 27/681/51985.