Please read the email from Daniel Kinzler on this list from 26.03.2013 18:26:
* A dispatcher needs about 3 seconds to dispatch 1000
changes to a client wiki.
* Considering we have ~300 client wikis, this means one dispatcher can handle
about 4000 changes per hour.
* We currently have two dispatchers running in parallel (on a single box, hume),
that makes a capacity of 8000 changes/hour.
* We are seeing roughly 17000 changes per hour on wikidata.org - more than
twice our dispatch capacity.
* I want to try running 6 dispatcher processes; that would give us the capacity
to handle 24000 changes per hour (assuming linear scaling).
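The arithmetic behind these figures can be sketched as follows (constants taken from the numbers quoted above; the function name is illustrative):

```python
# Back-of-the-envelope dispatch capacity, using Daniel Kinzler's figures.
SECONDS_PER_1000_CHANGES_PER_WIKI = 3   # one dispatcher, one client wiki
CLIENT_WIKIS = 300
OBSERVED_CHANGES_PER_HOUR = 17000       # current load on wikidata.org

def capacity_per_hour(dispatchers):
    # Time for one dispatcher to push 1000 changes to every client wiki:
    # 3 s * 300 wikis = 900 s, i.e. 1000 changes per 15 minutes.
    seconds_per_1000 = SECONDS_PER_1000_CHANGES_PER_WIKI * CLIENT_WIKIS
    per_dispatcher = 1000 * 3600 / seconds_per_1000  # = 4000 changes/hour
    return dispatchers * per_dispatcher              # assumes linear scaling

print(capacity_per_hour(2))  # 8000.0  -- current setup, below observed load
print(capacity_per_hour(6))  # 24000.0 -- proposed setup
print(capacity_per_hour(2) >= OBSERVED_CHANGES_PER_HOUR)  # False
```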
1. Somebody needs to run the Hub, and it needs to scale. It looks like the
protocol was intended to save some traffic, not to dispatch a massive
number of messages per day to a large number of clients. Again, I am
not familiar with how efficient PubSubHubbub is. What kind of hardware is
needed to run this effectively? Do you have experience with this?
2. Somebody will still need to run and maintain the Hub and feed all
clients. I was offering to host one of the hubs for DBpedia users, but I
am not sure whether we have that capacity.
So should we use the IRC recent-changes feed plus an HTTP request to the changed page as a fallback?
On 26.04.2013 08:06, Jeremy Baron wrote:
On Fri, Apr 26, 2013 at 5:29 AM, Sebastian Hellmann wrote:
Well, PubSubHubbub is a nice idea. However it
clearly depends on two factors:
1. whether Wikidata sets up such an infrastructure (I need to check whether we
have the capacity; I am not sure at the moment)
Capacity for what? The infrastructure should not
be a problem.
(famous last words, can look more closely tomorrow. but I'm really not
worried about it) And you don't need any infrastructure at all for
development; just use one of google's public instances.
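For development against a public hub, subscribing is just a form-encoded POST per the PubSubHubbub spec (the hub then verifies the callback with a GET challenge). A minimal sketch, assuming one of Google's public hub instances; the topic and callback URLs here are placeholders, not real endpoints:

```python
from urllib.parse import urlencode

# One of Google's public PubSubHubbub hub instances (assumed endpoint).
HUB = "https://pubsubhubbub.appspot.com/"

def subscription_body(topic, callback):
    """Form-encoded body for a subscription POST to the hub,
    per the PubSubHubbub specification."""
    return urlencode({
        "hub.mode": "subscribe",
        "hub.topic": topic,        # the feed we want pushed to us
        "hub.callback": callback,  # our endpoint; the hub verifies it via GET
    })

body = subscription_body("https://example.org/updates.atom",
                         "https://example.org/push-endpoint")
print(body)
```

An actual subscriber would POST this body to HUB with Content-Type application/x-www-form-urlencoded and answer the hub's verification challenge.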
2. whether performance is good enough to handle [...]
Again, how do you mean?
Basically, polling recent changes and then
making an HTTP request to the individual pages should be fine for a start. So I guess this is
what we will implement if there aren't any better suggestions.
The whole issue is problematic, and the DBpedia project would be happy if this were
discussed and decided right now, so we can plan development.
What is the best practice to get updates from Wikipedia at the moment?
Just about everyone uses the IRC feed from irc.wikimedia.org.
I imagine Wikidata will, or maybe already does, propagate changes to a
channel on that server, but I can imagine IRC would not be a good
method for many instant data repo users. Some will not be able to
sustain a single TCP connection for extended periods, some will not be
able to use IRC ports at all, and some may go offline periodically,
e.g. a server on a laptop. AIUI, PubSubHubbub has none of those
problems and is better than the current IRC solution in just about
every way.
We could potentially even replace the current cross-DB job queue
insert craziness with PubSubHubbub for use on the cluster internally.
Wikidata-l mailing list
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Research Group: http://aksw.org