Dear Jeremy, all,
In addition to what Sebastian said, in DBpedia Live we use the OAI-PMH
protocol to get update feeds for the English, German & Dutch Wikipedia.
This OAI-PMH implementation is very convenient for what we need (and
I guess for most people) because it uses the latest modification date for
each article. So when we ask for updates after time X it returns a list of
articles with a modification date after X, no matter how many times they
were edited in between. This is very easy for you to support (no need for
extra hardware, just an extra table / index) and best suited for most use cases.
What most people need in the end is to know which pages changed since
time X. Fine-grained details are for special types of clients.
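For illustration, a minimal sketch of that polling pattern; the base URL and the canned response below are placeholders, not the real Wikipedia OAI endpoint:

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def build_update_request(base_url, since):
    """OAI-PMH ListRecords request for everything modified after
    `since` (an ISO-8601 UTC timestamp)."""
    return base_url + "?" + urlencode({
        "verb": "ListRecords",
        "metadataPrefix": "oai_dc",  # metadata format; wikis may offer others
        "from": since,
    })

def changed_identifiers(xml_text):
    """Identifiers from a ListRecords response: each article appears
    once with its latest datestamp, however often it was edited."""
    root = ET.fromstring(xml_text)
    return [h.findtext(OAI_NS + "identifier")
            for h in root.iter(OAI_NS + "header")]

# Canned response for illustration (not real data).
SAMPLE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords><record><header>
    <identifier>oai:wiki:Page1</identifier>
    <datestamp>2013-04-26T09:00:00Z</datestamp>
  </header></record></ListRecords>
</OAI-PMH>"""
```

The client only has to remember the timestamp of its last successful poll and pass it as `from` on the next request.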
On Fri, Apr 26, 2013 at 9:40 AM, Sebastian Hellmann wrote:
Please read the email from Daniel Kinzler on this list from 26.03.2013 18:26:
> * A dispatcher needs about 3 seconds to dispatch 1000 changes to a
> client wiki.
> * Considering we have ~300 client wikis, this means one dispatcher can
> handle about 4000 changes per hour.
> * We currently have two dispatchers running in parallel (on a single
> box, hume), that makes a capacity of 8000 changes/hour.
> * We are seeing roughly 17000 changes per hour on wikidata.org; more
> than twice our dispatch capacity.
> * I want to try running 6 dispatcher processes; that would give us the
> capacity to handle 24000 changes per hour (assuming linear scaling).
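Those figures are consistent with each other; as a quick sanity check (all inputs taken from the quoted mail, not measured independently):

```python
seconds_per_1000 = 3       # seconds to push 1000 changes to ONE client wiki
client_wikis = 300         # ~number of client wikis

# Pushing a batch of 1000 changes to every client wiki:
seconds_per_batch = seconds_per_1000 * client_wikis   # 900 s per 1000 changes
per_dispatcher = 1000 * 3600 // seconds_per_batch     # 4000 changes/hour
two_dispatchers = 2 * per_dispatcher                  # 8000 changes/hour
six_dispatchers = 6 * per_dispatcher                  # 24000 changes/hour
```

So with ~17000 changes/hour arriving, even six dispatchers only barely keep up if scaling is linear.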
1. Somebody needs to run the Hub and it needs to scale. It looks like the
protocol was intended to save some traffic, not to dispatch a massive
amount of messages per day to a large number of clients. Again, I am not
familiar with how efficient PubSubHubbub is. What kind of hardware is needed
to run this effectively? Do you have experience with this?
2. Somebody will still need to run and maintain the Hub and feed all
clients. I was offering to host one of the hubs for DBpedia users, but I am
not sure whether we have that capacity.
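For context, the subscriber side at least is cheap; a sketch of the subscription request a client would send to a hub, following the PubSubHubbub 0.3 spec (all URLs here are placeholders):

```python
from urllib.parse import urlencode
from urllib.request import Request

def subscribe_request(hub_url, topic_url, callback_url):
    """Build the POST a subscriber sends to a PubSubHubbub hub.

    After this, the hub verifies the callback URL and then pushes
    updates to it -- the client needs a reachable HTTP endpoint,
    not a standing TCP/IRC connection."""
    body = urlencode({
        "hub.mode": "subscribe",
        "hub.topic": topic_url,        # the feed to follow
        "hub.callback": callback_url,  # where the hub pushes updates
        "hub.verify": "async",         # hub confirms out of band
    }).encode()
    return Request(hub_url, data=body, headers={
        "Content-Type": "application/x-www-form-urlencoded"})
```

The open question above is the hub side: how much hardware it takes to fan out ~17000 changes/hour to many such callbacks.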
So should we use the IRC RC feed + an HTTP request to the changed page as a fallback?
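That fallback can be quite small; a sketch assuming the standard MediaWiki api.php recentchanges list (the endpoint URL and the canned response are illustrative):

```python
import json
from urllib.parse import urlencode

API = "https://www.wikidata.org/w/api.php"  # any MediaWiki api.php works

def recent_changes_url(since):
    """Poll the recentchanges list for edits back to `since`
    (ISO-8601); results run newest-to-oldest and stop at rcend."""
    return API + "?" + urlencode({
        "action": "query",
        "list": "recentchanges",
        "rcprop": "title|timestamp",
        "rclimit": "500",
        "rcend": since,   # oldest timestamp to return
        "format": "json",
    })

def changed_titles(response_text):
    """Deduplicated titles from a recentchanges response; each title
    is then fetched once with an ordinary HTTP request."""
    data = json.loads(response_text)
    seen, titles = set(), []
    for rc in data["query"]["recentchanges"]:
        if rc["title"] not in seen:
            seen.add(rc["title"])
            titles.append(rc["title"])
    return titles

# Canned response for illustration (not real data).
SAMPLE_RESPONSE = """{"query": {"recentchanges": [
    {"title": "Q1", "timestamp": "2013-04-26T09:05:00Z"},
    {"title": "Q2", "timestamp": "2013-04-26T09:02:00Z"},
    {"title": "Q1", "timestamp": "2013-04-26T09:00:00Z"}]}}"""
```

Deduplicating by title gives the same "which pages changed since X" semantics as the OAI-PMH feed, just via polling.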
Am 26.04.2013 08:06, schrieb Jeremy Baron:
> On Fri, Apr 26, 2013 at 5:29 AM, Sebastian Hellmann
>> Well, PubSubHubbub is a nice idea. However it clearly depends on two
>> 1. whether Wikidata sets up such an infrastructure (I need to check
>> whether we have capacities, I am not sure atm)
> Capacity for what? The infrastructure should not be a problem.
> (famous last words, can look more closely tomorrow. but I'm really not
> worried about it) And you don't need any infrastructure at all for
> development; just use one of google's public instances.
>> 2. whether performance is good enough to handle high-volume publishers
> Again, how do you mean?
>> Basically, polling recent changes and then doing an HTTP request
>> to the individual pages should be fine for a start. So I guess this is what
>> we will implement if there aren't any better suggestions.
>> The whole issue is problematic and the DBpedia project would be
>> happy if this were discussed and decided right now, so we can plan
>> accordingly. What is the best practice to get updates from Wikipedia at the moment?
> I believe just about everyone uses the IRC feed from irc.wikimedia.org.
> I imagine wikidata will or maybe already does propagate changes to a
> channel on that server but I can imagine IRC would not be a good
> method for many Instant data repo users. Some will not be able to
> sustain a single TCP connection for extended periods, some will not be
> able to use IRC ports at all, and some may go offline periodically.
> e.g. a server on a laptop. AIUI, PubSubHubbub has none of those
> problems and is better than the current IRC solution in just about
> every way.
> We could potentially even replace the current cross-DB job queue
> insert craziness with PubSubHubbub for use on the cluster internally.
> Wikidata-l mailing list
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Research Group: http://aksw.org