Recently I spoke with Wikia, and being able to subscribe to the recent changes feed is a very important feature to them. Apparently, polling the API's recent changes puts much more stress on the system than subscribing does.
Now, we don't need to publish all of the data from the start, just the fact that certain items have changed; the items themselves can later be requested by the usual means. It would be good, though, to implement this system for all of the API, not just Wikidata.

On Fri, Apr 26, 2013 at 3:13 AM, Dimitris Kontokostas <jimkont@gmail.com> wrote:
Dear Jeremy, all,
In addition to what Sebastian said, in DBpedia Live we use the OAI-PMH protocol to get update feeds for English, German & Dutch Wikipedia.
This OAI-PMH implementation [1] is very convenient for what we need (and I guess for most people) because it uses the latest modification date for update publishing. So when we ask for updates after time X, it returns a list of articles with a modification date after X, no matter how many times they were edited in between.
This is very easy for you to support (no need for extra hardware, just an extra table / index) and best suited for most use cases. What most people need in the end is to know which pages changed since time X. Fine-grained details are for special types of clients.
Best,
Dimitris

On Fri, Apr 26, 2013 at 9:40 AM, Sebastian Hellmann <hellmann@informatik.uni-leipzig.de> wrote:
Dear Jeremy,
please read the email from Daniel Kinzler on this list from 26.03.2013 18:26:
* A dispatcher needs about 3 seconds to dispatch 1000 changes to a client wiki.
* Considering we have ~300 client wikis, this means one dispatcher can handle
about 4000 changes per hour.
* We currently have two dispatchers running in parallel (on a single box, hume),
that makes a capacity of 8000 changes/hour.
* We are seeing roughly 17000 changes per hour on wikidata.org - more than twice
our dispatch capacity.
* I want to try running 6 dispatcher processes; that would give us the capacity
to handle 24000 changes per hour (assuming linear scaling).
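The figures quoted above can be checked with a short calculation. This is only a sketch: the function name and its parameters are made up for illustration, and it assumes linear scaling across dispatcher processes, as the quoted message does.

```python
import math

def dispatch_capacity(seconds_per_batch, batch_size, client_wikis, dispatchers=1):
    """Changes per hour that can be delivered to every client wiki.

    One batch of `batch_size` changes costs `seconds_per_batch` per client
    wiki, so a full round over all wikis costs seconds_per_batch * client_wikis.
    """
    seconds_per_round = seconds_per_batch * client_wikis
    rounds_per_hour = 3600 / seconds_per_round
    return dispatchers * rounds_per_hour * batch_size

# Numbers taken from the quoted message.
one_dispatcher = dispatch_capacity(3, 1000, 300)
two_dispatchers = dispatch_capacity(3, 1000, 300, dispatchers=2)
six_dispatchers = dispatch_capacity(3, 1000, 300, dispatchers=6)

# Smallest number of dispatchers that keeps up with the observed load.
observed_load = 17000
needed = math.ceil(observed_load / one_dispatcher)

print(one_dispatcher, two_dispatchers, six_dispatchers, needed)
```

Under these assumptions, five dispatchers would already cover 17000 changes per hour, so six leaves some headroom.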
1. Somebody needs to run the Hub, and it needs to scale. It looks like the protocol was intended to save some traffic, not to dispatch a massive number of messages per day to a large number of clients. Again, I am not familiar with how efficient PubSubHubbub is. What kind of hardware is needed to run this effectively? Do you have experience with this?
2. Somebody will still need to run and maintain the Hub and feed all clients. I was offering to host one of the hubs for DBpedia users, but I am not sure whether we have that capacity.
So should we use the IRC RC feed plus an HTTP request to the changed page as a fallback?
Sebastian
On 26.04.2013 08:06, Jeremy Baron wrote:
Hi,
On Fri, Apr 26, 2013 at 5:29 AM, Sebastian Hellmann
<hellmann@informatik.uni-leipzig.de> wrote:
> Well, PubSubHubbub is a nice idea. However it clearly depends on two factors:
> 1. whether Wikidata sets up such an infrastructure (I need to check whether we have capacities, I am not sure atm)
Capacity for what? The infrastructure should not be a problem.
(famous last words, can look more closely tomorrow. but I'm really not
worried about it) And you don't need any infrastructure at all for
development; just use one of google's public instances.
> 2. whether performance is good enough to handle high-volume publishers
Again, how do you mean?
> Basically, polling to recent changes [1] and then do a http request to the individual pages should be fine for a start. So I guess this is what we will implement, if there aren't any better suggestions.
> The whole issue is problematic and the DBpedia project would be happy, if this were discussed and decided right now, so we can plan development.
> What is the best practice to get updates from Wikipedia at the moment?
I believe just about everyone uses the IRC feed from irc.wikimedia.org.
https://meta.wikimedia.org/wiki/IRC/Channels#Raw_feeds
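For comparison with the IRC feed, the polling approach quoted above (ask the API for recent changes, then fetch each changed page) could be sketched roughly like this. The query parameters are the ones I understand the MediaWiki `list=recentchanges` module to accept; the choice of wiki, the function names and the sample response are assumptions for illustration. Continuation of long result lists and the actual HTTP requests are left out.

```python
from urllib.parse import urlencode

# Assumption: English Wikipedia; any MediaWiki api.php endpoint would do.
API_URL = "https://en.wikipedia.org/w/api.php"

def recent_changes_url(since, limit=500, api_url=API_URL):
    """Build a URL asking for changes made at or after `since` (ISO 8601)."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcstart": since,       # with rcdir=newer this is the oldest change wanted
        "rcdir": "newer",       # enumerate oldest first
        "rcprop": "title|timestamp",
        "rclimit": str(limit),
        "format": "json",
    }
    return api_url + "?" + urlencode(params)

def changed_titles(response):
    """Collapse a parsed response to one entry per page: its latest timestamp."""
    latest = {}
    for change in response["query"]["recentchanges"]:
        title, timestamp = change["title"], change["timestamp"]
        # ISO 8601 UTC timestamps of equal length sort correctly as strings.
        if title not in latest or timestamp > latest[title]:
            latest[title] = timestamp
    return latest

# Hand-written sample in the shape the API returns, not real data.
sample = {"query": {"recentchanges": [
    {"title": "Leipzig", "timestamp": "2013-04-26T08:00:00Z"},
    {"title": "Athens", "timestamp": "2013-04-26T08:05:00Z"},
    {"title": "Leipzig", "timestamp": "2013-04-26T08:10:00Z"},
]}}

print(recent_changes_url("2013-04-26T00:00:00Z", limit=10))
print(changed_titles(sample))
```

Each title in the result would then need one further HTTP request to fetch the current page, which is where the extra load from polling comes from.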
I imagine wikidata will or maybe already does propagate changes to a
channel on that server but I can imagine IRC would not be a good
method for many Instant data repo users. Some will not be able to
sustain a single TCP connection for extended periods, some will not be
able to use IRC ports at all, and some may go offline periodically.
e.g. a server on a laptop. AIUI, PubSubHubbub has none of those
problems and is better than the current IRC solution in just about
every way.
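To make the contrast with a long-lived IRC connection concrete, here is a minimal sketch of the subscriber side of PubSubHubbub: one form-encoded subscribe request to the hub, and one reply to the hub's verification request. After that the hub delivers updates by POSTing to the callback, so the client holds no open connection. The `hub.*` field names follow the PubSubHubbub spec as I understand it; the URLs and function names are hypothetical, and nothing is sent over the network here.

```python
from urllib.parse import urlencode, parse_qs

def subscription_body(topic_url, callback_url, lease_seconds=None):
    """Form-encoded body for a subscribe request POSTed to a hub."""
    params = {
        "hub.mode": "subscribe",
        "hub.topic": topic_url,
        "hub.callback": callback_url,
    }
    if lease_seconds is not None:
        params["hub.lease_seconds"] = str(lease_seconds)
    return urlencode(params)

def verification_reply(query_string, expected_topic):
    """Return the challenge to echo back to the hub, or None to refuse."""
    query = parse_qs(query_string)
    if query.get("hub.topic") != [expected_topic]:
        return None  # not a topic we asked for
    challenge = query.get("hub.challenge")
    return challenge[0] if challenge else None

# Hypothetical URLs, for illustration only.
TOPIC = "https://example.org/feeds/recent-changes"
CALLBACK = "https://subscriber.example.net/push"

body = subscription_body(TOPIC, CALLBACK, lease_seconds=86400)
print(body)
```

A client that goes offline for a while only has to renew its subscription when it comes back; whether missed updates are retried is up to the hub.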
We could potentially even replace the current cross-DB job queue
insert craziness with PubSubHubbub for use on the cluster internally.
-Jeremy
_______________________________________________
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l
--
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Projects: http://nlp2rdf.org , http://linguistics.okfn.org , http://dbpedia.org/Wiktionary , http://dbpedia.org
Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
Research Group: http://aksw.org
Kontokostas Dimitris