Please read the email from Daniel Kinzler on this list from 26.03.2013 18:26:
* A dispatcher needs about 3 seconds to dispatch 1000
changes to a client wiki.
* Considering we have ~300 client wikis, this means one dispatcher can handle
about 4000 changes per hour.
* We currently have two dispatchers running in parallel (on a single box, hume),
that makes a capacity of 8000 changes/hour.
* We are seeing roughly 17000 changes per hour on wikidata.org - more than
twice our dispatch capacity.
* I want to try running 6 dispatcher processes; that would give us the capacity
to handle 24000 changes per hour (assuming linear scaling).
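The arithmetic behind these figures can be sketched as follows (constants taken from the numbers quoted above; the function name is illustrative):

```python
# Back-of-the-envelope dispatch capacity, using Daniel Kinzler's figures.
SECONDS_PER_1000_CHANGES_PER_WIKI = 3   # one dispatcher, one client wiki
CLIENT_WIKIS = 300
OBSERVED_CHANGES_PER_HOUR = 17000       # current load on wikidata.org

def capacity_per_hour(dispatchers):
    # Time for one dispatcher to push 1000 changes to every client wiki:
    # 3 s * 300 wikis = 900 s, i.e. 1000 changes per 15 minutes.
    seconds_per_1000 = SECONDS_PER_1000_CHANGES_PER_WIKI * CLIENT_WIKIS
    per_dispatcher = 1000 * 3600 / seconds_per_1000  # = 4000 changes/hour
    return dispatchers * per_dispatcher              # assumes linear scaling

print(capacity_per_hour(2))  # 8000.0  -- current setup, below observed load
print(capacity_per_hour(6))  # 24000.0 -- proposed setup
print(capacity_per_hour(2) >= OBSERVED_CHANGES_PER_HOUR)  # False
```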
1. Somebody needs to run the Hub, and it needs to scale. It looks like the
protocol was intended to save some traffic, not to dispatch a massive
number of messages per day to a large number of clients. Again, I am
not familiar with how efficient PubSubHubbub is. What kind of hardware is
needed to run this effectively? Do you have experience with this?
2. Somebody will still need to run and maintain the Hub and feed all
clients. I was offering to host one of the hubs for DBpedia users, but I
am not sure whether we have that capacity.
So should we use the IRC recent-changes feed plus an HTTP request to the changed page as a fallback?
On 26.04.2013 08:06, Jeremy Baron wrote:
On Fri, Apr 26, 2013 at 5:29 AM, Sebastian Hellmann wrote:
Well, PubSubHubbub is a nice idea. However it
clearly depends on two factors:
1. whether Wikidata sets up such an infrastructure (I need to check whether we
have the capacity; I am not sure at the moment)
Capacity for what? The infrastructure should not
be a problem.
(famous last words, can look more closely tomorrow. but I'm really not
worried about it) And you don't need any infrastructure at all for
development; just use one of google's public instances.
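For development against a public hub, subscribing is just a form-encoded POST per the PubSubHubbub spec (the hub then verifies the callback with a GET challenge). A minimal sketch, assuming one of Google's public hub instances; the topic and callback URLs here are placeholders, not real endpoints:

```python
from urllib.parse import urlencode

# One of Google's public PubSubHubbub hub instances (assumed endpoint).
HUB = "https://pubsubhubbub.appspot.com/"

def subscription_body(topic, callback):
    """Form-encoded body for a subscription POST to the hub,
    per the PubSubHubbub specification."""
    return urlencode({
        "hub.mode": "subscribe",
        "hub.topic": topic,        # the feed we want pushed to us
        "hub.callback": callback,  # our endpoint; the hub verifies it via GET
    })

body = subscription_body("https://example.org/updates.atom",
                         "https://example.org/push-endpoint")
print(body)
```

An actual subscriber would POST this body to HUB with Content-Type application/x-www-form-urlencoded and answer the hub's verification challenge.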
2. whether performance is good enough to handle [...]
Again, how do you mean?
Basically, polling recent changes and then
making an HTTP request to the individual pages should be fine for a start. So I guess this is
what we will implement if there aren't any better suggestions.
The whole issue is problematic, and the DBpedia project would be happy if this were
discussed and decided right now, so we can plan development.
What is the best practice to get updates from Wikipedia at the moment?
Just about everyone uses the IRC feed from irc.wikimedia.org.
I imagine Wikidata will, or maybe already does, propagate changes to a
channel on that server, but I can imagine IRC would not be a good
method for many instant data repo users. Some will not be able to
sustain a single TCP connection for extended periods, some will not be
able to use IRC ports at all, and some may go offline periodically,
e.g. a server on a laptop. AIUI, PubSubHubbub has none of those
problems and is better than the current IRC solution in just about
every way.
We could potentially even replace the current cross-DB job queue
insert craziness with PubSubHubbub for use on the cluster internally.
Wikidata-l mailing list
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Research Group: http://aksw.org