Dear Jeremy, all,
In addition to what Sebastian said, in DBpedia Live we use the OAI-PMH
protocol to get update feeds for the English, German & Dutch Wikipedia.
This OAI-PMH implementation is very convenient for what we need (and
I guess for most people) because it uses the latest modification date for
each article. So when we ask for updates after time X it returns a list of
articles with a modification date after X, no matter how many times they
were edited in between. This is very easy for you to support (no need for
extra hardware, just an extra table / index) and best suited for most use cases.
What most people need in the end is to know which pages changed since
time X. Fine-grained details are for special types of clients.
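For illustration, a minimal sketch of that polling pattern; the base URL and the canned response below are placeholders, not the real Wikipedia OAI endpoint:

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def build_update_request(base_url, since):
    """OAI-PMH ListRecords request for everything modified after
    `since` (an ISO-8601 UTC timestamp)."""
    return base_url + "?" + urlencode({
        "verb": "ListRecords",
        "metadataPrefix": "oai_dc",  # metadata format; wikis may offer others
        "from": since,
    })

def changed_identifiers(xml_text):
    """Identifiers from a ListRecords response: each article appears
    once with its latest datestamp, however often it was edited."""
    root = ET.fromstring(xml_text)
    return [h.findtext(OAI_NS + "identifier")
            for h in root.iter(OAI_NS + "header")]

# Canned response for illustration (not real data).
SAMPLE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords><record><header>
    <identifier>oai:wiki:Page1</identifier>
    <datestamp>2013-04-26T09:00:00Z</datestamp>
  </header></record></ListRecords>
</OAI-PMH>"""
```

The client only has to remember the timestamp of its last successful poll and pass it as `from` on the next request.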
On Fri, Apr 26, 2013 at 9:40 AM, Sebastian Hellmann wrote:
Please read the email from Daniel Kinzler on this list from 26.03.2013 18:26:
> * A dispatcher needs about 3 seconds to dispatch 1000 changes to a
> client wiki.
> * Considering we have ~300 client wikis, this means one dispatcher can
> handle about 4000 changes per hour.
> * We currently have two dispatchers running in parallel (on a single
> box, hume), that makes a capacity of 8000 changes/hour.
> * We are seeing roughly 17000 changes per hour on wikidata.org; more
> than twice our dispatch capacity.
> * I want to try running 6 dispatcher processes; that would give us the
> capacity to handle 24000 changes per hour (assuming linear scaling).
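Those figures are consistent with each other; as a quick sanity check (all inputs taken from the quoted mail, not measured independently):

```python
seconds_per_1000 = 3       # seconds to push 1000 changes to ONE client wiki
client_wikis = 300         # ~number of client wikis

# Pushing a batch of 1000 changes to every client wiki:
seconds_per_batch = seconds_per_1000 * client_wikis   # 900 s per 1000 changes
per_dispatcher = 1000 * 3600 // seconds_per_batch     # 4000 changes/hour
two_dispatchers = 2 * per_dispatcher                  # 8000 changes/hour
six_dispatchers = 6 * per_dispatcher                  # 24000 changes/hour
```

So with ~17000 changes/hour arriving, even six dispatchers only barely keep up if scaling is linear.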
1. Somebody needs to run the Hub and it needs to scale. It looks like the
protocol was intended to save some traffic, not to dispatch a massive
amount of messages per day to a large number of clients. Again, I am not
familiar with how efficient PubSubHubbub is. What kind of hardware is needed
to run this effectively? Do you have experience with this?
2. Somebody will still need to run and maintain the Hub and feed all
clients. I was offering to host one of the hubs for DBpedia users, but I am
not sure whether we have that capacity.
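For context, the subscriber side at least is cheap; a sketch of the subscription request a client would send to a hub, following the PubSubHubbub 0.3 spec (all URLs here are placeholders):

```python
from urllib.parse import urlencode
from urllib.request import Request

def subscribe_request(hub_url, topic_url, callback_url):
    """Build the POST a subscriber sends to a PubSubHubbub hub.

    After this, the hub verifies the callback URL and then pushes
    updates to it -- the client needs a reachable HTTP endpoint,
    not a standing TCP/IRC connection."""
    body = urlencode({
        "hub.mode": "subscribe",
        "hub.topic": topic_url,        # the feed to follow
        "hub.callback": callback_url,  # where the hub pushes updates
        "hub.verify": "async",         # hub confirms out of band
    }).encode()
    return Request(hub_url, data=body, headers={
        "Content-Type": "application/x-www-form-urlencoded"})
```

The open question above is the hub side: how much hardware it takes to fan out ~17000 changes/hour to many such callbacks.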
So should we use the IRC RC feed + an HTTP request to the changed page as a fallback?
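That fallback can be quite small; a sketch assuming the standard MediaWiki api.php recentchanges list (the endpoint URL and the canned response are illustrative):

```python
import json
from urllib.parse import urlencode

API = "https://www.wikidata.org/w/api.php"  # any MediaWiki api.php works

def recent_changes_url(since):
    """Poll the recentchanges list for edits back to `since`
    (ISO-8601); results run newest-to-oldest and stop at rcend."""
    return API + "?" + urlencode({
        "action": "query",
        "list": "recentchanges",
        "rcprop": "title|timestamp",
        "rclimit": "500",
        "rcend": since,   # oldest timestamp to return
        "format": "json",
    })

def changed_titles(response_text):
    """Deduplicated titles from a recentchanges response; each title
    is then fetched once with an ordinary HTTP request."""
    data = json.loads(response_text)
    seen, titles = set(), []
    for rc in data["query"]["recentchanges"]:
        if rc["title"] not in seen:
            seen.add(rc["title"])
            titles.append(rc["title"])
    return titles

# Canned response for illustration (not real data).
SAMPLE_RESPONSE = """{"query": {"recentchanges": [
    {"title": "Q1", "timestamp": "2013-04-26T09:05:00Z"},
    {"title": "Q2", "timestamp": "2013-04-26T09:02:00Z"},
    {"title": "Q1", "timestamp": "2013-04-26T09:00:00Z"}]}}"""
```

Deduplicating by title gives the same "which pages changed since X" semantics as the OAI-PMH feed, just via polling.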
Am 26.04.2013 08:06, schrieb Jeremy Baron:
> On Fri, Apr 26, 2013 at 5:29 AM, Sebastian Hellmann
>> Well, PubSubHubbub is a nice idea. However it clearly depends on two
>> 1. whether Wikidata sets up such an infrastructure (I need to check
>> whether we have capacities, I am not sure atm)
> Capacity for what? The infrastructure should not be a problem.
> (famous last words, can look more closely tomorrow. but I'm really not
> worried about it) And you don't need any infrastructure at all for
> development; just use one of google's public instances.
>> 2. whether performance is good enough to handle high-volume publishers
> Again, how do you mean?
>> Basically, polling recent changes and then doing an HTTP request
>> to the individual pages should be fine for a start. So I guess this is what
>> we will implement if there aren't any better suggestions.
>> The whole issue is problematic and the DBpedia project would be
>> happy if this were discussed and decided right now, so we can plan
>> accordingly. What is the best practice to get updates from Wikipedia at the moment?
> I believe just about everyone uses the IRC feed from irc.wikimedia.org.
> I imagine wikidata will or maybe already does propagate changes to a
> channel on that server but I can imagine IRC would not be a good
> method for many Instant data repo users. Some will not be able to
> sustain a single TCP connection for extended periods, some will not be
> able to use IRC ports at all, and some may go offline periodically.
> e.g. a server on a laptop. AIUI, PubSubHubbub has none of those
> problems and is better than the current IRC solution in just about
> every way.
> We could potentially even replace the current cross-DB job queue
> insert craziness with PubSubHubbub for use on the cluster internally.
> Wikidata-l mailing list
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Research Group: http://aksw.org