Hello all,
I'm planning to write a proposal for the Wikidata-to-DBpedia project in GSoC 2013. On the change propagation page (http://meta.wikimedia.org/wiki/Wikidata/Notes/Change_propagation) I found:
Support for 3rd party clients, that is, client wikis and other consumers
outside of Wikimedia, is currently not essential and will not be implemented for now. It shall however be kept in mind for all design decisions.
I wanted to know two things:
1. What would be the time frame for change propagation to be ready, even as a rough estimate? Could it be ready within two or three months?
2. Is there a design pattern or a brief outline of the change propagation design? That would let me make a rough plan and estimate of how it could be consumed on the DBpedia side.
Thanks and regards,
-------------------------------------------------
Hady El-Sahar
Research Assistant
Center of Informatics Sciences | Nile University (http://nileuniversity.edu.eg/)
email: hadyelsahar@gmail.com | Phone: +2-01220887311 | http://hadyelsahar.me/
On Thu, Apr 25, 2013 at 10:42 PM, Hady elsahar hadyelsahar@gmail.com wrote:
2. Is there a design pattern or a brief outline of the change propagation design? That would let me make a rough plan and estimate of how it could be consumed on the DBpedia side.
I don't know anything about the plan for this, but at first glance it seems like a good place to use [[w:PubSubHubbub]].
-Jeremy
Hello Dimitris,
What do you think of that? Should I write this part of the proposal at an abstract level and wait for more details, or could we follow a plan similar to the one already implemented in DBpedia (http://wiki.dbpedia.org/DBpediaLive#h156-3)?
Thanks and regards,
Well, PubSubHubbub is a nice idea. However, it clearly depends on two factors:
1. whether Wikidata sets up such an infrastructure (I need to check whether we have the capacity; I am not sure at the moment)
2. whether its performance is good enough to handle high-volume publishers
Basically, polling recent changes [1] and then making an HTTP request for each changed page should be fine for a start. I guess this is what we will implement, unless there are better suggestions. The whole issue is problematic, and the DBpedia project would be happy if this were discussed and decided now, so that we can plan development.
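For concreteness, a minimal sketch of what such a polling loop could look like, using the standard MediaWiki API (list=recentchanges) and Special:EntityData. The exact parameters, the entity URL pattern and the polling interval below are assumptions for illustration, not a settled design:

# Sketch only: poll Wikidata's recent changes, then fetch each changed entity.
# Endpoint and parameter names are the usual MediaWiki/Wikibase ones, but
# verify them against the API documentation before relying on this.
import requests

API = "https://www.wikidata.org/w/api.php"

def changed_titles(since):
    """Return titles changed since the given ISO timestamp (newest first)."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcend": since,               # go back no further than 'since'
        "rcprop": "title|timestamp",
        "rclimit": "500",
        "format": "json",
    }
    data = requests.get(API, params=params).json()
    return [rc["title"] for rc in data["query"]["recentchanges"]]

def fetch_entity(title):
    """Fetch the current JSON of one entity via Special:EntityData."""
    url = "https://www.wikidata.org/wiki/Special:EntityData/%s.json" % title
    return requests.get(url).json()

# A consumer would remember the newest timestamp it has seen, call
# changed_titles() every minute or so, and feed each fetched entity
# into its own extraction code.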
What is the current best practice for getting updates from Wikipedia? We are still using OAI-PMH...
In DBpedia, we use a simple self-created protocol: http://wiki.dbpedia.org/DBpediaLive#h156-4
/Publication of changesets/: Upon modification, old triples are replaced with updated triples. The added and/or deleted triples are also written out as N-Triples files and then compressed. Any client application or DBpedia-Live mirror can download those files and integrate them, thereby updating its local copy of DBpedia. This keeps that application always in synchronization with DBpedia-Live.
This could also work for Wikidata facts, right?
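On the consumer side, applying one such changeset pair is essentially a set difference followed by a set union. A minimal sketch, assuming the published .removed.nt.gz / .added.nt.gz naming convention (the file names below are placeholders):

import gzip

def load_ntriples(path):
    """Read a gzipped N-Triples file into a set of triple lines."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return {line.strip() for line in f
                if line.strip() and not line.startswith("#")}

def apply_changeset(store, removed_path, added_path):
    """store is a set of N-Triples lines representing the local copy."""
    store -= load_ntriples(removed_path)   # drop the outdated triples first
    store |= load_ntriples(added_path)     # then insert the new ones
    return store

# A mirror stays in sync by applying the changesets strictly in order, e.g.
# store = apply_changeset(store, "000123.removed.nt.gz", "000123.added.nt.gz")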
Other useful links:
- http://www.openarchives.org/rs/0.5/resourcesync
- http://www.sdshare.org/
- http://www.w3.org/community/sdshare/
- http://www.rabbitmq.com/
All the best, Sebastian
[1] https://www.wikidata.org/w/index.php?title=Special:RecentChanges&feed=at...
Hi,
On Fri, Apr 26, 2013 at 5:29 AM, Sebastian Hellmann hellmann@informatik.uni-leipzig.de wrote:
Well, PubSubHubbub is a nice idea. However it clearly depends on two factors:
- whether Wikidata sets up such an infrastructure (I need to check whether we have capacities, I am not sure atm)
Capacity for what? The infrastructure should not be a problem (famous last words; I can look more closely tomorrow, but I'm really not worried about it). And you don't need any infrastructure at all for development; just use one of Google's public instances.
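For reference, subscribing in PubSubHubbub is just one HTTP POST to the hub with the standard hub.* parameters; in the sketch below the topic and callback URLs are hypothetical placeholders:

import requests

HUB = "https://pubsubhubbub.appspot.com/"               # a public Google hub
TOPIC = "https://www.wikidata.org/recentchanges.atom"    # hypothetical feed URL
CALLBACK = "https://example.org/wikidata-updates"        # your own HTTP endpoint

resp = requests.post(HUB, data={
    "hub.mode": "subscribe",
    "hub.topic": TOPIC,
    "hub.callback": CALLBACK,
    "hub.verify": "async",
})
# 202 Accepted means the hub will verify the callback (a GET carrying a
# hub.challenge value that the callback must echo back) and then start
# pushing new feed entries to it as they appear.
print(resp.status_code)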
- whether performance is good enough to handle high-volume publishers
Again, how do you mean?
Basically, polling to recent changes [1] and then do a http request to the individual pages should be fine for a start. So I guess this is what we will implement, if there aren't any better suggestions. The whole issue is problematic and the DBpedia project would be happy, if this were discussed and decided right now, so we can plan development.
What is the best practice to get updates from Wikipedia at the moment?
I believe just about everyone uses the IRC feed from irc.wikimedia.org. https://meta.wikimedia.org/wiki/IRC/Channels#Raw_feeds
I imagine Wikidata will, or maybe already does, propagate changes to a channel on that server, but I can imagine IRC would not be a good method for many instant-data-repo users. Some will not be able to sustain a single TCP connection for extended periods, some will not be able to use IRC ports at all, and some may go offline periodically (e.g. a server on a laptop). As I understand it, PubSubHubbub has none of those problems and is better than the current IRC solution in just about every way.
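For completeness, reading that raw IRC feed only takes a few lines. In the sketch below the channel name and the colour-code stripping are assumptions; the feed is plain PRIVMSG lines decorated with mIRC colour codes:

import re
import socket

def follow_rc(channel="#wikidata.wikipedia", nick="rc-listener-demo"):
    """Yield one colour-stripped recent-changes message per edit."""
    sock = socket.create_connection(("irc.wikimedia.org", 6667))
    sock.sendall(("NICK %s\r\nUSER %s 0 * :%s\r\nJOIN %s\r\n"
                  % (nick, nick, nick, channel)).encode())
    colours = re.compile(r"\x03\d{0,2}(,\d{1,2})?|[\x02\x0f\x16\x1d\x1f]")
    buf = b""
    while True:
        buf += sock.recv(4096)
        *lines, buf = buf.split(b"\r\n")
        for raw in lines:
            line = raw.decode("utf-8", "replace")
            if line.startswith("PING"):
                sock.sendall(("PONG" + line[4:] + "\r\n").encode())
            elif "PRIVMSG" in line:
                yield colours.sub("", line.split("PRIVMSG", 1)[1])

# for msg in follow_rc():
#     print(msg)   # page title, flags, diff URL, user and comment in one line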
We could potentially even replace the current cross-DB job queue insert craziness with PubSubHubbub for use on the cluster internally.
-Jeremy
Dear Jeremy, please read the email from Daniel Kinzler on this list from 26.03.2013 18:26:
- A dispatcher needs about 3 seconds to dispatch 1000 changes to a client wiki.
- Considering we have ~300 client wikis, this means one dispatcher can handle about 4000 changes per hour.
- We currently have two dispatchers running in parallel (on a single box, hume); that makes a capacity of 8000 changes/hour.
- We are seeing roughly 17000 changes per hour on wikidata.org - more than twice our dispatch capacity.
- I want to try running 6 dispatcher processes; that would give us the capacity to handle 24000 changes per hour (assuming linear scaling).
1. Somebody needs to run the hub, and it needs to scale. It looks like the protocol was intended to save some traffic, not to dispatch a massive number of messages per day to a large number of clients. Again, I am not familiar with how efficient PubSubHubbub is. What kind of hardware is needed to run this effectively? Do you have experience with it?
2. Somebody will still need to run and maintain the hub and feed all clients. I was offering to host one of the hubs for DBpedia users, but I am not sure whether we have that capacity.
So should we fall back to the IRC recent-changes feed plus an HTTP request to each changed page?
Sebastian
Dear Jeremy, all,
In addition to what Sebastian said, in DBpedia Live we use the OAI-PMH protocol to get update feeds for the English, German and Dutch Wikipedia. This OAI-PMH implementation [1] is very convenient for what we need (and, I guess, for most people) because it uses the latest modification date for update publishing. So when we ask for updates after time X, it returns a list of articles with a modification date after X, no matter how many times they were edited in between.
This is very easy for you to support (no need for extra hardware, just an extra table / index) and best suited for most use cases. What most people need in the end is to know which pages have changed since time X. Fine-grained details are for special types of clients.
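For illustration, asking an OAI-PMH endpoint "what changed since time X" is a single ListIdentifiers request with the standard OAI-PMH parameters. The endpoint URL, the credentials and the metadataPrefix value below are assumptions to check per wiki:

import requests
import xml.etree.ElementTree as ET

OAI = "https://www.wikidata.org/wiki/Special:OAIRepository"
NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

def changed_since(timestamp, auth=None):
    """Yield (identifier, datestamp) for pages modified after the timestamp."""
    params = {
        "verb": "ListIdentifiers",
        "metadataPrefix": "mediawiki",   # assumed prefix; check the endpoint
        "from": timestamp,               # e.g. "2013-04-26T00:00:00Z"
    }
    resp = requests.get(OAI, params=params, auth=auth)
    root = ET.fromstring(resp.content)
    for header in root.iter("{http://www.openarchives.org/OAI/2.0/}header"):
        yield (header.find("oai:identifier", NS).text,
               header.find("oai:datestamp", NS).text)

# for ident, stamp in changed_since("2013-04-26T00:00:00Z",
#                                   auth=("user", "password")):
#     print(ident, stamp)   # one entry per page, latest revision only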
Best, Dimitris
[1] http://www.mediawiki.org/wiki/Extension:OAIRepository
Recently I spoke with Wikia, and being able to subscribe to the recent changes feed is a very important feature for them. Apparently polling the API's recent changes puts much more stress on the system than subscribing.
Now, we don't need (from the start) to implement publishing of all the data - just the fact that certain items have changed; they can later be requested by the usual means. But it would be good to implement this system for all of the API, not just Wikidata.
Third-party propagation is not very high on our priority list - not because it is unimportant, but because there are things that are even more important, like getting it to work for Wikipedia :) And that seems to be stabilizing.
What we have, for now:
* We have the broadcast of all edits through IRC.
* One could poll recent changes, but with 200-450 edits per minute, this might get problematic.
* We do have the OAIRepository extension installed on Wikidata. Did anyone try that?
Besides that, we are currently moving all our dispatches to Redis, which has built-in support for publish/subscribe, so we will probably have some PubSubHubbub-style support at some point. I cannot make promises with regard to the timeline, though. It is still being implemented and needs to be fully tested and deployed, and even after that it might still have some rough edges. So it *could* be there in two to three months, but I cannot promise that.
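To make that concrete, here is a tiny sketch of Redis's native publish/subscribe primitive (via redis-py), which is presumably what such a dispatch backend would build on. The channel name and message shape are purely illustrative, and a proper PubSubHubbub hub would still have to sit in front of it for external clients:

import json
import redis

r = redis.StrictRedis(host="localhost", port=6379)

def publish_change(entity_id, revision):
    """Repo side: publish one small notification per edit."""
    r.publish("wikidata.changes",
              json.dumps({"id": entity_id, "rev": revision}))

def consume_changes():
    """Client side: subscribe, then fetch full data lazily by the usual API means."""
    sub = r.pubsub()
    sub.subscribe("wikidata.changes")
    for msg in sub.listen():
        if msg["type"] == "message":
            yield json.loads(msg["data"])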
Are the other three options not sufficient?
Cheers, Denny
Hi Denny
On Fri, Apr 26, 2013 at 5:56 PM, Denny Vrandečić <denny.vrandecic@wikimedia.de> wrote:
* We do have the OAIRepository extension installed on Wikidata. Did anyone try that?
Great! I didn't know that. I see it is installed (http://www.wikidata.org/wiki/Special:OAIRepository), but it is password-protected; can we (DBpedia) request access?
Cheers, Dimitris
On 26.04.2013 17:09, Dimitris Kontokostas wrote:
* We do have the OAIRepository extension installed on Wikidata. Did anyone try that?
Great! Didn't know that. I see it installed (http://www.wikidata.org/wiki/Special:OAIRepository) but it is password protected, can we (DBpedia) request access?
Sure, but you already have access; DBpedia Live uses it. The password for Wikidata is the same as for Wikipedia (I don't remember it...).
You guys are the only reason the interface still exists :) DBpedia is the only (regular) external user (LuceneSearch is the only internal user).
Note that there's nobody really maintaining this interface, so finding an alternative would be great. Or we decide that we (or more precisely, the Wikimedia Foundation - there's not much the Wikidata team can do there) really want to support OAI in the future, and then overhaul the implementation.
-- daniel
Hi Daniel,
Actually, we asked quite often about what to switch to and what would be the best way for us to create a live mirror. We just never received an answer... We did not want to pound the Wikipedia API with 150k requests per day (the number of edits on some days), because we were afraid of getting IP-blocked. Also, there was no official clearance that we may do so.
If you are telling us now that IRC is no good, what other way is there to create a live, in-sync mirror? -- Sebastian
On 26.04.2013 21:13, Sebastian Hellmann wrote:
Actually, we asked quite often about where to change to and what would be the best way for us to create a live mirror. We just never received an answer...
Yeah, that's the issue: the OAI interface is still up, but it's pretty much unsupported. There is no alternative as far as I know. It seems like everyone wants PubSubHubbub for this, but nobody is working on it, as far as I can tell.
I think it's fine for you to keep using OAI for now. Just be aware that once the WMF moves search to Solr, you will be the *only* user of the OAI interface... so keep in touch with the Foundation about it.
-- daniel
On 26.04.2013 16:56, Denny Vrandečić wrote:
What we have, for now:
- We have the broadcast of all edits through IRC.
This interface is quite unreliable: the output can't be parsed unambiguously and may get truncated. I did implement notifications via XMPP several years ago, but it never went beyond a proof of concept. Have a look at the XMLRC extension if you are interested.
- One could poll recent changes, but with 200-450 edits per minute, this might get problematic.
Well, polling isn't really the problem; fetching all the content is. And you'd need to do that no matter how you get the information about what has changed.
- We do have the OAIRepository extension installed on Wikidata. Did anyone try that?
In principle that is a decent update interface, but I'd recommend not using OAI before we have implemented feature 47714 ("Support RDF and API serializations of entity data via OAI-PMH"). Right now, what you'd get from there would be our *internal* JSON representation, which is different from what the API returns and may change at any time without notice.
-- daniel
Hi Daniel,
On Fri, Apr 26, 2013 at 6:15 PM, Daniel Kinzler <daniel.kinzler@wikimedia.de> wrote:
- We do have the OAIRepository extension installed on Wikidata. Did anyone try that?
In principle that is a decent update interface, but I'd recommend not using OAI before we have implemented feature 47714 ("Support RDF and API serializations of entity data via OAI-PMH"). Right now, what you'd get from there would be our *internal* JSON representation, which is different from what the API returns and may change at any time without notice.
What we do right now in DBpedia Live is keep a local clone of Wikipedia that gets synced using the OAIRepository extension. This lets us use our local copy however we please.
The local copy also publishes updates with OAI-PMH, which we use to get the list of modified page ids. Once we have the page ids, we use the normal MediaWiki API to fetch the actual page content. So feature 47714 should not be a problem in our case, since we don't need the data serialized directly from OAI-PMH.
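For reference, that second step (page ids in, current content out) is a single query to the standard MediaWiki API; a sketch, with illustrative page ids:

import requests

API = "https://www.wikidata.org/w/api.php"

def fetch_pages(page_ids):
    """Yield (title, raw content) for each of the given page ids."""
    params = {
        "action": "query",
        "pageids": "|".join(str(i) for i in page_ids),
        "prop": "revisions",
        "rvprop": "content|timestamp",
        "format": "json",
    }
    data = requests.get(API, params=params).json()
    for page in data["query"]["pages"].values():
        yield page["title"], page["revisions"][0]["*"]

# for title, text in fetch_pages([138, 42, 1427]):   # ids are illustrative
#     pass  # feed the raw content into the extraction framework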
Cheers, Dimitris
On 26.04.2013 17:31, Dimitris Kontokostas wrote:
What we do right now in DBpedia Live is keep a local clone of Wikipedia that gets synced using the OAIRepository extension. This lets us use our local copy however we please.
It would be awesome if this Just Worked (tm) for Wikidata too, but I highly doubt it. You can use the OAI interface to get (unstable) data from Wikidata, but I don't think a magic import from OAI will work. Generally, importing Wikidata entities into another wiki is problematic because of entity IDs and uniqueness constraints. If the target wiki is perfectly in sync, it might work...
Are you going to try this? Would be great if you could give us feedback!
-- daniel
On 26 April 2013 17:15, Daniel Kinzler daniel.kinzler@wikimedia.de wrote:
In principle that is a decent update interface, but I'd recommend not to use OAI before we have implemented feature 47714 ("Support RDF and API serializations of entity data via OAI-MPH"). Right now, what you'd get from there would be our *internal* JSON representation, which is different from what the API returns, and may change at any time without notice.
Somewhat off-topic: I didn't know you have different JSON representations. I'm curious and I'd be happy about a few quick answers...
- How many are there? Just the two, internal and external?
- Which JSON representations do the API and the XML dump provide? Will they do so in the future?
- Are the API and XML dump representations stable? Or should we expect some changes?
JC
On 04.05.2013 12:05, Jona Christopher Sahnwaldt wrote:
On 26 April 2013 17:15, Daniel Kinzler daniel.kinzler@wikimedia.de wrote:
*internal* JSON representation, which is different from what the API returns, and may change at any time without notice.
Somewhat off-topic: I didn't know you have different JSON representations. I'm curious and I'd be happy about a few quick answers...
- How many are there? Just the two, internal and external?
Yes, these two.
- Which JSON representations do the API and the XML dump provide? Will they do so in the future?
The XML dump provides the internal representation (since it's a dump of the raw page content). The API uses the external representation.
This is pretty much dictated by the nature of the dumps and the API, so it will stay that way. However, we plan to add more types of dumps, including:
* a plain JSON dump (using the external representation)
* an RDF/XML dump
It's not yet certain when, or even if, we'll provide these, but we are considering it.
- Are the API and XML dump representations stable? Or should we expect some changes?
The internal representation is unstable and subject to changes without notice. In fact, it may even change to something other than JSON. I don't think it's even documented anywhere outside the source code.
The external representation is pretty stable, though not final yet. We will definitely make additions to it, and some (hopefully minor) structural changes may be necessary. We'll try to stay largely backwards compatible, but can't promise full stability yet.
Also, the external representation uses the API framework for generating the actual JSON, and may be subject to changes imposed by that framework.
Unfortunately, this means that there are currently no dumps with a reliable representation of our data. You need to a) use the API or b) use the unstable internal JSON or c) wait for "real" data dumps.
-- daniel
On 4 May 2013 17:12, Daniel Kinzler daniel.kinzler@wikimedia.de wrote:
Unfortunately, this means that there are currently no dumps with a reliable representation of our data. You need to a) use the API or b) use the unstable internal JSON or c) wait for "real" data dumps.
Thanks for the clarification. Not the best news, but not terribly bad either.
We will produce a DBpedia release pretty soon; I don't think we can wait for the "real" dumps. The inter-language links are an important part of DBpedia, so we have to extract data from almost all Wikidata items. I don't think it's sensible to make ~10 million API calls to download the external JSON format, so we will have to use the XML dumps and thus the internal format. But I think it's not a big deal that it is not that stable: we parse the JSON into an AST anyway. It just means that we will have to use a more abstract AST, which I was planning to do anyway. As long as the semantics of the internal format remain more or less the same - it will contain the labels, the language links, the properties, etc. - it's no big deal if the syntax changes, even if it's not JSON anymore.
Christopher
On 04.05.2013 19:13, Jona Christopher Sahnwaldt wrote:
We will produce a DBpedia release pretty soon, I don't think we can wait for the "real" dumps. The inter-language links are an important part of DBpedia, so we have to extract data from almost all Wikidata items. I don't think it's sensible to make ~10 million calls to the API to download the external JSON format, so we will have to use the XML dumps and thus the internal format.
Oh, if it's just the language links, this isn't an issue: there's an additional table for them in the database, and we'll soon be providing a separate dump of that table at http://dumps.wikimedia.org/wikidatawiki/
If it's not there when you need it, just ask us for a dump of the sitelinks table (technically, wb_items_per_site), and we'll get you one.
But I think it's not a big deal that it's not that stable: we parse the JSON into an AST anyway. It just means that we will have to use a more abstract AST, which I was planning to do anyway. As long as the semantics of the internal format will remain more or less the same - it will contain the labels, the language links, the properties, etc. - it's no big deal if the syntax changes, even if it's not JSON anymore.
Yes, if you want the labels and properties in addition to the links, you'll have to do that for now. But I'm working on the "real" data dumps.
-- daniel