Hi all!
The wikidata team has been discussing how to best make data from wikidata available on local wikis. Fetching the data via HTTP whenever a page is re-rendered doesn't seem prudent, so we (mainly Jeroen) came up with a push-based architecture.
The proposal is at http://meta.wikimedia.org/wiki/Wikidata/Notes/Caching_investigation#Proposal:_HTTP_push_to_local_db_storage, I have copied it below too.
Please have a look and let us know if you think this is viable, and which of the two variants you deem better!
Thanks, -- daniel
PS: Please keep the discussion on wikitech-l, so we have it all in one place.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
== Proposal: HTTP push to local db storage ==
* Every time an item on Wikidata is changed, an HTTP push is issued to all subscribing clients (wikis)
** initially, "subscriptions" are just entries in an array in the configuration.
** Pushes can be done via the job queue.
** pushing is done via the mediawiki API, but other protocols such as PubSub Hubbub / AtomPub can easily be added to support 3rd parties.
** pushes need to be authenticated, so we don't get malicious crap. Pushes should be done using a special user with a special user right.
** the push may contain either the full set of information for the item, or just a delta (diff) + hash for integrity check (in case an update was missed).
* When the client receives a push, it does two things:
*# write the fresh data into a local database table (the local wikidata cache)
*# invalidate the (parser) cache for all pages that use the respective item (for now we can assume that we know this from the language links)
*#* if we only update language links, the page doesn't even need to be re-parsed: we just update the languagelinks in the cached ParserOutput object.
* when a page is rendered, interlanguage links and other info are taken from the local wikidata cache. No queries are made to wikidata during parsing/rendering.
* In case an update is missed, we need a mechanism for requesting a full purge and re-fetch of all data from the client side, rather than just waiting for the next push, which might very well take a very long time to happen.
** There needs to be a manual option for when someone detects this. Maybe action=purge can be made to do this. Simple cache invalidation however shouldn't pull info from wikidata.
** A time-to-live could be added to the local copy of the data so that it's updated by a periodic pull, so the data does not stay stale indefinitely after a failed push.
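To make the flow concrete, here is a rough Python sketch of a full-content push and of the client-side handling described above. The API module name (wbpushitem) and the table names (wikidata_cache, wikidata_usage) are made up for illustration; only page.page_touched is existing MediaWiki schema.

import hashlib
import json
import time

def build_push_payload(item_id, data):
    # Full-content push; a diff variant would send a delta plus the same hash
    # so the client can detect that an earlier push was missed.
    blob = json.dumps(data, sort_keys=True)
    return {
        "action": "wbpushitem",   # hypothetical API module on the client wiki
        "item": item_id,
        "content": blob,
        "hash": hashlib.sha1(blob.encode()).hexdigest(),
        "timestamp": int(time.time()),
    }

def handle_push(db, payload):
    # Client side, step 1: verify integrity, then write the fresh data into
    # the local wikidata cache table.
    if hashlib.sha1(payload["content"].encode()).hexdigest() != payload["hash"]:
        raise ValueError("integrity check failed, request a full re-fetch")
    db.execute(
        "REPLACE INTO wikidata_cache (item_id, content, updated) VALUES (?, ?, ?)",
        (payload["item"], payload["content"], payload["timestamp"]),
    )
    # Step 2: invalidate the parser cache for every local page that uses this
    # item, tracked in a usage table much like templatelinks/globalusage.
    db.execute(
        "UPDATE page SET page_touched = ? WHERE page_id IN "
        "(SELECT page_id FROM wikidata_usage WHERE item_id = ?)",
        (payload["timestamp"], payload["item"]),
    )
    db.commit()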
=== Variation: shared database tables ===
Instead of having a local wikidata cache on each wiki (which may grow big - a first guesstimate of Jeroen and Reedy is up to 1TB total, for all wikis), all client wikis could access the same central database table(s) managed by the wikidata wiki.
* this is similar to the way the globalusage extension tracks the usage of commons images
* whenever a page is re-rendered, the local wiki would query the table in the wikidata db. This means a cross-cluster db query whenever a page is rendered, instead of a local query.
* the HTTP push mechanism described above would still be needed to purge the parser cache when needed. But the push requests would not need to contain the updated data, they may just be requests to purge the cache.
* the ability for full HTTP pushes (using the mediawiki API or some other interface) would still be desirable for 3rd party integration.
* This approach greatly lowers the amount of space used in the database
* it doesn't change the number of http requests made
** it does however reduce the amount of data transferred via http (but not by much, at least not compared to pushing diffs)
* it doesn't change the number of database requests, but it introduces cross-cluster requests
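For contrast, a minimal sketch of the shared-table variation: the client reads the central wikidata table at render time (a cross-cluster query), and the push only purges. The table and column names are again assumptions.

import json

def get_item_from_central_db(central_db, item_id):
    # Cross-cluster read: no local copy of the item data is kept on the client.
    row = central_db.execute(
        "SELECT content FROM wikidata_items WHERE item_id = ?", (item_id,)
    ).fetchone()
    return None if row is None else json.loads(row[0])

def handle_purge_push(local_db, item_id, now):
    # The push carries no data in this variant; it only invalidates the
    # parser cache of the pages that use the item.
    local_db.execute(
        "UPDATE page SET page_touched = ? WHERE page_id IN "
        "(SELECT page_id FROM wikidata_usage WHERE item_id = ?)",
        (now, item_id),
    )
    local_db.commit()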
I think it would be much better if the local wikis that are supposed to access this data had some sort of client extension which would allow them to render the content using the db of wikidata. That would be much simpler and faster.
I mean, in simple words:
Your idea: when the data on wikidata is changed the new content is pushed to all local wikis / somewhere
My idea: local wikis retrieve data from wikidata db directly, no need to push anything on change
On 23.04.2012 16:09, Petr Bena wrote:
I mean, in simple words:
Your idea: when the data on wikidata is changed the new content is pushed to all local wikis / somewhere
My idea: local wikis retrieve data from wikidata db directly, no need to push anything on change
Well, the local wiki still needs to notice that something changed on wikidata. That would require some sort of push, even if that push is just a purge. So this would mean pushing *and* pulling, which makes things more complex instead of simpler. Or am I missing something?
Alternatively, one could poll for changes regularly. That's a ton of overhead, though: the majority of pages will need to be kept in sync with wikidata, because they have at least their languagelinks there.
-- daniel
Hoi, One of the KEY reasons to have Wikidata is that it DOES update when there is a change in the data. For instance, how many Wikipedias have an article on my home town of Almere and have it say that Mrs Jorritsma is the mayor ... She will not be mayor forever ... There are many villages, towns and cities like Almere.
I positively do not like the idea of all the wasted effort when a pushy Wikidata can be and should be the solution. Thanks, Gerard
On Mon, Apr 23, 2012 at 10:07 AM, Petr Bena benapetr@gmail.com wrote:
I think it would be much better if the local wikis that are supposed to access this data had some sort of client extension which would allow them to render the content using the db of wikidata. That would be much simpler and faster.
I agree with Petr here. I think doing it like we do FileRepo stuff would make the most sense--have an abstract base that can either connect via DB and skip those HTTP requests (for in- cluster usage) or via the API (3rd-party sites).
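A minimal sketch (in Python rather than PHP, with invented class and table names) of the FileRepo-style split described above: one interface, a direct-DB backend for in-cluster wikis, and an API backend for third parties.

from abc import ABC, abstractmethod
import json
import urllib.request

class WikidataRepoBase(ABC):
    @abstractmethod
    def get_item(self, item_id):
        """Return the item data as a dict, or None if it does not exist."""

class DatabaseRepo(WikidataRepoBase):
    # In-cluster access: read the wikidata tables directly, skipping HTTP.
    def __init__(self, db):
        self.db = db
    def get_item(self, item_id):
        row = self.db.execute(
            "SELECT content FROM wikidata_items WHERE item_id = ?", (item_id,)
        ).fetchone()
        return None if row is None else json.loads(row[0])

class ApiRepo(WikidataRepoBase):
    # Third-party access: fetch the item over a web API (hypothetical query string).
    def __init__(self, api_url):
        self.api_url = api_url
    def get_item(self, item_id):
        url = "%s?action=wbgetitem&format=json&item=%s" % (self.api_url, item_id)
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)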
-Chad
On 23/04/12 14:45, Daniel Kinzler wrote:
*#* if we only update language links, the page doesn't even need to be re-parsed: we just update the languagelinks in the cached ParserOutput object.
It's not that simple; for instance, there may be several ParserOutputs for the same page. On the bright side, you probably don't need it. I'd expect that if interwikis are handled through wikidata, they are completely replaced through a hook, so there's no need to touch the ParserOutput objects.
*# invalidate the (parser) cache for all pages that use the respective item (for now we can assume that we know this from the language links)
And in such a case, you don't need to invalidate the parser cache. You only need to if factual data was embedded into the page.
I think a save/purge shall always fetch the data. We can't store the copy in the parsed object. What we can do is fetch it from a local cache or directly from the origin one.
You mention the cache for the push model, but I think it deserves a clearer separation.
=== Variation: shared database tables === (...)
- This approach greatly lowers the amount of space used in the database
- it doesn't change the number of http requests made
** it does however reduce the amount of data transferred via http (but not by much, at least not compared to pushing diffs)
- it doesn't change the number of database requests, but it introduces cross-cluster requests
You'd probably also want multiple dbs (let's call them WikiData repositories), partitioned by content (and its update frequency). You could then use different frontends (as Chad says, "similar to FileRepo"). So, a WikiData repository with the atom properties of each element would happily live in a dba file. Interwikis would have to be on a MySQL db, etc.
On 23.04.2012 17:28, Platonides wrote:
On 23/04/12 14:45, Daniel Kinzler wrote:
*#* if we only update language links, the page doesn't even need to be re-parsed: we just update the languagelinks in the cached ParserOutput object.
It's not that simple; for instance, there may be several ParserOutputs for the same page. On the bright side, you probably don't need it. I'd expect that if interwikis are handled through wikidata, they are completely replaced through a hook, so there's no need to touch the ParserOutput objects.
I would go that way if we were just talking about languagelinks. But we have to provide for phase II (infoboxes) and III (automated lists) too. Since we'll have to re-parse in most cases anyway (and parsing pages without infoboxes tends to be cheaper anyway), I see no benefit in spending time on inventing a way to bypass parsing. It's tempting, granted, but it seems a distraction atm.
*# invalidate the (parser) cache for all pages that use the respective item (for now we can assume that we know this from the language links)
And in such a case, you don't need to invalidate the parser cache. You only need to if factual data was embedded into the page.
Which will be a very frequent case in the next phase: most infoboxes will (at some point) work like that.
I think a save/purge shall always fetch the data. We can't store the copy in the parsed object.
well, for languagelinks, we already do, and will probably keep doing it. Other data, which will be used in the page content, shouldn't be stored in the parser output. The parser should take them from some cache.
What we can do is fetch it from a local cache or directly from the origin one.
Indeed. Local or remote, DB directly or HTTP... we can have FileRepo-like plugins for that, sure. But:
The real question is how purging and updating will work. Pushing? Polling? Purge-and-pull?
You mention the cache for the push model, but I think it deserves a clearer separation.
Can you explain what you have in mind?
You'd probably also want multiple dbs (let's call them WikiData repositories), partitioned by content (and its update frequency). You could then use different frontends (as Chad says, "similar to FileRepo"). So, a WikiData repository with the atom properties of each element would happily live in a dba file. Interwikis would have to be on a MySQL db, etc.
This is what I was aiming at with the DataTransclusion extension a while back.
But currently, we are not building a tool for including arbitrary data sources in wikipedia. We are building a central database for maintaining factual information. Our main objective is to get that done.
A design that is flexible enough to easily allow for future inclusion of other data sources would be nice. As long as the abstraction doesn't get in the way.
Anyway, it seems that it boils down to this:
1) The client needs some (abstracted?) way to access the repository/repositories
2) The repo needs to be able to notify the client sites about changes, be it via push, purge, or polling.
3) We'll need a local cache or cross-site database access.
So, which combination of these techniques would you prefer?
-- daniel
On 23/04/12 18:42, Daniel Kinzler wrote:
On 23.04.2012 17:28, Platonides wrote:
On 23/04/12 14:45, Daniel Kinzler wrote:
*#* if we only update language links, the page doesn't even need to be re-parsed: we just update the languagelinks in the cached ParserOutput object.
It's not that simple; for instance, there may be several ParserOutputs for the same page. On the bright side, you probably don't need it. I'd expect that if interwikis are handled through wikidata, they are completely replaced through a hook, so there's no need to touch the ParserOutput objects.
I would go that way if we were just talking about languagelinks. But we have to provide for phase II (infoboxes) and III (automated lists) too. Since we'll have to re-parse in most cases anyway (and parsing pages without infoboxes tends to be cheaper anyway), I see no benefit in spending time on inventing a way to bypass parsing. It's tempting, granted, but it seems a distraction atm.
Sure, but in those cases you need to reparse the full page. No need for tricks modifying the ParserOutput. :) So if you want to skip the reparsing for interwikis, fine, but just use a hook.
I think a save/purge shall always fetch the data. We can't store the copy in the parsed object.
well, for languagelinks, we already do, and will probably keep doing it. Other data, which will be used in the page content, shouldn't be stored in the parser output. The parser should take them from some cache.
The ParserOutput is a parsed representation of the wikitext. The cached wikidata interwikis shouldn't be stored there (or at least, not only there, in case it saved the interwikis as they were on last full-render).
What we can do is fetch it from a local cache or directly from the origin one.
Indeed. Local or remote, DB directly or HTTP... we can have FileRepo-like plugins for that, sure. But:
The real question is how purging and updating will work. Pushing? Polling? Purge-and-pull?
You mention the cache for the push model, but I think it deserves a clearer separation.
Can you explain what you have in mind?
I mean, they are based on the same concept. What really matters is how things reach the db. I'd have the WikiData db replicated to {{places}}. For WMF, all wikis could connect directly to the main instance, have a slave "assigned" to each cluster... Then on each page render, the variables used could be checked against the latest version (unless they were checked in the last x minutes), triggering a rerender if different.
So, suppose a page uses the fact Germany{capital:"Berlin";language:"German"}; it would store that along with the version of WikiData used (e.g. Wikidata 2.0, Germany 488584364).
When going to show it, it would check:
1) Is the latest WikiData version newer than 2.0? (No-> go to 5)
2) Is the Germany module newer than 488584364? (No-> Store that it's up to date to WikiData 3, go to 5)
3) Fetch Germany data. If the used data hasn't changed, update the metadata. Go to 5.
4) Re-render the page.
5) Show contents.
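In rough Python, that check could look like this sketch (the page/repo interfaces and attribute names are invented for illustration):

def maybe_rerender(page, repo):
    # 1) Is the latest WikiData version newer than the one used at render time?
    latest = repo.latest_version()
    if latest <= page.rendered_wikidata_version:
        return page.cached_html                      # 5) show contents
    # 2) Is the item (e.g. Germany) newer than the revision we rendered with?
    item_rev = repo.item_revision(page.item_id)
    if item_rev <= page.rendered_item_revision:
        page.rendered_wikidata_version = latest      # remember it's up to date
        return page.cached_html
    # 3) Fetch the item data; if the facts actually used are unchanged,
    #    only update the stored metadata.
    data = repo.get_item(page.item_id)
    if all(data.get(k) == v for k, v in page.used_facts.items()):
        page.rendered_item_revision = item_rev
        page.rendered_wikidata_version = latest
        return page.cached_html
    # 4) Re-render the page, then 5) show the contents.
    page.cached_html = page.render(data)
    return page.cached_html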
As for actively purging the pages' content, that's interesting only for anons. You'd need a script able to replicate a purge for a range of WikiData changes. That'd basically perform the above steps, but do the render through the job queue. A normal wiki would call those functions while replicating, but wikis with a shared db (or which drop in full files with newer data) would run it standalone (plus as a utility for screw-ups).
You'd probably also want multiple dbs (let's call them WikiData repositories), partitioned by content (and its update frequency). You could then use different frontends (as Chad says, "similar to FileRepo"). So, a WikiData repository with the atom properties of each element would happily live in a dba file. Interwikis would have to be on a MySQL db, etc.
This is what I was aiming at with the DataTransclusion extension a while back.
But currently, we are not building a tool for including arbitrary data sources in wikipedia. We are building a central database for maintaining factual information. Our main objective is to get that done.
Not arbitrary, but having different sources (repositories), even if they are under the control of the same entity. Mostly interesting for separating slow- and fast-changing data, although I'm sure reusers would find more use cases, such as downloading only the db for the section they care about.
A design that is flexible enough to easily allow for future inclusion of other data sources would be nice. As long as the abstraction doesn't get in the way.
Anyway, it seems that it boils down to this:
- The client needs some (abstracted?) way to access the repository/repositories
- The repo needs to be able to notify the client sites about changes, be it via push, purge, or polling.
- We'll need a local cache or cross-site database access.
So, which combination of these techniques would you prefer?
-- daniel
I'd use a pull-based model. That seems to be what fits better with the current MediaWiki model. But it isn't too relevant at this time (or you may have advanced a lot by now!).
Thanks for the input Platonides.
I'll have to re-read your comments to fully understand them. For now, just a quick question:
When going to show it, it would check:
1) Is the latest WikiData version newer than 2.0? (No-> go to 5)
2) Is the Germany module newer than 488584364? (No-> Store that it's up to date to WikiData 3, go to 5)
3) Fetch Germany data. If the used data hasn't changed, update the metadata. Go to 5.
4) Re-render the page.
5) Show contents.
You think making a db query to check if the data is up to date, every time the page is *viewed*, is feasible? I would have thought this prohibitively expensive... it would be nice and simple, of course.
The approach of marking the rendered page data as stale (using page_touched) whenever the data changes seems much more efficient. Though it does introduce some additional complexity.
Also, checking on every page view is out of the question for external sites, right? So we'd still need a push interface for these...
-- daniel
On 23/04/12 19:34, Daniel Kinzler wrote:
You think making a db query to check if the data is up to date, every time the page is *viewed*, is feasible? I would have thought this prohibitively expensive... it would be nice and simple, of course.
The approach of marking the rendered page data as stale (using page_touched) whenever the data changes seems much more efficient. Though it does introduce some additional complexity.
Viewed by a logged-in user, i.e. the same case in which we check page_touched. Also note we are checking against a local cache. The way it's updated is unspecified :)
Also, checking on every page view is out of the question for external sites, right? So we'd still need a push interface for these...
I think they'd use a cache with a configured ttl. So they wouldn't actually be fetching it on each view, only every X hours.
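Something as small as this sketch (with an arbitrary default TTL) would already cover that case for external reusers:

import time

def get_item_with_ttl(cache, item_id, fetch_from_wikidata, ttl_seconds=6 * 3600):
    # cache maps item_id -> (data, fetched_at); fetch_from_wikidata is e.g.
    # an API call to the repo. Only pull when the local copy is too old.
    entry = cache.get(item_id)
    if entry is not None and time.time() - entry[1] < ttl_seconds:
        return entry[0]
    data = fetch_from_wikidata(item_id)
    cache[item_id] = (data, time.time())
    return data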
On 04/23/2012 02:45 PM, Daniel Kinzler wrote:
- In case an update is missed, we need a mechanism for requesting a full purge and re-fetch of all data from the client side, rather than just waiting for the next push, which might very well take a very long time to happen.
Once the data set becomes large and the change rate drops, this would be a very expensive way to catch up. You could use sequence numbers for changes to allow clients to detect missed changes and selectively retrieve all changes since the last contact.
In general, best-effort push / change notifications with bounded waiting for slow clients combined with an efficient way to catch up should be more reliable than push only. You don't really want to do a lot of buffering for slow clients while pushing.
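A sketch of the sequence-number catch-up (the state layout and field names are invented): each change carries a monotonically increasing seq, and when a client sees a gap it pulls only the changes it missed instead of purging everything.

def apply_notification(client_state, notification, fetch_changes_since):
    # client_state: {"last_seq": int, "items": {item_id: data}}
    # fetch_changes_since(seq) returns all changes after that sequence number,
    # in order; notification is a single change: {"seq", "item", "data"}.
    last = client_state["last_seq"]
    if notification["seq"] == last + 1:
        changes = [notification]
    else:
        # Gap detected: one or more pushes were missed (or dropped for a slow
        # client), so catch up selectively from the repo.
        changes = fetch_changes_since(last)
    for change in changes:
        client_state["items"][change["item"]] = change["data"]
        client_state["last_seq"] = change["seq"]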
If you are planning to render large, standardized page fragments with little to no input from the wiki page, then it might also become interesting to directly load fragments using JS in the browser or through a proxy with ESI-like capabilities for clients without JS.
Gabriel
Why not just use a queue? We use the job queue for this right now for nearly the same purpose. The job queue isn't amazing, but it works. Maybe someone should replace this with a better system while they are at it?
On 23.04.2012 20:06, Ryan Lane wrote:
Why not just use a queue?
Well yes, a queue... but what exactly would be queued, and what precisely would the consumer do? Where's the queue? On the repo or on the client?
-- daniel