I mean, in simple words:
Your idea: when the data on wikidata is changed, the new content is pushed to all local wikis / somewhere
My idea: local wikis retrieve data from the wikidata db directly, so there is no need to push anything on change
On Mon, Apr 23, 2012 at 4:07 PM, Petr Bena benapetr@gmail.com wrote:
I think it would be much better if the local wikis that are supposed to access this had some sort of client extension which would allow them to render the content using the wikidata db. That would be much simpler and faster.
On Mon, Apr 23, 2012 at 2:45 PM, Daniel Kinzler daniel@brightbyte.de wrote:
Hi all!
The wikidata team has been discussing how to best make data from wikidata available on local wikis. Fetching the data via HTTP whenever a page is re-rendered doesn't seem prudent, so we (mainly Jeroen) came up with a push-based architecture.
The proposal is at http://meta.wikimedia.org/wiki/Wikidata/Notes/Caching_investigation#Proposal:_HTTP_push_to_local_db_storage, I have copied it below too.
Please have a look and let us know if you think this is viable, and which of the two variants you deem better!
Thanks, -- daniel
PS: Please keep the discussion on wikitech-l, so we have it all in one place.
== Proposal: HTTP push to local db storage ==

* Every time an item on Wikidata is changed, an HTTP push is issued to all subscribing clients (wikis)
** initially, "subscriptions" are just entries in an array in the configuration.
** Pushes can be done via the job queue.
** pushing is done via the mediawiki API, but other protocols such as PubSub Hubbub / AtomPub can easily be added to support 3rd parties.
** pushes need to be authenticated, so we don't get malicious crap. Pushes should be done using a special user with a special user right.
** the push may contain either the full set of information for the item, or just a delta (diff) + hash for integrity check (in case an update was missed).
* When the client receives a push, it does two things:
*# write the fresh data into a local database table (the local wikidata cache)
*# invalidate the (parser) cache for all pages that use the respective item (for now we can assume that we know this from the language links)
*#* if we only update language links, the page doesn't even need to be re-parsed: we just update the language links in the cached ParserOutput object.
* when a page is rendered, interlanguage links and other info are taken from the local wikidata cache. No queries are made to wikidata during parsing/rendering.
* In case an update is missed, we need a mechanism on the client side to allow requesting a full purge and re-fetch of all data, rather than just waiting for the next push, which might very well take a very long time to happen.
** There needs to be a manual option for when someone detects this. Maybe action=purge can be made to do this. Simple cache invalidation however shouldn't pull info from wikidata.
** A time-to-live could be added to the local copy of the data so that it's updated by a periodic pull, so the data does not stay stale indefinitely after a failed push.

=== Variation: shared database tables ===

Instead of having a local wikidata cache on each wiki (which may grow big - a first guesstimate of Jeroen and Reedy is up to 1TB total, for all wikis), all client wikis could access the same central database table(s) managed by the wikidata wiki.

* this is similar to the way the globalusage extension tracks the usage of commons images
* whenever a page is re-rendered, the local wiki would query the table in the wikidata db. This means a cross-cluster db query whenever a page is rendered, instead of a local query.
* the HTTP push mechanism described above would still be needed to purge the parser cache when needed. But the push requests would not need to contain the updated data; they may just be requests to purge the cache.
* the ability for full HTTP pushes (using the mediawiki API or some other interface) would still be desirable for 3rd party integration.
* This approach greatly lowers the amount of space used in the database
* it doesn't change the number of http requests made
** it does however reduce the amount of data transferred via http (but not by much, at least not compared to pushing diffs)
* it doesn't change the number of database requests, but it introduces cross-cluster requests
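To make the first variant concrete, here is a minimal sketch of the client-side push handling, written as a standalone Python script rather than a MediaWiki extension. All names in it (the wikidata_cache table, verify_push_token, pages_using_item, the shared-secret token) are hypothetical stand-ins for illustration only; the real implementation would hook into the client wiki's database and parser cache and would authenticate via a special user with a special user right, as described above.

    # Sketch of the client-side handler for a Wikidata push.
    # All table, function and token names are hypothetical.

    import hashlib
    import json
    import sqlite3

    DB = sqlite3.connect("wikidata_client_cache.db")
    DB.execute("""CREATE TABLE IF NOT EXISTS wikidata_cache (
                     item_id TEXT PRIMARY KEY,
                     data    TEXT NOT NULL)""")

    PUSH_TOKEN = "shared-secret"  # stand-in for the special push user / user right


    def verify_push_token(token):
        """Reject pushes that are not authenticated (hypothetical check)."""
        return token == PUSH_TOKEN


    def pages_using_item(item_id):
        """Hypothetical lookup of local pages that use a given Wikidata item."""
        return []


    def invalidate_parser_cache(page):
        """Placeholder for purging the parser cache of one local page."""
        pass


    def apply_push(item_id, payload, token):
        """Handle one push: either the full item data, or a delta plus a hash."""
        if not verify_push_token(token):
            raise PermissionError("unauthenticated push rejected")

        if "full" in payload:
            new_data = payload["full"]
        else:
            # Delta case: merge onto the cached copy, then verify the hash so a
            # missed earlier push is detected and a full re-fetch can be requested.
            row = DB.execute("SELECT data FROM wikidata_cache WHERE item_id = ?",
                             (item_id,)).fetchone()
            current = json.loads(row[0]) if row else {}
            current.update(payload["delta"])
            new_data = current
            digest = hashlib.sha1(
                json.dumps(new_data, sort_keys=True).encode()).hexdigest()
            if digest != payload["hash"]:
                raise ValueError("integrity check failed; request a full re-fetch")

        # 1. write the fresh data into the local wikidata cache table
        DB.execute("REPLACE INTO wikidata_cache (item_id, data) VALUES (?, ?)",
                   (item_id, json.dumps(new_data, sort_keys=True)))
        DB.commit()

        # 2. invalidate the parser cache for all pages that use this item
        for page in pages_using_item(item_id):
            invalidate_parser_cache(page)

Under the shared-database-tables variation, the local wikidata_cache write would disappear: the client would read the central table at render time (a cross-cluster query), and the push would only carry a cache-purge request instead of the data itself.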