Hi all!
The Wikidata team has been discussing how to best make data from Wikidata
available on local wikis. Fetching the data via HTTP whenever a page is
re-rendered doesn't seem prudent, so we (mainly Jeroen) came up with a
push-based architecture.
The proposal is at
<http://meta.wikimedia.org/wiki/Wikidata/Notes/Caching_investigation#Proposal:_HTTP_push_to_local_db_storage>,
I have copied it below too.
Please have a look and let us know if you think this is viable, and which of the
two variants you deem better!
Thanks,
-- daniel
PS: Please keep the discussion on wikitech-l, so we have it all in one place.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
== Proposal: HTTP push to local db storage ==
* Every time an item on Wikidata is changed, an HTTP push is issued to all
subscribing clients (wikis)
** initially, "subscriptions" are just entries in an array in the
configuration.
** Pushes can be done via the job queue.
** pushing is done via the MediaWiki API, but other protocols such as
PubSubHubbub / AtomPub can easily be added to support 3rd parties.
** pushes need to be authenticated, so we don't get malicious crap. Pushes
should be done using a special user with a special user right.
** the push may contain either the full set of information for the item, or just
a delta (diff) + hash for integrity check (in case an update was missed); a rough
sketch of such a push request follows this list.
* When the client receives a push, it does two things (sketched in code below
this list):
*# write the fresh data into a local database table (the local wikidata cache)
*# invalidate the (parser) cache for all pages that use the respective item (for
now we can assume that we know this from the language links)
*#* if we only update language links, the page doesn't even need to be
re-parsed: we just update the languagelinks in the cached ParserOutput object.
* when a page is rendered, interlanguage links and other info are taken from the
local wikidata cache. No queries are made to Wikidata during parsing/rendering.
* In case an update is missed, we need a mechanism for requesting a full purge
and re-fetch of all data on the client side, rather than just waiting for the
next push, which might very well take a very long time to happen.
** There needs to be a manual option for when someone detects this; maybe
action=purge can be made to do this. Simple cache invalidation, however,
shouldn't pull info from Wikidata.
** A time-to-live could be added to the local copy of the data, so that it is
periodically refreshed by a pull and does not stay stale indefinitely after a
failed push.
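To make the push step more concrete, here is a rough sketch (Python standing in
for the actual PHP job code) of what a push from Wikidata to one subscribing
client could look like. The API action name "wbpushitem", the token handling and
the SHA-1 integrity hash are assumptions for illustration only, not part of the
proposal.

<source lang="python">
"""Hypothetical push job run on the Wikidata side (names are made up)."""
import hashlib
import json
import requests

# Initially, "subscriptions" are just entries in a configuration array.
SUBSCRIBERS = [
    "https://de.client.example/w/api.php",
    "https://fr.client.example/w/api.php",
]

def push_item_change(item_id, delta, full_item):
    """Send one item change to every subscribing client wiki."""
    # Hash of the full item, so the client can detect a missed earlier push.
    integrity_hash = hashlib.sha1(
        json.dumps(full_item, sort_keys=True).encode("utf-8")
    ).hexdigest()
    for endpoint in SUBSCRIBERS:
        # In production each POST would be one job in the job queue.
        requests.post(endpoint, data={
            "action": "wbpushitem",      # hypothetical API module on the client
            "format": "json",
            "item": item_id,
            "delta": json.dumps(delta),  # or the full item data instead
            "hash": integrity_hash,
            "token": "TOKEN_OF_PUSH_USER",  # edit token of the special push user
        }, timeout=10)
</source>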
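The receiving side could then look roughly like this: apply the delta, verify the
integrity hash, write the fresh data into the local wikidata cache table, and
invalidate the parser cache for the affected pages. The table name, the helper
stubs and the use of sqlite3 (instead of MediaWiki's database layer) are only
placeholders to keep the sketch self-contained.

<source lang="python">
"""Hypothetical push handler on a client wiki (names are made up)."""
import hashlib
import json
import sqlite3

# Stand-in for the local wikidata cache table on the client wiki.
db = sqlite3.connect("local_wikidata_cache.db")
db.execute("CREATE TABLE IF NOT EXISTS wikidata_cache "
           "(item_id TEXT PRIMARY KEY, data TEXT)")

def fetch_full_item(item_id):
    """Placeholder: pull the complete item from Wikidata after a missed update."""
    return {}

def pages_using_item(item_id):
    """Placeholder: local pages using this item (known via the language links)."""
    return []

def invalidate_parser_cache(page):
    """Placeholder: parser cache invalidation for one page (or just patching the
    langlinks in the cached ParserOutput if nothing else changed)."""
    pass

def handle_push(item_id, delta, expected_hash):
    """Apply a pushed delta to the local cache and invalidate affected pages."""
    row = db.execute("SELECT data FROM wikidata_cache WHERE item_id = ?",
                     (item_id,)).fetchone()
    item = json.loads(row[0]) if row else {}
    item.update(delta)  # apply the diff

    # Integrity check: a mismatch means an earlier push was missed,
    # so fall back to a full re-fetch (pull) from Wikidata.
    actual_hash = hashlib.sha1(
        json.dumps(item, sort_keys=True).encode("utf-8")
    ).hexdigest()
    if actual_hash != expected_hash:
        item = fetch_full_item(item_id)

    # 1. write the fresh data into the local wikidata cache ...
    db.execute("INSERT OR REPLACE INTO wikidata_cache (item_id, data) "
               "VALUES (?, ?)", (item_id, json.dumps(item)))
    db.commit()

    # 2. ... and invalidate the (parser) cache for all pages using the item.
    for page in pages_using_item(item_id):
        invalidate_parser_cache(page)
</source>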
=== Variation: shared database tables ===
Instead of having a local wikidata cache on each wiki (which may grow big - a
first guesstimate by Jeroen and Reedy is up to 1TB total, for all wikis), all
client wikis could access the same central database table(s) managed by the
wikidata wiki.
* this is similar to the way the GlobalUsage extension tracks the usage of
Commons images
* whenever a page is re-rendered, the local wiki would query the table in the
wikidata db (sketched at the end of this list). This means a cross-cluster db
query whenever a page is rendered, instead of a local query.
* the HTTP push mechanism described above would still be needed to purge the
parser cache at the right time. But the push requests would not need to contain
the updated data; they could simply be requests to purge the cache.
* the ability for full HTTP pushes (using the mediawiki API or some other
interface) would still be desirable for 3rd party integration.
* This approach greatly lowers the amount of space used in the database
* it doesn't change the number of HTTP requests made
** it does however reduce the amount of data transferred via HTTP (but not by
much, at least not compared to pushing diffs)
* it doesn't change the number of database requests, but it introduces
cross-cluster requests
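For this shared-table variant, rendering-time access could look roughly like the
sketch below: the client wiki reads directly from a central table in the wikidata
database rather than from a local cache. Table and column names are invented for
illustration; the actual schema is still open.

<source lang="python">
"""Hypothetical render-time lookup against the central wikidata table."""
import json
import sqlite3  # stands in for a (cross-cluster) connection to the wikidata DB

central_db = sqlite3.connect("wikidata_central.db")

def get_language_links(item_id):
    """Fetch interlanguage links for one item straight from the central table."""
    row = central_db.execute(
        "SELECT data FROM wb_items WHERE item_id = ?",  # hypothetical table name
        (item_id,),
    ).fetchone()
    return json.loads(row[0]).get("links", {}) if row else {}
</source>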