Hi all!
The wikidata team has been discussing how to best make data from wikidata available on local wikis. Fetching the data via HTTP whenever a page is re-rendered doesn't seem prudent, so we (mainly Jeroen) came up with a push-based architecture.
The proposal is at http://meta.wikimedia.org/wiki/Wikidata/Notes/Caching_investigation#Proposal:_HTTP_push_to_local_db_storage; I have copied it below too.
Please have a look and let us know if you think this is viable, and which of the two variants you deem better!
Thanks, -- daniel
PS: Please keep the discussion on wikitech-l, so we have it all in one place.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
== Proposal: HTTP push to local db storage ==
* Every time an item on Wikidata is changed, an HTTP push is issued to all subscribing clients (wikis).
** Initially, "subscriptions" are just entries in an array in the configuration.
** Pushes can be done via the job queue.
** Pushing is done via the MediaWiki API, but other protocols such as PubSubHubbub / AtomPub can easily be added to support 3rd parties.
** Pushes need to be authenticated, so we don't get malicious crap. Pushes should be done using a special user with a special user right.
** The push may contain either the full set of information for the item, or just a delta (diff) + hash for integrity check (in case an update was missed).
* When the client receives a push, it does two things:
*# write the fresh data into a local database table (the local wikidata cache)
*# invalidate the (parser) cache for all pages that use the respective item (for now we can assume that we know this from the language links)
*#* if we only update language links, the page doesn't even need to be re-parsed: we just update the languagelinks in the cached ParserOutput object.
* When a page is rendered, interlanguage links and other info are taken from the local wikidata cache. No queries are made to wikidata during parsing/rendering.
* In case an update is missed, we need a mechanism that lets the client request a full purge and re-fetch of all data, rather than just waiting for the next push, which might take a very long time to happen.
** There needs to be a manual option for when someone detects this. Maybe action=purge can be made to do this. Simple cache invalidation, however, shouldn't pull info from wikidata.
** A time-to-live could be added to the local copy of the data so that it is refreshed by a periodic pull and does not stay stale indefinitely after a failed push.
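
To make the flow above a bit more concrete, here is a rough Python sketch (not MediaWiki code) of a delta + hash push, the client-side cache write and parser cache invalidation, and a TTL check. All names in it (`build_push`, `Client`, `receive_push`, `is_stale`, the item ID `q1`) are made up for illustration, and the SHA-1-over-sorted-JSON hash is just one possible choice for the integrity check.

```python
import hashlib
import json
import time


def item_hash(data):
    """Stable hash of an item's data, used as the integrity check (assumed scheme)."""
    blob = json.dumps(data, sort_keys=True).encode("utf-8")
    return hashlib.sha1(blob).hexdigest()


def build_push(item_id, old, new, full=False):
    """Repository side: build either a full push or a delta + hash push."""
    if full:
        return {"item": item_id, "full": new}
    delta = {
        "set": {k: v for k, v in new.items() if old.get(k) != v},
        "unset": [k for k in old if k not in new],
    }
    return {"item": item_id, "delta": delta, "hash": item_hash(new)}


class Client:
    """Hypothetical client wiki with a local wikidata cache."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.cache = {}         # item_id -> (data, fetched_at)
        self.usage = {}         # item_id -> set of page titles using the item
        self.parser_cache = {}  # page title -> rendered output

    def receive_push(self, push, now=None):
        """Apply a push. Returns False if a delta does not match its hash,
        which means an earlier update was missed and a full re-fetch is needed."""
        now = time.time() if now is None else now
        item_id = push["item"]
        if "full" in push:
            data = dict(push["full"])
        else:
            data = dict(self.cache.get(item_id, ({}, 0))[0])
            data.update(push["delta"]["set"])
            for key in push["delta"]["unset"]:
                data.pop(key, None)
            if item_hash(data) != push["hash"]:
                return False
        # 1. write the fresh data into the local cache
        self.cache[item_id] = (data, now)
        # 2. invalidate the parser cache for all pages that use the item
        for title in self.usage.get(item_id, ()):
            self.parser_cache.pop(title, None)
        return True

    def is_stale(self, item_id, now=None):
        """True if the item is missing or older than the time-to-live."""
        now = time.time() if now is None else now
        entry = self.cache.get(item_id)
        return entry is None or now - entry[1] > self.ttl
```

A client that gets `False` back from `receive_push`, or sees `is_stale` return `True`, would then do the pull (full re-fetch) described in the last bullet; how that pull is triggered is left open here, as in the proposal.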
=== Variation: shared database tables ===
Instead of having a local wikidata cache on each wiki (which may grow big - a first guesstimate of Jeroen and Reedy is up to 1TB total, for all wikis), all client wikis could access the same central database table(s) managed by the wikidata wiki.
* This is similar to the way the globalusage extension tracks the usage of commons images.
* Whenever a page is re-rendered, the local wiki would query the table in the wikidata db. This means a cross-cluster db query whenever a page is rendered, instead of a local query.
* The HTTP push mechanism described above would still be needed to purge the parser cache when needed. But the push requests would not need to contain the updated data; they may just be requests to purge the cache.
* The ability for full HTTP pushes (using the MediaWiki API or some other interface) would still be desirable for 3rd party integration.
* This approach greatly lowers the amount of space used in the database.
* It doesn't change the number of HTTP requests made.
** It does however reduce the amount of data transferred via HTTP (but not by much, at least not compared to pushing diffs).
* It doesn't change the number of database requests, but it introduces cross-cluster requests.
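
For comparison, a minimal Python sketch of the shared-table variation, under the assumption that a push carries only the item ID and that each re-render reads from the central table. `SharedRepo`, `SharedClient`, `receive_purge` and `render` are hypothetical names, and the `queries` counter only stands in for counting cross-cluster queries.

```python
class SharedRepo:
    """Stands in for the central table(s) managed by the wikidata wiki."""

    def __init__(self):
        self.items = {}   # item_id -> {language: page title}
        self.queries = 0  # number of (modelled) cross-cluster queries

    def get(self, item_id):
        self.queries += 1
        return self.items.get(item_id, {})


class SharedClient:
    """Hypothetical client wiki that keeps no local copy of the item data."""

    def __init__(self, repo):
        self.repo = repo
        self.usage = {}         # item_id -> set of page titles using the item
        self.parser_cache = {}  # page title -> rendered output

    def receive_purge(self, item_id):
        """Purge-only push: drop cached renderings, transfer no item data."""
        purged = [t for t in self.usage.get(item_id, ()) if t in self.parser_cache]
        for title in purged:
            del self.parser_cache[title]
        return sorted(purged)

    def render(self, title, item_id):
        """Render a page; on a parser cache miss, query the central table."""
        if title not in self.parser_cache:
            links = self.repo.get(item_id)
            self.parser_cache[title] = ", ".join(
                f"{lang}:{name}" for lang, name in sorted(links.items())
            )
        self.usage.setdefault(item_id, set()).add(title)
        return self.parser_cache[title]
```

The trade-off from the list above shows up directly: there is no per-wiki copy of the data to store, but every parser cache miss costs one query against the central table.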