On Mon, Apr 28, 2014 at 7:08 AM, Faidon Liambotis <faidon@wikimedia.org> wrote:
> If the content is more-or-less static (can be invalidated by either a
> TTL or explicit purges on content changes) and isolated, caching at the
> HTTP layer should be preferred.

Agreed. Of the requests we make, filerepoinfo and users essentially never change; imageusage and globalusage we can treat as static, since we don't care about small inaccuracies there. The problematic one is imageinfo.
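
For the ones we can treat as static, caching at the HTTP layer would mostly just mean adding maxage/smaxage to the requests, something like this (rough, untested sketch; the one-day value is only an example):

    // Sketch: ask api.php (and the Varnish layer in front of it) to cache the
    // response by setting maxage/smaxage, which control the Cache-Control
    // headers it emits. The one-day TTL here is only an example value.
    const API = 'https://commons.wikimedia.org/w/api.php';

    async function fetchFileRepoInfo(): Promise<unknown> {
      const params = new URLSearchParams({
        action: 'query',
        meta: 'filerepoinfo',
        format: 'json',
        maxage: '86400',   // client-side caching
        smaxage: '86400',  // shared (Varnish) caching
      });
      const response = await fetch(`${API}?${params}`);
      return response.json();
    }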

Part of imageinfo is parsed from templates on the file description page, assuming those templates add the right markup to annotate the data they contain. We want to get communities to make their local templates behave like the ones on Commons, so they can also be parsed; this is important both for MediaViewer and for eventually moving image metadata into Wikibase. That means editors will need to tweak a lot of templates and verify that the data is parsed correctly; if there is a one-day caching period between the tweaking and the verification, that would kill all such efforts.
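
To make it concrete, the data in question comes from a request along these lines; extmetadata is the part that gets filled in by parsing the description page templates (rough sketch):

    // Sketch: extmetadata in the imageinfo response is produced by parsing
    // the templates on the file description page, so it is the part editors
    // would need to verify after tweaking the markup.
    async function fetchImageInfo(title: string): Promise<unknown> {
      const params = new URLSearchParams({
        action: 'query',
        prop: 'imageinfo',
        iiprop: 'extmetadata|url|size',
        titles: title,       // e.g. 'File:Example.jpg'
        format: 'json',
      });
      const response = await fetch(
        `https://commons.wikimedia.org/w/api.php?${params}`
      );
      return response.json();
    }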

I guess if either server load or round-trip latency becomes a big issue, we could write some sort of separate gadget that editors could use to verify the API results, while MediaViewer itself uses caching, but that should be a last resort.
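
(The gadget could be as simple as making the same query with caching switched off; a hypothetical sketch:)

    // Sketch of the verification gadget idea: same query as MediaViewer, but
    // with maxage/smaxage set to zero so the response is not served from the
    // HTTP cache. The gadget itself is hypothetical.
    async function fetchUncachedImageInfo(title: string): Promise<unknown> {
      const params = new URLSearchParams({
        action: 'query',
        prop: 'imageinfo',
        iiprop: 'extmetadata',
        titles: title,
        format: 'json',
        maxage: '0',
        smaxage: '0',
      });
      const response = await fetch(
        `https://commons.wikimedia.org/w/api.php?${params}`
      );
      return response.json();
    }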

As for explicit purges, that seems to be a nasty business for API queries. Varnish supports ID-based invalidation, but its docs warn [1] that it does not scale well. The more scalable tag-based invalidation (hashtwo) is in the proprietary part of Varnish. URL-based purges would require reconstructing the exact same URL the client requested, including parameter ordering, page name encoding flavor, the maxage parameter, etc.; not hard to do, but a pain to maintain. Even worse, for Commons images an edit or reupload would mean purging API URLs on hundreds of wikis. So I don't think explicit invalidation is doable.
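
Just to illustrate the URL reconstruction part: a purge only hits the cached object if we rebuild byte-for-byte what the client sent, so both the client and the purging code would have to share one canonicalization, roughly like this (names and rules made up):

    // Sketch: for URL-based purges, every client request and every purge
    // would have to go through the same canonicalization, or the purge URL
    // will not match the cached object. The rules below are invented.
    function canonicalApiUrl(apiBase: string, params: Record<string, string>): string {
      const pairs: string[][] = Object.entries(params).map(([key, value]) =>
        // pick one title-encoding flavor: spaces become underscores
        key === 'titles' ? [key, value.replace(/ /g, '_')] : [key, value]
      );
      // pick one parameter ordering: alphabetical by key
      pairs.sort((a, b) => a[0].localeCompare(b[0]));
      return `${apiBase}?${new URLSearchParams(pairs)}`;
    }

    // And a Commons edit or reupload would still mean issuing a purge for
    // this URL against the API endpoint of every wiki that might have
    // requested the image.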


[1] https://www.varnish-software.com/blog/advanced-cache-invalidation-strategies