Hi all!
tl;dr: How to best handle the situation of an old parser cache entry not containing all the info expected by a newly deployed version of code?
We are currently working to improve our usage of the parser cache for Wikibase/Wikidata. E.g., we are attaching additional information related to language links to the ParserOutput, so we can use it in the skin when generating the sidebar.
However, when we change what gets stored in the parser cache, we still need to deal with old cache entries that do not yet have the desired information attached. Here are a few options we have if the expected info isn't in the cached ParserOutput:
1) ...then generate it on the fly. On every page view, until the parser cache is purged. This seems bad, especially if generating the required info means hitting the database.
2) ...then invalidate the parser cache for this page, and then a) just live with this request missing a bit of output, or b) generate it on the fly, or c) trigger a self-redirect.
3) ...then generate it, attach it to the ParserOutput, and push the updated ParserOutput object back into the cache. This seems nice, but I'm not sure how to do that.
4) ...then force a full re-rendering and re-caching of the page, then continue. I'm not sure how to do this cleanly.
So, the simplest solution seems to be 2, but it means that we potentially invalidate the parser cache of *every* page on the wiki (though we will not hit the long tail of rarely viewed pages immediately). It effectively means that any such change requires all pages to be re-rendered eventually. Is that acceptable?
Solution 3 seems nice and surgical, just injecting the new info into the cached object. Is there a nice and clean way to *update* a parser cache entry like that, without re-generating it in full? Do you see any issues with this approach? Is it worth the trouble?
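To make 3 concrete, here is a rough sketch of what I have in mind (the hook handler and the 'wikibase-languagelinks' / generateLanguageLinkData() names are made up, and the big open question is whether the save really ends up under the same cache key the stale entry was loaded from):

    // Hypothetical handler for the OutputPageParserOutput hook.
    public static function onOutputPageParserOutput( OutputPage $out, ParserOutput $pout ) {
        if ( $pout->getExtensionData( 'wikibase-languagelinks' ) !== null ) {
            return true; // the cached entry already has the new info
        }

        // Generate the missing info (may hit the database).
        $links = self::generateLanguageLinkData( $out->getTitle() );
        $pout->setExtensionData( 'wikibase-languagelinks', $links );

        // Push the amended object back into the parser cache. This only
        // helps if $popts selects the same key the entry was stored under.
        $page = WikiPage::factory( $out->getTitle() );
        $popts = $page->makeParserOptions( $out->getContext() );
        ParserCache::singleton()->save( $pout, $page, $popts );

        return true;
    }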
Any input would be great!
Thanks, daniel
On Tue, Sep 9, 2014 at 12:03 PM, Daniel Kinzler daniel@brightbyte.de wrote:
[...]
3) ...then generate it, attach it to the ParserOutput, and push the updated ParserOutput object back into the cache. This seems nice, but I'm not sure how to do that.
https://gerrit.wikimedia.org/r/#/c/158879/ is my attempt to update the ParserOutput cache entry, though it seems too simplistic a solution.
Any feedback on this would be great, or suggestions on how to do this better, or maybe it's a crazy idea. :P
Cheers, Katie
Also, option 5 could be to continue without the data until the parser cache is invalidated on its own. Maybe option 6 could be to continue without the data and invalidate the cache and completely re-render only some of the time. Like 5% of the time for the first couple of hours, then 25% of the time for a day, then 100% of the time after that. It'd guarantee that the cache is good after a certain amount of time without causing a big spike right after deploys. All those options are less good than just updating the cache, I think.
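Something like this, maybe (completely untested sketch, thresholds and the $deployTime parameter made up):

    // Ramped invalidation: decide per request whether a stale entry
    // should be invalidated, based on how long ago the change shipped.
    function shouldInvalidateStaleEntry( $deployTime ) {
        $age = time() - $deployTime;
        if ( $age < 2 * 3600 ) {
            $chance = 0.05;  // first couple of hours: 5%
        } elseif ( $age < 26 * 3600 ) {
            $chance = 0.25;  // the following day: 25%
        } else {
            $chance = 1.0;   // after that: always
        }
        return ( mt_rand() / mt_getrandmax() ) < $chance;
    }

    // On a stale hit: if this returns true, call $title->invalidateCache()
    // (bumps page_touched so the next view re-renders); otherwise just
    // serve the page without the extra data for now.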
Nik
On 09.09.2014 13:45, Nikolas Everett wrote:
All those options are less good than just updating the cache, I think.
Indeed. And that *sounds* simple enough. The issue is that we have to be sure to update the correct cache key, the exact one the OutputPage object in question was loaded from. Otherwise, we'll be updating the wrong key, and will read the incomplete object again, and try to update again, and again, on every page view.
Sadly, the mechanism for determining the parser cache key is quite complicated and rather opaque. The approach Katie tries in I1a11b200f0c looks fine at a glance, but even if I can verify that it works as expected on my machine, I have no idea how it will behave on the stranger wikis on the live cluster.
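To illustrate the trap (sketch only, with $page and $context standing in for whatever the request is actually rendering):

    $parserCache = ParserCache::singleton();
    $popts = $page->makeParserOptions( $context );

    // The key consulted on page view; it depends on which options were
    // recorded as "used" when the entry was originally stored.
    $key = $parserCache->getKey( $page, $popts );

    // If we later save() with ParserOptions that differ in any used
    // option, the write goes to a *different* key, the stale entry under
    // $key survives, and we are back to updating on every page view.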
Any ideas who could help with that?
-- daniel
On Tue, Sep 9, 2014 at 8:00 AM, Daniel Kinzler daniel@brightbyte.de wrote:
Any ideas who could help with that?
No, not really. My only experience with the parser cache was accidentally polluting it with broken pages one time.
I suppose one option is to be defensive about reusing the key: if you record the key that was actually used to fetch from the parser cache, and you had a cache hit, then you know that a put against that same key will at least be overwriting _something_.
Another thing - I believe uncached calls to the parser are wrapped in pool counter acquisitions to make sure no two processes spend duplicate effort. You may want to acquire that to make sure anything you do that is heavy doesn't get done twice.
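Something along these lines, maybe (the 'ArticleView' pool type and the key are just guesses; normal page views go through PoolWorkArticleView, and $page/$popts/$parserCache are whatever the handler already has):

    $work = new PoolCounterWorkViaCallback(
        'ArticleView',
        $parserCache->getKey( $page, $popts ),
        array(
            'doWork' => function () use ( $page, $popts ) {
                // regenerate (or amend) the ParserOutput and re-save it here
                return true;
            },
            'error' => function ( $status ) {
                return false; // lock not acquired; skip the work this request
            },
        )
    );
    $work->execute();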
Once you start talking about that it might just be simpler to invalidate the whole entry.....
Another option: Kick off some kind of cache invalidation job that _slowly_ invalidates the appropriate parts of the cache. Something like how the varnish cache is invalidated on template change. That gives you marginally more control than randomized invalidation.
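Very roughly, and completely untested (all names made up; the job class would need to be registered in $wgJobClasses, and the job runner's throttling is what keeps it slow):

    class StaleParserCacheInvalidationJob extends Job {
        public function __construct( Title $title, array $params ) {
            parent::__construct( 'staleParserCacheInvalidation', $title, $params );
        }

        public function run() {
            // Invalidate a small batch of pages, then re-queue ourselves
            // with the new offset until the whole wiki has been covered.
            $dbr = wfGetDB( DB_SLAVE );
            $res = $dbr->select(
                'page', array( 'page_id' ),
                array( 'page_id > ' . intval( $this->params['startId'] ) ),
                __METHOD__,
                array( 'LIMIT' => 100, 'ORDER BY' => 'page_id' )
            );

            $lastId = 0;
            foreach ( $res as $row ) {
                $title = Title::newFromID( $row->page_id );
                if ( $title ) {
                    $title->invalidateCache(); // bumps page_touched
                }
                $lastId = (int)$row->page_id;
            }

            if ( $lastId ) {
                JobQueueGroup::singleton()->push(
                    new self( $this->title, array( 'startId' => $lastId ) )
                );
            }
            return true;
        }
    }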
Nik
On 09/09/14 22:00, Daniel Kinzler wrote:
Sadly, the mechanism for determining the parser cache key is quite complicated and rather opaque.
It's only as complicated as it has to be, to support the desired features:
* Options which change the parser output.
* Merging of parser output objects when a given option does not affect the output for a given input text, even though it may affect the output for other inputs.
The approach Katie tries in I1a11b200f0c looks fine at a glance, but even if I can verify that it works as expected on my machine, I have no idea how it will behave on the stranger wikis on the live cluster.
It will probably work. It assumes that the parser output for the current article will always be added to the OutputPage before the SidebarBeforeOutput hook is called. If that assumption was violated on some popular URL, then it could waste quite a lot of CPU time.
It also assumes that the context page is always the same as the page which was parsed and added to the OutputPage -- another assumption which could have nasty consequences if it is violated.
I think it is fine to just invalidate all pages on the wiki. This can be done by deploying the parser change, then progressively increasing $wgCacheEpoch over the course of a week or two, until it is higher than the change deployment time. If you increase $wgCacheEpoch too fast, then you will get an overload.
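For example, assuming the change went out around 2014-09-09 (dates purely illustrative):

    // Step $wgCacheEpoch forward in LocalSettings.php over a week or two;
    // entries cached before the epoch are treated as expired and get
    // re-rendered on their next view rather than all at once.
    $wgCacheEpoch = '20140901000000';  // step 1: expire everything older than Sep 1
    // ...a few days later:
    $wgCacheEpoch = '20140905000000';  // step 2
    // ...and finally, once load allows, move past the deployment time:
    $wgCacheEpoch = '20140909120000';  // all pre-change entries now expired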
-- Tim Starling