Zero is rapidly growing, but its architecture is falling behind, and needs to be revised.
*== Zero Partner Requirements ==*
* Support both smart phones (full JavaScript support) and feature phones (limited HTML/no JS support)
* Show carrier-specific, language-specific banner
* Allow carriers to select what features are zero-rated: images and list of languages
* Ask for user confirmation when navigating away from zero (images, non-zero languages, external sites)
*== Technical Requirements ==*
* Increase Varnish cache hits / minimize cache fragmentation
* Set up and configure new partners without code changes
* Use partner-supplied IP ranges as a preferred alternative to the geo-ip database for fundraising & analytics teams
*== Current state ==*
Zero domain requests set X-Subdomain="ZERO" and are treated as mobile. The backend uses the X-Subdomain and X-CS headers to customize the result. The cache is heavily fragmented because it varies on both of these headers, in addition to the variance set by the MobileFrontend extension and MediaWiki core.
*== Proposals ==*
In order to reduce Zero-caused fragmentation, we propose to shrink from one bucket per carrier (X-CS) to three general buckets:
* smart phones bucket -- banner and site modifications are done on the client in JavaScript
* feature phones -- HTML only, the banner is inserted by ESI
** for carriers with free images
** for carriers without free images
*=== Varnish logic ===*
* Parse User-Agent to distinguish between desktop / mobile / feature phone: X-Device-Type=desktop|mobile|legacy
* Use IP -> X-CS lookup (under development by Ops) to convert the client's IP into an X-CS header
* If X-CS && X-Device-Type == 'legacy': Use IP -> X-Images lookup (same lookup plugin, different database file) to determine if the carrier allows images
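As a rough illustration only, the Varnish logic above could be modelled like this in Python. The real implementation would live in VCL plus the IP lookup plugin; the carrier names, IP ranges, User-Agent patterns, bucket names and the "Yes" value for X-Images are all made up for this sketch.

```python
import ipaddress
import re
from typing import Dict, Optional, Tuple

# Hypothetical carrier table: network -> (X-CS value, images zero-rated?).
# Real data would come from the partner-supplied IP database.
CARRIER_RANGES = {
    ipaddress.ip_network("198.51.100.0/24"): ("carrier-a", True),
    ipaddress.ip_network("203.0.113.0/24"): ("carrier-b", False),
}

# Deliberately simplistic User-Agent patterns, for illustration only.
LEGACY_RE = re.compile(r"MIDP|UP\.Browser", re.IGNORECASE)
MOBILE_RE = re.compile(r"Android|iPhone|iPad|Windows Phone", re.IGNORECASE)


def device_type(user_agent: str) -> str:
    """Map a User-Agent string to desktop / mobile / legacy."""
    if LEGACY_RE.search(user_agent):
        return "legacy"
    if MOBILE_RE.search(user_agent):
        return "mobile"
    return "desktop"


def lookup_carrier(client_ip: str) -> Optional[Tuple[str, bool]]:
    """Return (carrier id, images allowed) for a client IP, or None."""
    addr = ipaddress.ip_address(client_ip)
    for network, info in CARRIER_RANGES.items():
        if addr in network:
            return info
    return None


def request_headers(user_agent: str, client_ip: str) -> Dict[str, str]:
    """Build the X-* headers the proposal describes for one request."""
    headers = {"X-Device-Type": device_type(user_agent)}
    match = lookup_carrier(client_ip)
    if match is None:
        return headers  # not a Zero partner IP: no X-CS header
    carrier, images_allowed = match
    headers["X-CS"] = carrier
    # X-Images is only looked up for legacy devices, as in the proposal.
    if headers["X-Device-Type"] == "legacy":
        headers["X-Images"] = "Yes" if images_allowed else "No"
    return headers


def cache_bucket(headers: Dict[str, str]) -> str:
    """Collapse per-carrier variance into the three proposed buckets."""
    if "X-CS" not in headers:
        return "non-zero"
    if headers["X-Device-Type"] == "legacy":
        if headers["X-Images"] == "Yes":
            return "legacy-images"
        return "legacy-no-images"
    if headers["X-Device-Type"] == "mobile":
        return "smart"
    return "unbucketed"  # desktop on a carrier IP is not covered by the proposal
```

The point of the sketch is that the cache would vary on the bucket name rather than on X-CS itself, so adding a carrier only adds a row to the lookup table.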
Since each carrier has its own list of free languages, language links on feature phones will point to origin, which will either silently redirect or ask for confirmation.
*=== ZERO vs M ===*
Even though I think the zero. and m. subdomains should both go the way of the dodo, so that each article has just one canonical location (no more linking & Google issues), this won't happen until we are fully migrated to Varnish and have made some mobile code changes (and possibly other changes that I am not aware of).
At the same time, we should try to get rid of ZERO wherever possible. There are two technical differences between m & zero: zero shows a link to the image instead of the actual image, and a big red zero warning is shown if the carrier is not detected. There is also an organizational difference: some carriers whitelist only zero, some only m, and some both the zero & m subdomains.
This spreadsheet (https://docs.google.com/spreadsheet/ccc?key=0As-T7jJ1slQGdDd3WjhkVmk5SWc5OEdDZVdLYlA2M2c&usp=sharing) shows all the configurations we allow and how we handle them at the moment. Each case requires different handling, so proposals are listed there.
In general, we should manipulate image links in M for carriers who don't allow them, and always redirect ZERO to M unless M is not whitelisted, in which case we should convince the carrier to change their whitelist.
I'm excited to see this all moving forward! I think there are potential boons for regular mobile AND zero experiences in here. Responses inline below.
On Thu, May 30, 2013 at 10:16 AM, Yuri Astrakhan yastrakhan@wikimedia.org wrote:
*== Proposals ==* In order to reduce Zero-caused fragmentation, we propose to shrink from one bucket per carrier (X-CS) to three general buckets:
- smart phones bucket -- banner and site modifications are done on the
client in javascript
For smart devices (more than just phones!), is there any reason you'd need to serve different HTML from what is already being served by MobileFrontend? Note that there would need to be one bucket for HTML with images and another for HTML without (as is currently the case for smartphones accessing MobileFrontend). Are there 'site modifications' that Zero needs to do that are different from MobileFrontend?
- feature phones -- HTML only, the banner is inserted by the ESI
** for carriers with free images
** for carriers without free images
What about including ESI tags for banners for smart devices as well as feature phones, then either use ESI to insert the banner for both device types or, alternatively, for smart devices don't let Varnish populate the ESI chunk and instead use JS to replace the ESI tags with the banner? That way we can still serve the same HTML for smart phones and feature phones with images (one less thing for which to vary the cache).
Are there carrier-specific things that would result in different HTML for devices that do not support JS, or can you get away with providing the same non-JS experience for Zero as MobileFrontend (aside from the banner, presumably handled by ESI)? If not currently, do you think it's feasible to do that (e.g. make carrier-variable links get handled via special pages so we can always rely on the same URIs)? Again, it would be nice if we could just rely on the same HTML to further reduce cache variance. It would be cool if MobileFrontend and Zero shared buckets and they were limited to:
* HTML + images
* HTML - images
* WAP
Out of curiosity, is there WAP support in Zero? I noticed some comments like '# WAP' in the Varnish ACLs for Zero, so I presume so. Is the Zero WAP experience different from the MobileFrontend WAP experience?
Since we improved MobileFrontend to no longer vary the cache on X-Device, I've been surprised to not see a significant increase in our cache hit ratio (which warrants further investigation but that's another email). Are there ways we can do a deeper analysis of the state of the varnish cache to determine just how fragmented it is, why, and how much of a problem it actually is? I believe I've asked this before and was met with a response of 'not really' - but maybe things have changed now, or others on this list have different insight. I think we've mostly approached the issue with a lot more assumption than informed analysis, and if possible I think it would be good to change that.
*=== Varnish logic ===*
- Parse User-Agent to distinguish between desktop / mobile / feature phone:
X-Device-Type=desktop|mobile|legacy
What is 'legacy'? Why would you ever set X-Device-Type to 'desktop'? We already have decent device detection at the Varnish layer for MobileFrontend; however, the devices aren't bucketed quite the way I think you'll need. That should be straightforward to add into or alongside the existing device detection.
...
*=== ZERO vs M ===* ... In general, we should manipulate image links in M for carriers who don't allow them, and always redirect ZERO to M unless M is not whitelisted, in which case convince carrier to change their whitelist.
What do you mean by 'manipulate image links in M'? Do you mean just don't display images (like currently happens) when images are disabled (X-Images: No)?
I don't think this should be a big deal, especially if we can determine the value of X-Images for specific carriers at the Varnish level rather than at the application level, unless you're suggesting that something else needs to happen with image links.
- feature phones -- HTML only, the banner is inserted by the ESI
** for carriers with free images
** for carriers without free images
What about including ESI tags for banners for smart devices as well as feature phones, then either use ESI to insert the banner for both device types or, alternatively, for smart devices don't let Varnish populate the ESI chunk and instead use JS to replace the ESI tags with the banner? That way we can still serve the same HTML for smart phones and feature phones with images (one less thing for which to vary the cache).
I think the jury is still out on whether it's better to use ESI for banners in Varnish or to use JS for that client-side. I guess we'll have to test and see.
Are there carrier-specific things that would result in different HTML for devices that do not support JS, or can you get away with providing the same non-JS experience for Zero as MobileFrontend (aside from the banner, presumably handled by ESI)? If not currently, do you think it's feasible to do that (e.g. make carrier-variable links get handled via special pages so we can always rely on the same URIs)? Again, it would be nice if we could just rely on the same HTML to further reduce cache variance. It would be cool if MobileFrontend and Zero shared buckets and they were limited to:
- HTML + images
- HTML - images
- WAP
That would be nice.
Since we improved MobileFrontend to no longer vary the cache on X-Device, I've been surprised to not see a significant increase in our cache hit ratio (which warrants further investigation but that's another email). Are there ways we can do a deeper analysis of the state of the varnish cache to determine just how fragmented it is, why, and how much of a problem it actually is? I believe I've asked this before and was met with a response of 'not really' - but maybe things have changed now, or others on this list have different insight. I think we've mostly approached the issue with a lot more assumption than informed analysis, and if possible I think it would be good to change that.
Yeah, we should look into that. We've already flagged a few possible culprits, and we're also working on the migration of the desktop wiki cluster from Squid to Varnish, which has some of the same issues with variance (sessions, XVO, cookies, Accept-Language...) as MobileFrontend does. After we've finished migrating that and confirmed that it's working well, we want to unify those clusters' configurations a bit more, and that by itself should give us additional opportunity to compare some strategies there.
We've since also figured out that the way we calculate cache efficiency with Varnish is not exactly ideal: unlike with Squid, cache purges are done as HTTP requests to Varnish. Therefore in Varnish, those cache lookups are counted in the cache hit rate, which isn't very helpful. To make things worse, the few hundred purges a second relative to actual client traffic matter a lot more on the mobile cluster (with much less traffic but a big content set) than they do on our other clusters. So until we can factor that out in the Varnish counters (might be possible in Varnish 4.0), we'll have to look at other metrics.
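To make the purge effect concrete, here is a small Python sketch with made-up numbers. It assumes, purely for illustration, that every purge lookup is counted as a miss; the traffic figures and the 80% client hit ratio are hypothetical, not measurements from our clusters.

```python
def reported_hit_ratio(client_rps: float, client_hit_ratio: float,
                       purge_rps: float) -> float:
    """Hit ratio as the counters would show it if purge lookups count as misses.

    client_rps       -- client requests per second (hypothetical)
    client_hit_ratio -- true hit ratio for client traffic only (hypothetical)
    purge_rps        -- purge requests per second
    """
    total_lookups = client_rps + purge_rps
    if total_lookups <= 0:
        raise ValueError("no lookups to compute a ratio from")
    return (client_rps * client_hit_ratio) / total_lookups


# Same true hit ratio and same purge rate, very different client traffic.
low_traffic = reported_hit_ratio(client_rps=1000, client_hit_ratio=0.8, purge_rps=300)
high_traffic = reported_hit_ratio(client_rps=20000, client_hit_ratio=0.8, purge_rps=300)
print(f"low-traffic cluster reports  {low_traffic:.1%}")
print(f"high-traffic cluster reports {high_traffic:.1%}")
```

Under these assumptions the low-traffic cluster reports a much lower ratio than the high-traffic one even though client traffic is served equally well in both cases.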
More useful therefore is to check the actual backend fetches ("backend_req"), and these appear to have gone down some. Annoyingly, every time we restart a Varnish instance we get a spike in the Ganglia graphs, making the long-term graphs pretty much unusable. To fix that we'll either need to patch Ganglia itself or move to some other stats engine (statsd?). So we have a bit of work to do there on the Ops front.
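One possible way to handle this in whatever stats pipeline we end up with, sketched in Python: this assumes the spikes come from the "backend_req" counter starting again from zero after a restart, which I have not verified, and the function name and sample values are made up.

```python
from typing import List, Sequence


def counter_deltas(samples: Sequence[int]) -> List[int]:
    """Per-interval increases of a monotonically increasing counter.

    A sample lower than its predecessor is treated as a restart: the counter
    is assumed to have started again from zero, so the new value itself is
    the increase for that interval.
    """
    deltas = []
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            deltas.append(current - previous)
        else:
            deltas.append(current)  # assumed restart, counter reset to zero
    return deltas


# Hypothetical backend_req samples with a restart between 210 and 30.
print(counter_deltas([100, 150, 210, 30, 90]))
```

This slightly undercounts the interval in which the restart happened, but it avoids a bogus jump in the long-term graph.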
Note that we're about to replace all Varnish caches in eqiad by (fewer) newer, much bigger boxes, and we've decided to also upgrade the 4 mobile boxes with those same specs. And we're also doing that in our new west coast caching data center as well as esams. This will increase the mobile cache size a lot, and will hopefully help by throwing resources at the problem.
Hi Yuri,
Thanks for writing this up. I'll put some comments and questions inline.
On May 30, 2013, at 7:16 PM, Yuri Astrakhan yastrakhan@wikimedia.org wrote:
*== Technical Requirements ==*
- increase Varnish cache hits / minimize cache fragmentation
- Set up and configure new partners without code changes
- Use partner-supplied IP ranges as a preferred alternative to the geo-ip
database for fundraising & analytic teams
Note that a Varnish VMOD to support the latter is being written at the moment by Brandon Black.
*== Current state ==* Zero domain requests set X-Subdomain="ZERO", and treat the request as mobile. The backend uses X-Subdomain and X-CS headers to customize result. The cache is heavily fragmented due to its variance on both of these headers in addition to the variance set by MobileFrontend extension and MediaWiki core.
...and also, variance due to the different hostname (and thus URL).
*== Proposals ==* In order to reduce Zero-caused fragmentation, we propose to shrink from one bucket per carrier (X-CS) to three general buckets:
- smart phones bucket -- banner and site modifications are done on the
client in javascript
- feature phones -- HTML only, the banner is inserted by the ESI
** for carriers with free images
** for carriers without free images
*=== Varnish logic ===*
- Parse User-Agent to distinguish between desktop / mobile / feature phone:
X-Device-Type=desktop|mobile|legacy
Using the OpenDDR library?
- Use IP -> X-CS lookup (under development by OPs) to convert client's IP
into X-CS header
- If X-CS && X-Device-Type == 'legacy': Use IP -> X-Images lookup (same
lookup plugin, different database file) to determine if carrier allows images
Hopefully we can set the X-Images header straight from the ip database.
Since each carrier has its own list of free languages, language links on feature phones will point to origin, which will either silently redirect or ask for confirmation.
Perhaps we can store the list of supported languages for the carrier in the ip database as well?
*=== ZERO vs M ===* Even though I think zero. and m. subdomains should both go the way of the dodo to make each article have just one canonical location (no more linking & Google issues) , this won't happen until we are fully migrated to Varnish and make some mobile code changes (and possibly other changes that I am not aware of).
What do you mean by "until we are fully migrated to Varnish"? MobileFrontend has always exclusively been on Varnish.
At the same time, we should try to get rid of ZERO wherever possible. There are two technical differences between m & zero: zero shows a link to image instead of the actual image, and a big red zero warning is shown if the carrier is not detected. There is also an organizational difference -- some carriers only whitelist zero, some - only m, and some -- both zero & m subdomains.
I'm still a little confused about "m" vs "ZERO" and "images" vs "no images". That probably means others are too. :) Can you elaborate a little on that? I thought that was pretty much the same, but according to your spreadsheet that doesn't seem to be the case?
Overall this sounds reasonable, I think; we'll just need to work out the details.
As Arthur also said in this thread, I'd like to keep zero & m completely aligned, ideally sharing the Varnish cache objects and the mobile device detection at the Varnish level as much as possible. I don't think we disagree here.
wikitech-l@lists.wikimedia.org