Heya,
We are receiving reports [0] that pageview numbers for a small subset of articles are significantly lower than they used to be. See
- http://stats.grok.se/en/latest90/Schizophrenia
- http://stats.grok.se/en/latest90/Cancer
- http://stats.grok.se/en/latest90/Depression_%28mood%29
(those links are for enwiki articles)
What these articles have in common is that Google has indexed them using the https protocol. This, in combination with us no longer sending the Nginx SSL traffic to udp2log (this happened, IIRC, in the week of March 25 - March 31, 2013), explains part of the drop but not all of it.
Webstatscollector, the program that generates the data shown on stats.grok.se, did not deduplicate counts for https, so prior to disabling the sending of SSL traffic to udp2log we were overcounting, and we did expect a 50% drop. However, the drop is larger than 50%, which means something else is going on as well.
For April 29th, 2013, the following counts were calculated for the 'http(s)://en.wikipedia.org/wiki/Cancer' article, using

zcat sampled-1000.tsv.log-20130429.gz | cut -f 12 | grep "http://en.wikipedia.org/wiki/Cancer$" | wc -l

and changing the field (9 for the URL, 12 for the referer) and the scheme (http or https):

========================================================
|            | direct requests | referer hits |
|            | (field 9)       | (field 12)   |
--------------------------------------------------------
| http hits  | 5 (5000)        | 35 (35000)   |
| https hits | 0 (0)           | 65 (65000)   |
========================================================

(The first number is the observed count; the number in parentheses is the estimated absolute count after multiplying by 1000, the sampling factor.)
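The four cells of the table can also be produced in one pass. The following is only a sketch of that counting, not the command that was actually run: it assumes the log is tab-separated (as the .tsv name and cut -f suggest), and the names count_hits and estimate are made up for illustration. It matches the whole field exactly, which is slightly stricter than the grep pattern above.

```python
from collections import Counter

URL_FIELD = 9           # 1-based column of the requested URL, as in the command above
REFERER_FIELD = 12      # 1-based column of the referer
SAMPLING_FACTOR = 1000  # the sampled log keeps 1 in 1000 requests


def count_hits(lines, host_and_path="en.wikipedia.org/wiki/Cancer"):
    """Count sampled hits per (scheme, field) for one article (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < REFERER_FIELD:
            continue  # skip truncated or malformed lines
        for name, index in (("url", URL_FIELD), ("referer", REFERER_FIELD)):
            value = fields[index - 1]
            for scheme in ("http", "https"):
                # Exact match on the whole field, for either scheme.
                if value == f"{scheme}://{host_and_path}":
                    counts[(scheme, name)] += 1
    return counts


def estimate(counts):
    """Scale sampled counts up to estimated absolute numbers."""
    return {key: n * SAMPLING_FACTOR for key, n in counts.items()}
```

To run it against the real file, something like count_hits(gzip.open("sampled-1000.tsv.log-20130429.gz", "rt")) should work after importing gzip; the result is keyed by pairs such as ("https", "referer").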
There are many https hits for the Cancer article in the referer but none in the URL field, which could be an indication that the squids are not correctly logging requests redirected from the Nginx SSL servers. The reason we see so few http hits for the Cancer article is obviously that Google sends people to the https version. Finally, we do see a lot of https hits in the referer; these are mostly requests to the upload domain, which suggests that many people are in fact reading this article.
Solutions
There are at least two different solutions to this problem:
1) Stop Google from indexing https articles by adding a <link rel="canonical" href="http://*.wikipedia.org/wiki/Foo" /> to every page. I believe this could be done in MediaWiki. The problem is similar to Google indexing the articles on the .m. domains, and we resolved that as well.
2) Make sure that https hits are properly logged by Squid (assuming that is the problem).
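To make option 1 concrete, here is a minimal sketch of how the canonical tag could be derived from the URL that was requested. It is not MediaWiki code (MediaWiki is written in PHP); the function name canonical_link is hypothetical, and it assumes plain /wiki/Title URLs, so the query string and fragment are simply dropped.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def canonical_link(request_url, scheme="http"):
    """Build a rel=canonical tag that points at the http version of a page.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(request_url)
    # Keep host and path, force the scheme, drop query string and fragment.
    canonical = urlunsplit((scheme, parts.netloc, parts.path, "", ""))
    return f'<link rel="canonical" href="{escape(canonical, quote=True)}" />'
```

Called with "https://en.wikipedia.org/wiki/Cancer", this yields a tag whose href is the http URL of the same article, which is what a crawler would then be asked to index.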
I am sure there are other possible solutions, including setting the X-Proto-For header, so please chime in if you disagree with the diagnosis or have an alternative solution.
Best,
Diederik
[0]
* http://en.wikipedia.org/wiki/Wikipedia:Village_pump_%28miscellaneous%29/Arch...
* http://en.wikipedia.org/wiki/Wikipedia:VPT#Page_view_stats_declining_from_22...
* http://en.wikipedia.org/wiki/User_talk:Eloquence#View_stats_crashing_on_some...
* User_talk:West.andrew.g#Page_view_stats_crashing_on_some.2C_but_not_all.2C_articles
* http://en.wikipedia.org/wiki/User_talk:Jimbo_Wales#Page_view_stats_crashing_...
Can we exclude as a possible cause the launch of Google Quick View, which was publicly announced on April 16, as per my note to mobile-tech (copied below)? The timing of the pv drop in the examples you cite looks suspiciously close to the launch.
"Google Search for mobile has a new feature called "Quick View". Right now, it only shows up for Wikipedia results and it allows users to load search results almost instantly."
http://insidesearch.blogspot.com/2013/04/making-your-mobile-search-faster_16...
This is the first time I have heard about this feature (it looks like it has only been publicly announced today). As far as I understand, when clicking on the Quick View button, users of Google search on mobile will see a version of a Wikipedia article cached/hosted by Google as opposed to the live version. It makes perfect business sense for Google (same strategy as the Knowledge Graph to minimize outgoing traffic to Wikipedia) but will badly affect our mobile traffic.
Dario
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
I do not think that the underreporting of https articles is caused by Google Quick View. If Google Quick View were the cause, it would affect all articles; this issue only affects articles that are indexed under the https protocol (and that is a pretty small group).
D
This assumes that Quick View caching targets all articles at the same rate, regardless of the volume of search queries or other factors, which sounds unlikely. They also confirmed that they are making incremental releases of Quick View on small batches of articles, and I understand that we asked them for aggregate data on Quick View requests, which we could use to exclude this hypothesis.
I think the problem is in the data analysis.
root@gadolinium:/a/log/webrequest# mawk '{if ($9 ~ /en\.wikipedia\.org\/wiki\/Cancer$/) { print }}' sampled-1000.tsv.log | head -1
cp1007.eqiad.wmnet 458279850 2013-05-09T11:50:31.328 300 208.80.154.134 TCP_MISS/200 81359 GET http://en.wikipedia.org/wiki/Cancer CARP/10.64.0.136 text/html https://www.google.com/ 173.13.112.253 Mozilla/5.0%20(Windows%20NT%206.1;%20WOW64)%20AppleWebKit/537.31%20(KHTML,%20like%20Gecko)%20Chrome/26.0.1410.64%20Safari/537.31 en-US,en;q=0.8 -
This is the first entry for enwiki/Cancer in the current log, and it's an https request referred from Google as logged by squid. Squid doesn't take https requests, so you'll never see https in the request URL. But note, $4 = sl1002, $12 = https://www.google.com/. This is exactly how this request should be expected to be logged from squid.
It would be better to always analyze requests as logged from the first tier. Process the nginx logs, while filtering squid logs where $4 matches any of our production subnets. The latter should be done anyway for accuracy.
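The squid-side filter could look roughly like the sketch below. It rests on several assumptions: the subnet list is a placeholder rather than the real list of production networks, the helper names are invented, the log is taken to be tab-separated, and the column holding the client IP is passed in as a parameter because it depends on the log format.

```python
import ipaddress

# Placeholder networks for illustration only; the real list of production
# subnets is an assumption here and would have to come from operations.
PRODUCTION_SUBNETS = [
    ipaddress.ip_network(net) for net in ("10.0.0.0/8", "208.80.152.0/22")
]


def is_internal(ip_text, subnets=PRODUCTION_SUBNETS):
    """Return True if ip_text parses as an address inside one of the subnets."""
    try:
        ip = ipaddress.ip_address(ip_text)
    except ValueError:
        return False  # not an IP address at all, e.g. "-"
    return any(ip in net for net in subnets)


def external_requests(lines, client_ip_field, sep="\t"):
    """Yield only the squid log lines whose client IP is not internal.

    client_ip_field is the 1-based column that holds the client IP.
    """
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if len(fields) < client_ip_field:
            continue  # skip truncated or malformed lines
        if not is_internal(fields[client_ip_field - 1]):
            yield line
```

Requests that reached squid through the nginx SSL servers are dropped by this filter, so they would only be counted once, from the nginx logs.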
-A
Hey Asher,
Thanks so much for your reply! The analysis might have been incomplete, but your point that field 4 would match an Nginx server reminded me of the root cause: webstatscollector filters internal IPs, and so it was no longer counting any https traffic. The fix is straightforward :)
Again, thanks a lot! D
On 10/05/13 08:01, Diederik van Liere wrote:
What these articles have in common is that Google has indexed them using the https protocol. This, in combination with us no longer sending the Nginx SSL traffic to udp2log (this happened, IIRC, in the week of March 25 - March 31, 2013), explains part of the drop but not all of it.
It is concerning to me that Google is randomly sending traffic to the HTTPS gateway rather than us being in control of that migration process. Maybe it is time to send a rel=canonical link on all pages, like we do for uz.wikipedia.org but in the reverse direction?
-- Tim Starling
There are two bugs in Bugzilla that track this feature:
* Canonical URL on all content pages (https://bugzilla.wikimedia.org/show_bug.cgi?id=25882)
* Implement a way to set a canonical url in OutputPage (https://bugzilla.wikimedia.org/show_bug.cgi?id=28602)
D