Heya,

We are receiving reports [0] that pageview numbers for a small subset of articles are significantly lower than they used to be. See
* http://stats.grok.se/en/latest90/Schizophrenia
* http://stats.grok.se/en/latest90/Cancer
* http://stats.grok.se/en/latest90/Depression_%28mood%29 
(those links are for enwiki articles)

What these articles have in common is that Google has indexed them using the https protocol. This, in combination with us no longer sending the Nginx SSL traffic to udp2log (this happened, IIRC, in the week of March 25 - March 31, 2013), explains part of the drop, but not all of it.

Webstatscollector, the program that generates the data shown on stats.grok.se, did not deduplicate counts for https. In other words, prior to disabling the sending of SSL traffic to udp2log we were overcounting, and so we did expect a 50% drop. However, the drop is larger than 50%, which means something else is going on as well.

For April 29th, 2013, the following counts were calculated for the 'http(s)://en.wikipedia.org/wiki/Cancer' article using

    zcat sampled-1000.tsv.log-20130429.gz | cut -f 12 | grep "http://en.wikipedia.org/wiki/Cancer$" | wc -l

with the field set to 9 (url) or 12 (referer) and the scheme set to http or https:

========================================================
|            | direct requests   |  referer hits       |
|            |   (field 9)       |   (field 12)        |
--------------------------------------------------------
| http hits  |        5  (5000)  |       35 (35000)    |
| https hits |        0  (0)     |       65 (65000)    |
========================================================

(The first number is the observed count; the number in parentheses is the estimated absolute count after multiplying by 1000, the sampling factor.)
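
For anyone who wants to reproduce the four cells of the table, here is a minimal sketch of the counting step as a reusable shell function. The function name count_hits is made up for illustration. It uses an exact string comparison in awk rather than the grep suffix match from the one-liner above, so the numbers could differ slightly if a field contains extra text before the URL. The field numbers (9 and 12) and the sampling factor of 1000 are the ones described in this mail.

```shell
#!/bin/sh
# count_hits FIELD URL FILE   (hypothetical helper, not part of any existing tool)
# Prints "<observed> (<observed * 1000>)" for lines of a gzipped,
# tab-separated sampled log whose FIELD is exactly equal to URL.
# Assumption: the sampling factor is 1000, as for the sampled-1000 logs.
count_hits() {
    field=$1
    url=$2
    file=$3
    # gzip -dc is used instead of zcat for portability.
    gzip -dc "$file" |
        awk -F '\t' -v f="$field" -v u="$url" '
            $f == u { n++ }
            END     { printf "%d (%d)\n", n, n * 1000 }'
}
```

Usage, assuming the log file is in the current directory: count_hits 12 "https://en.wikipedia.org/wiki/Cancer" sampled-1000.tsv.log-20130429.gz gives the https referer cell; use field 9 for the url column and swap the scheme for the http row.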

There are many https hits for the cancer article in the referer field but none in the URL field, which could be an indication that the squids are not correctly logging Nginx SSL redirected requests. We see so few http hits for the cancer article because Google sends people to the https version. Finally, we do see a lot of https hits in the referer field; these are mostly requests to the upload domain, which suggests that many people are in fact reading this article.

Solutions

There are at least two different solutions to solve this problem:
1) Stop Google from indexing https articles by adding a <link rel="canonical" href="http://*.wikipedia.org/wiki/Foo" /> to every page. I believe this could be done in MediaWiki. The problem is similar to Google indexing the articles on the .m. domains, which we resolved as well.

2) Make sure that https hits are properly logged by Squid (assuming that is the problem).
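
To make solution 1 concrete, here is a rough sketch of what the canonical tag would look like for a given page URL. This is only an illustration in shell, not MediaWiki code; the function name canonical_link is hypothetical, and it assumes the canonical form is simply the same URL with the http scheme. It does no HTML escaping, so it is only suitable for plain article URLs.

```shell
#!/bin/sh
# canonical_link URL   (hypothetical helper, for illustration only)
# Rewrites an https:// URL to http:// and prints the corresponding
# <link rel="canonical"> element. Other URLs are passed through unchanged.
canonical_link() {
    url=$1
    case $url in
        https://*) url="http://${url#https://}" ;;
    esac
    printf '<link rel="canonical" href="%s" />\n' "$url"
}
```

For example, canonical_link "https://en.wikipedia.org/wiki/Cancer" prints a tag whose href is http://en.wikipedia.org/wiki/Cancer, which is what a crawler would then treat as the preferred URL.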

I am sure there are other possible solutions, including setting the X-Proto-For header, so please chime in if you disagree with the diagnosis or have an alternative solution.



Best,

Diederik

[0]
* http://en.wikipedia.org/wiki/Wikipedia:Village_pump_%28miscellaneous%29/Archive_42#Page_view_stats_declining_from_22_April
* http://en.wikipedia.org/wiki/Wikipedia:VPT#Page_view_stats_declining_from_22_April
* http://en.wikipedia.org/wiki/User_talk:Eloquence#View_stats_crashing_on_some_pages
* User_talk:West.andrew.g#Page_view_stats_crashing_on_some.2C_but_not_all.2C_articles
* http://en.wikipedia.org/wiki/User_talk:Jimbo_Wales#Page_view_stats_crashing_on_some.2C_but_not_all.2C_articles