On Thu, May 9, 2013 at 3:01 PM, Diederik van Liere
<dvanliere(a)wikimedia.org> wrote:
Heya,
We are receiving reports [0] that pageview numbers for a small subset of
articles are significantly lower than they used to be. See
* http://stats.grok.se/en/latest90/Schizophrenia
* http://stats.grok.se/en/latest90/Cancer
* http://stats.grok.se/en/latest90/Depression_%28mood%29
(those links are for enwiki articles)
What these articles have in common is that Google has indexed them using
the https protocol. This, in combination with us no longer sending the
Nginx SSL traffic to udp2log (this happened, IIRC, in the week of March
25 - March 31, 2013), explains part of the drop but not all of it.
Webstatscollector, the program that generates the data shown on
stats.grok.se, did not deduplicate counts for https requests, so we did
expect a 50% drop. Thus, prior to disabling sending SSL traffic to
udp2log, we were overcounting. However, the drop is larger than 50%,
which means something else is going on as well.
For April 29th, 2013, the following counts were calculated for the
'http(s)://en.wikipedia.org/wiki/Cancer' article, using

zcat sampled-1000.tsv.log-20130429.gz | cut -f 12 | grep "http://en.wikipedia.org/wiki/Cancer$" | wc -l

and switching field 12 to field 9 (referer vs. url) and http to https:
========================================================
| | direct requests | referer hits |
| | (field 9) | (field 12) |
--------------------------------------------------------
| http hits | 5 (5000) | 35 (35000) |
| https hits | 0 (0) | 65 (65000) |
========================================================
(The first number is the actual observed count; the numbers in
parentheses are the absolute numbers after multiplying by 1000, as that
is the sampling factor.)
There are many https hits for the cancer article in the referer but none
in the URL field, which could be an indication that the squids are not
correctly logging Nginx SSL redirected requests. The reason we see so few
http hits for the cancer article is obviously that Google sends people
to the https version. Finally, we do see a lot of https hits in the
referer; this is mostly to the upload domain, and it suggests that many
people are in fact reading this article.
I think the problem is in the data analysis.
root@gadolinium:/a/log/webrequest# mawk '{if ($9 ~ /en.wikipedia.org\/wiki\/Cancer$/) { print }}' sampled-1000.tsv.log | head -1
cp1007.eqiad.wmnet 458279850 2013-05-09T11:50:31.328 300
208.80.154.134 TCP_MISS/200 81359 GET
http://en.wikipedia.org/wiki/Cancer CARP/10.64.0.136
text/html
https://www.google.com/ 173.13.112.253
Mozilla/5.0%20(Windows%20NT%206.1;%20WOW64)%20AppleWebKit/537.31%20(KHTML,%20like%20Gecko)%20Chrome/26.0.1410.64%20Safari/537.31
en-US,en;q=0.8 -
This is the first entry for enwiki/Cancer in the current log, and it's an
https request referred from Google, as logged by squid. Squid doesn't
take https requests, so you'll never see https in the request url. But
note: $4 = sl1002, $12 = https://www.google.com/. This is exactly how
this request should be expected to be logged from squid.
It would be better to always analyze requests as logged from the first
tier: process the nginx logs, while filtering out squid log lines where
$4 matches any of our production subnets. The latter should be done
anyway, for accuracy.
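A minimal sketch of that filter is below. The field position (5, the
position of the client IP in the log line pasted above) and the subnet
prefixes are illustrative assumptions, not the actual production list:

```shell
# Sketch: drop squid log lines whose client-IP field falls in a
# production subnet, so SSL-terminated requests are analyzed only from
# the nginx logs instead of being double-counted. Both ipfield and the
# prefix list below are assumptions for illustration.
printf '%s\n' \
  'cp1007.eqiad.wmnet 458279850 2013-05-09T11:50:31.328 300 208.80.154.134 TCP_MISS/200' \
  'cp1007.eqiad.wmnet 458279851 2013-05-09T11:50:32.100 210 173.13.112.253 TCP_MISS/200' |
awk -v ipfield=5 '$ipfield !~ /^(208\.80\.15[2-5]\.|10\.)/'
# Only the second line (an external client IP) survives the filter.
```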
-A