I am trying to migrate the Limn graphs to our own handling. Currently the Zero graphs are generated as Limn dashboards. After I applied this filter (taken from the HQL for counting article pageviews), our graphs roughly matched Limn's (about a 10% discrepancy). Yet one partner is off by a factor of 10, and I would like to see where that mismatch comes from. I looked at https://github.com/wikimedia/analytics-wp-zero but it seems there is other code that's missing from that repo. Any suggestions are welcome. Thanks!
WHERE webrequest_source IN ('text', 'mobile')
  AND year=${year}
  AND month=${month}
  AND day=${day}
  AND x_analytics LIKE '%zero=%'
  AND SUBSTR(uri_path, 1, 6) = '/wiki/'
  AND (
        (
          SUBSTR(ip, 1, 9) != '10.128.0.'
          AND SUBSTR(ip, 1, 11) NOT IN (
            '208.80.152.', '208.80.153.', '208.80.154.', '208.80.155.', '91.198.174.'
          )
        )
        OR x_forwarded_for != '-'
      )
  AND SUBSTR(uri_path, 1, 31) != '/wiki/Special:CentralAutoLogin/'
  AND http_status NOT IN ('301', '302', '303')
  AND uri_host RLIKE '^[A-Za-z0-9-]+(\.(zero|m))?\.[a-z]*\.org$'
  AND NOT (SPLIT(TRANSLATE(SUBSTR(uri_path, 7), ' ', '_'), '#')[0] RLIKE '^[Uu]ndefined$')
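To see which clause of this filter affects a given partner most, one option is to replay the clauses one by one over a sample of that partner's requests. The following is only a rough Python sketch of that idea, not the deployed code. It assumes each row is a dict keyed by the column names used in the filter, it leaves out the year/month/day partition clauses, and it does not reproduce Hive's NULL handling. The names CLAUSES and rejection_counts are made up for illustration.

```python
import re
from collections import Counter

# Prefixes copied from the filter above.
WMF_IP_PREFIXES = (
    "208.80.152.", "208.80.153.", "208.80.154.", "208.80.155.", "91.198.174.",
)
HOST_RE = re.compile(r"^[A-Za-z0-9-]+(\.(zero|m))?\.[a-z]*\.org$")
UNDEFINED_RE = re.compile(r"^[Uu]ndefined$")


def page_title(uri_path):
    # Mirrors SPLIT(TRANSLATE(SUBSTR(uri_path, 7), ' ', '_'), '#')[0];
    # Hive's SUBSTR is 1-indexed, so position 7 is Python index 6.
    return uri_path[6:].replace(" ", "_").split("#")[0]


# One named predicate per clause of the WHERE condition (hypothetical names).
CLAUSES = {
    "source": lambda r: r["webrequest_source"] in ("text", "mobile"),
    "zero_tag": lambda r: "zero=" in r["x_analytics"],
    "wiki_path": lambda r: r["uri_path"][:6] == "/wiki/",
    "external_client": lambda r: (
        (r["ip"][:9] != "10.128.0." and r["ip"][:11] not in WMF_IP_PREFIXES)
        or r["x_forwarded_for"] != "-"
    ),
    "not_autologin": lambda r: r["uri_path"][:31] != "/wiki/Special:CentralAutoLogin/",
    "not_redirect": lambda r: r["http_status"] not in ("301", "302", "303"),
    "host": lambda r: HOST_RE.search(r["uri_host"]) is not None,
    "title_not_undefined": lambda r: UNDEFINED_RE.search(page_title(r["uri_path"])) is None,
}


def rejection_counts(rows):
    """Count, per clause, how many rows fail it; passing rows count as 'accepted'."""
    counts = Counter()
    for row in rows:
        failed = [name for name, check in CLAUSES.items() if not check(row)]
        counts.update(failed or ["accepted"])
    return counts
```

Running rejection_counts over rows from the partner with the 10x mismatch, and again over rows from a partner that matches, would show whether a single clause behaves very differently between the two.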
Hi Yuri,
On Mon, Nov 17, 2014 at 10:28:50PM +0200, Yuri Astrakhan wrote:
I looked at https://github.com/wikimedia/analytics-wp-zero but it seems there is other code that's missing from that repo. Any suggestions are welcome.
The remaining code is in the Kraken repository: https://gerrit.wikimedia.org/r/#/admin/projects/analytics/kraken
But as we discussed before, we had to do quite some firefighting around the relevant parts, and the code has not fully flowed back into the master branches.
I pushed the deployed code to https://git.wikimedia.org/tree/analytics%2Fkraken.git/refs%2Fheads%2Fqchris%... and https://git.wikimedia.org/tree/analytics%2Fwp-zero.git/refs%2Fheads%2Fqchris... so you can look for yourself. But be warned that the code has many issues, as we have discussed many times before.
And since you said that you'll handle the graphs, the kraken code did not see much love.
On the processing side, the main entry points are:
https://git.wikimedia.org/blob/analytics%2Fkraken.git/91a5fc00acbdb13f81cd42...
https://git.wikimedia.org/blob/analytics%2Fkraken.git/91a5fc00acbdb13f81cd42...
On the definition side, I guess the most relevant part is
http://git.wikimedia.org/blob/analytics%2Fkraken.git/91a5fc00acbdb13f81cd421...
WHERE webrequest_source IN ('text', 'mobile')[...]
The HiveQL snippet you pasted looks like it comes from the Hive re-implementation [1] of the webstatscollector pageview definition. However, the webstatscollector pageview definition does not match Kraken's pageview definitions.
So I would not expect them to agree in the first place. A 10-fold difference is pretty bad, but both definitions are bad in so many ways that a shortcoming in one of them can easily skew the numbers considerably.
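To find which partners disagree between the two pipelines, a per-partner comparison of the counts is a reasonable first step. Below is a minimal Python sketch, assuming the counts from each side have already been exported as dicts of partner to pageviews. The function name discrepancy_report, the threshold, and the partner names in the usage note are placeholders, not anything taken from Kraken or Limn.

```python
def discrepancy_report(ours, limn, threshold=2.0):
    """Return (partner, ours, limn, ratio) for partners whose counts differ
    by at least `threshold` times, in either direction."""
    report = []
    for partner in sorted(set(ours) | set(limn)):
        a = ours.get(partner, 0)
        b = limn.get(partner, 0)
        if a == b:
            ratio = 1.0
        elif a == 0 or b == 0:
            # Present on one side only: treat as an unbounded mismatch.
            ratio = float("inf")
        else:
            ratio = max(a, b) / min(a, b)
        if ratio >= threshold:
            report.append((partner, a, b, ratio))
    return report
```

For example, discrepancy_report({"partner-a": 110, "partner-b": 1000}, {"partner-a": 100, "partner-b": 100}) flags only partner-b, with a ratio of 10.0, while the 10% difference for partner-a stays below the threshold.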
I hope the above code pointers get you started.
Have fun,
Christian
[1] https://git.wikimedia.org/blob/analytics%2Frefinery.git/fd54d68a586f081cc9a7...