BTW, Christian foresaw this issue and wrote this: https://github.com/wikimedia/analytics-refinery-source/tree/master/guard
It should be usable for pageviews too, I think. For this issue, a guard that made sure that outreach.wikimedia.org never appeared would have raised an error.
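To illustrate the kind of check meant here, a minimal sketch in Python. This is not the code in the guard directory linked above; the row layout (a `host` field per aggregated row), the `EXCLUDED_HOSTS` deny-list, and the function names are all assumptions made up for the example.

```python
from typing import FrozenSet, Iterable, List, Mapping

# Hypothetical deny-list. outreach.wikimedia.org is the host from this thread;
# donate.wikimedia.org is an assumed hostname for the "donate-wiki" traffic.
EXCLUDED_HOSTS: FrozenSet[str] = frozenset({
    "outreach.wikimedia.org",
    "donate.wikimedia.org",
})


class GuardViolation(Exception):
    """Raised when aggregated data contains a host the definition excludes."""


def find_excluded_hosts(
    rows: Iterable[Mapping[str, object]],
    host_field: str = "host",
    excluded: FrozenSet[str] = EXCLUDED_HOSTS,
) -> List[str]:
    """Return the excluded hosts that appear in rows, sorted alphabetically."""
    seen = {str(row[host_field]).lower() for row in rows}
    return sorted(seen & excluded)


def guard_pageviews(
    rows: Iterable[Mapping[str, object]],
    host_field: str = "host",
    excluded: FrozenSet[str] = EXCLUDED_HOSTS,
) -> None:
    """Fail loudly if any excluded host shows up in the aggregated rows."""
    offenders = find_excluded_hosts(rows, host_field, excluded)
    if offenders:
        raise GuardViolation(
            "excluded hosts present in pageview data: " + ", ".join(offenders)
        )
```

Run against each new batch of aggregated rows, a check like this would turn a silent change in the definition into a failed job, well inside the 90-day window in which the data can still be rerun.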
On Aug 17, 2015, at 14:45, Oliver Keyes okeyes@wikimedia.org wrote:
On 17 August 2015 at 13:48, Joseph Allemandou jallemandou@wikimedia.org wrote:
Hey Oliver,
The analytics team is responsible for the pageview definition. When finding issues, sending an email to the analytics mailing list is the right thing to do :)
Indeed; my point is not about issues reported upstream. My point is that there appears to currently be absolutely no work done to take this (org-level, highest possible priority) KPI and evaluate it every month or every N days to make sure that, even with the gradual accretion of changes to the input data, it is still extracting what we want. It is down to user-reported issues. The problem with this approach is that after 90 days it is impossible to rerun the data; if there is a bug breaking the logs, and it takes more than 90 days to discover it, those logs are simply broken.
In addition, discovering these issues requires a very granular understanding of what the pageviews logs are meant to be capturing that most customers simply will not have. It worked in this case primarily because the customer actually /wrote/ the definition ;p.
For public transparency: Joseph and I talked on IRC and will be working on ways to validate data and detect these kinds of regressions in advance.
On our end, we could surely do a better job of communicating changes in the pageview definition code to anybody interested in reviewing, commenting, or asking for documentation. Emails have been sent regularly about updates on the analytics list, except in the past few months. We shall get back to that good habit and send notifications with explanations of the changes.
Joseph
On Mon, Aug 17, 2015 at 5:15 PM, Oliver Keyes okeyes@wikimedia.org wrote:
You should also note that donate-wiki pageviews are making it into the counts (again, the definition was designed to exclude these).
Whose job is it to review pageviews and update the definition when issues are found?
On 17 August 2015 at 10:32, Oliver Keyes okeyes@wikimedia.org wrote:
Just to clarify; there is no need to ask me before making changes (obviously I find my approval for pageviews changes being sought incredibly flattering, but I am not the only person involved in this project ;p). What I'm more driving towards is directly informing customers when the definition is adapted.
On 17 August 2015 at 10:31, Oliver Keyes okeyes@wikimedia.org wrote:
Excellent; thank you.
On 17 August 2015 at 04:42, Joseph Allemandou jallemandou@wikimedia.org wrote:
Oliver,
It was a mistake on my part to add the 'outreach' subdomain without asking you.
From a documentation perspective, the analytics team uses this page to document changes: https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest and I didn't know about the up-to-date documentation you sent.
Tickets have been created to both correct the bug and update the documentation pages.
Joseph
On Sun, Aug 16, 2015 at 8:47 PM, Oliver Keyes okeyes@wikimedia.org wrote:
> Ah, I see the problem; someone patched it and never documented it.
>
> We have documentation at
> https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters
> of the generalised filters. There is also a log, on
> https://meta.wikimedia.org/wiki/Research:Page_view, of changes to the
> pageview definition.
>
> The intent behind both the transparent definition and the log is to
> ensure that we know what is going /in/ the definition.
>
> In this case, somebody has patched the definition
> (https://github.com/wikimedia/analytics-refinery-source/commit/cc0b6ed7e4f403...)
> to include traffic from outreach.wikimedia.org - a site that was very
> deliberately and very explicitly excluded from the definition as it
> was written.
>
> There is no explanation of why this change was made, there is no
> documentation of this change even existing outside the actual
> Java....
> can someone please explain what this is for, and update all the
> documentation to reflect that? And then could people be very, very
> clear in future that it is expected there be a log of alterations you
> make to high-level KPIs beyond the, you know, commit logs.
>
> On 16 August 2015 at 14:32, Madhumitha Viswanathan
> mviswanathan@wikimedia.org wrote:
>> The new one.
>>
>> The code that generates it -
>>
>> - https://github.com/wikimedia/analytics-refinery/blob/master/hive/pageview/ho...
>> - https://github.com/wikimedia/analytics-refinery/tree/master/oozie/pageview/h...
>>
>> On Sun, Aug 16, 2015 at 11:01 AM, Oliver Keyes okeyes@wikimedia.org wrote:
>>> Is the pageviews_hourly table meant to contain pageviews according to
>>> the new or old definition? If old, where can I find aggregates for the
>>> new one?
>>>
>>> --
>>> Oliver Keyes
>>> Count Logula
>>> Wikimedia Foundation
>>>
>>> _______________________________________________
>>> Analytics mailing list
>>> Analytics@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>> --
>> --Madhu :)
>
> --
> Oliver Keyes
> Count Logula
> Wikimedia Foundation
-- Joseph Allemandou Data Engineer @ Wikimedia Foundation IRC: joal