Is the pageviews_hourly table meant to contain pageviews according to the new or old definition? If old, where can I find aggregates for the new one?
The new one.
The code that generates it -
- https://github.com/wikimedia/analytics-refinery/blob/master/hive/pageview/ho...
- https://github.com/wikimedia/analytics-refinery/tree/master/oozie/pageview/h...
On Sun, Aug 16, 2015 at 11:01 AM, Oliver Keyes okeyes@wikimedia.org wrote:
> Is the pageviews_hourly table meant to contain pageviews according to the new or old definition? If old, where can I find aggregates for the new one?
> -- Oliver Keyes Count Logula Wikimedia Foundation
> Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Ah, I see the problem; someone patched it and never documented it.
We have documentation at https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters of the generalised filters. There is also a log, on https://meta.wikimedia.org/wiki/Research:Page_view, of changes to the pageview definition.
The intent behind both the transparent definition and the log is to ensure that we know what is going /in/ the definition.
In this case, somebody has patched the definition (https://github.com/wikimedia/analytics-refinery-source/commit/cc0b6ed7e4f403...) to include traffic from outreach.wikimedia.org - a site that was very deliberately and very explicitly excluded from the definition as it was written.
There is no explanation of why this change was made, there is no documentation of this change even existing outside the actual Java.... can someone please explain what this is for, and update all the documentation to reflect that? And then could people be very, very clear in future that it is expected there be a log of alterations you make to high-level KPIs beyond the, you know, commit logs.
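(For readers of the archive: a minimal sketch of the kind of host exclusion being discussed here. It is not the Java code in analytics-refinery-source. The names `EXCLUDED_HOSTS` and `is_countable_host` are hypothetical, and the only host taken from this thread is outreach.wikimedia.org.)

```python
# Hypothetical illustration only; the real filter lives in the refinery's Java code.
# Assumption: the exclusion can be modelled as a simple set of host names.
EXCLUDED_HOSTS = frozenset({"outreach.wikimedia.org"})


def is_countable_host(uri_host: str, excluded=EXCLUDED_HOSTS) -> bool:
    """Return True if a request's host may count towards pageviews."""
    # Normalise case, surrounding whitespace, and a trailing root dot before comparing.
    host = uri_host.strip().lower().rstrip(".")
    return host not in excluded


if __name__ == "__main__":
    for h in ("en.wikipedia.org", "outreach.wikimedia.org"):
        print(h, is_countable_host(h))
```

Keeping the excluded hosts in one named collection means a change to the definition shows up as a one-line diff that is easy to mirror in the change log.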
On 16 August 2015 at 14:32, Madhumitha Viswanathan mviswanathan@wikimedia.org wrote:
> The new one.
> The code that generates it -
> https://github.com/wikimedia/analytics-refinery/blob/master/hive/pageview/ho...
> https://github.com/wikimedia/analytics-refinery/tree/master/oozie/pageview/h...
> -- --Madhu :)
Oliver,
It was a mistake on my part to add the 'outreach' subdomain without asking you.
From a documentation perspective, the analytics team uses this page to document changes: https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest and I didn't know about the up-to-date documentation you sent.
Tickets have been created to both correct the bug and update the documentation pages.
Joseph
On Sun, Aug 16, 2015 at 8:47 PM, Oliver Keyes okeyes@wikimedia.org wrote:
> Ah, I see the problem; someone patched it and never documented it.
> We have documentation at https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters of the generalised filters. There is also a log, on https://meta.wikimedia.org/wiki/Research:Page_view, of changes to the pageview definition.
> The intent behind both the transparent definition and the log is to ensure that we know what is going /in/ the definition.
> In this case, somebody has patched the definition (https://github.com/wikimedia/analytics-refinery-source/commit/cc0b6ed7e4f403...) to include traffic from outreach.wikimedia.org - a site that was very deliberately and very explicitly excluded from the definition as it was written.
> There is no explanation of why this change was made, there is no documentation of this change even existing outside the actual Java.... can someone please explain what this is for, and update all the documentation to reflect that? And then could people be very, very clear in future that it is expected there be a log of alterations you make to high-level KPIs beyond the, you know, commit logs.
Excellent; thank you.
On 17 August 2015 at 04:42, Joseph Allemandou jallemandou@wikimedia.org wrote:
> Oliver,
> It was a mistake on my part to add the 'outreach' subdomain without asking you.
> From a documentation perspective, the analytics team uses this page to document changes: https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest and I didn't know about the up-to-date documentation you sent.
> Tickets have been created to both correct the bug and update the documentation pages.
> Joseph
> -- Joseph Allemandou Data Engineer @ Wikimedia Foundation IRC: joal
Just to clarify; there is no need to ask me before making changes (obviously I find my approval for pageviews changes being sought incredibly flattering, but I am not the only person involved in this project ;p). What I'm more driving towards is directly informing customers when the definition is adapted.
On 17 August 2015 at 10:31, Oliver Keyes okeyes@wikimedia.org wrote:
> Excellent; thank you.
You should also note that donate-wiki pageviews are making it into the counts (again, the definition was designed to exclude these).
Whose job is it to review pageviews and update the definition when issues are found?
On 17 August 2015 at 10:32, Oliver Keyes okeyes@wikimedia.org wrote:
> Just to clarify; there is no need to ask me before making changes (obviously I find my approval for pageviews changes being sought incredibly flattering, but I am not the only person involved in this project ;p). What I'm more driving towards is directly informing customers when the definition is adapted.
Hey Oliver,
The analytics team is responsible for the pageview definition. When finding issues, sending an email to the analytics mailing list is the right thing to do :)
On our end, we could surely do a better job of communicating changes in the pageview definition code, so that anybody interested can review, comment, or ask for documentation. Emails about updates have been sent regularly to the analytics list, except in the past few months. We shall get back to that good habit and send notifications with explanations of the changes.
Joseph
On Mon, Aug 17, 2015 at 5:15 PM, Oliver Keyes okeyes@wikimedia.org wrote:
> You should also note that donate-wiki pageviews are making it into the counts (again, the definition was designed to exclude these).
> Whose job is it to review pageviews and update the definition when issues are found?
On 17 August 2015 at 13:48, Joseph Allemandou jallemandou@wikimedia.org wrote:
> Hey Oliver,
> The analytics team is responsible for the pageview definition. When finding issues, sending an email to the analytics mailing list is the right thing to do :)
Indeed; my point is not about issues reported upstream. My point is that there appears to currently be absolutely no work done to take this (org-level, highest possible priority) KPI and evaluate it every month or every N days to make sure that, even with the gradual accretion of changes to the input data, it is still extracting what we want. It is down to user-reported issues. The problem with this approach is that after 90 days it is impossible to rerun the data; if there is a bug breaking the logs, and it takes more than 90 days to discover it, those logs are simply broken.
In addition, discovering these issues requires a very granular understanding of what the pageviews logs are meant to be capturing that most customers simply will not have. It worked in this case primarily because the customer actually /wrote/ the definition ;p.
For public transparency: Joseph and I talked on IRC and will be working on ways to validate data and detect these kinds of regressions in advance.
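(A small worked example of the 90-day argument above, as a sketch only. The 90-day figure is from this thread; the function names and the dates in the usage note are hypothetical, and whether the cutoff day itself is still rerunnable is an assumption.)

```python
from datetime import date, timedelta

RETENTION_DAYS = 90  # from the thread: data cannot be rerun after 90 days


def last_rerunnable_day(today: date, retention_days: int = RETENTION_DAYS) -> date:
    """Oldest day whose raw logs are assumed to still be available on `today`."""
    return today - timedelta(days=retention_days)


def unrecoverable_days(bug_start: date, discovered: date,
                       retention_days: int = RETENTION_DAYS) -> int:
    """Count affected days that are already older than the retention cutoff."""
    cutoff = last_rerunnable_day(discovered, retention_days)
    # Days from bug_start up to (not including) the cutoff can no longer be rerun.
    return max(0, (cutoff - bug_start).days)
```

With a made-up bug that started on 2015-05-01 and was noticed on 2015-08-17, `unrecoverable_days(date(2015, 5, 1), date(2015, 8, 17))` gives 18 days that could not be recomputed. A check that runs every N days, with N well under 90, keeps that number at zero.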
BTW, Christian foresaw this issue and wrote this: https://github.com/wikimedia/analytics-refinery-source/tree/master/guard
It should be useable for pageviews too, I think. For this issue, a guard that made sure that outreach.wikimedia.org never appeared would have raised an error.
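(A rough sketch of what such a check could look like. This is not the API of the guard framework linked above; `GuardError`, `guard_no_forbidden_hosts`, and the `host` / `view_count` column names are all hypothetical.)

```python
from collections import Counter
from typing import Any, Iterable, Mapping

# Assumption: hosts that must never appear in pageview aggregates.
FORBIDDEN_HOSTS = frozenset({"outreach.wikimedia.org"})


class GuardError(Exception):
    """Raised when aggregated data contains hosts that should never appear."""


def guard_no_forbidden_hosts(rows: Iterable[Mapping[str, Any]],
                             forbidden=FORBIDDEN_HOSTS) -> None:
    """Fail loudly if any aggregated row belongs to a forbidden host."""
    hits = Counter()
    for row in rows:
        host = str(row["host"]).lower()
        if host in forbidden:
            # Rows without a count are treated as a single view.
            hits[host] += int(row.get("view_count", 1))
    if hits:
        details = ", ".join(f"{h}={n}" for h, n in sorted(hits.items()))
        raise GuardError(f"forbidden hosts in pageview data: {details}")
```

Run against each hourly aggregate, a check like this turns a silent change in the definition into a failed job rather than something a customer has to notice.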
On Aug 17, 2015, at 14:45, Oliver Keyes okeyes@wikimedia.org wrote:
> On 17 August 2015 at 13:48, Joseph Allemandou jallemandou@wikimedia.org wrote:
>> Hey Oliver,
>> The analytics team is responsible for the pageview definition. When finding issues, sending an email to the analytics mailing list is the right thing to do :)
> Indeed; my point is not about issues reported upstream. My point is that there appears to currently be absolutely no work done to take this (org-level, highest possible priority) KPI and evaluate it every month or every N days to make sure that, even with the gradual accretion of changes to the input data, it is still extracting what we want. It is down to user-reported issues. The problem with this approach is that after 90 days it is impossible to rerun the data; if there is a bug breaking the logs, and it takes more than 90 days to discover it, those logs are simply broken.
> In addition, discovering these issues requires a very granular understanding of what the pageviews logs are meant to be capturing that most customers simply will not have. It worked in this case primarily because the customer actually /wrote/ the definition ;p.
> For public transparency: Joseph and I talked on IRC and will be working on ways to validate data and detect these kinds of regressions in advance.
This seems perfect. Is it currently used?
On 17 August 2015 at 18:03, Andrew Otto aotto@wikimedia.org wrote:
> BTW, Christian foresaw this issue and wrote this: https://github.com/wikimedia/analytics-refinery-source/tree/master/guard
> It should be useable for pageviews too, I think. For this issue, a guard that made sure that outreach.wikimedia.org never appeared would have raised an error.
On Aug 17, 2015, at 14:45, Oliver Keyes okeyes@wikimedia.org wrote:
On 17 August 2015 at 13:48, Joseph Allemandou jallemandou@wikimedia.org wrote:
Hey Oliver,
The analytics team is responsible for the pageview definition. When finding issues, sending an email to the analytics mailing list is the right thing to do :)
Indeed; my point is not about issues reported upstream. My point is that there appears to currently be absolutely no work done to take this (org-level, highest possible priority) KPI and evaluate it every month or every N days to make sure that, even with the gradual accretion of changes to the input data, it is still extracting what we want. It is down to user-reported issues. The problem with this approach is that after 90 days it is impossible to rerun the data; if there is a bug breaking the logs, and it takes more than 90 days to discover it, those logs are simply broken.
In addition, discovering these issues requires a very granular understanding of what the pageviews logs are meant to be capturing that most customers simply will not have. It worked in this case primarily because the customer actually /wrote/ the definition ;p.
For public transparency: Joseph and I talked on IRC and will be working on ways to validate data and detect these kinds of regressions in advance.
On our end, we could surely do a better job of communicating changes in the pageview definition code for anybody interested to review/comment/ask for documentation. Emails have been sent regularly about updates on the analytics list, except in the past few months. We shall get back to that good habit and send notifications with explanations of the changes.
Joseph
On Mon, Aug 17, 2015 at 5:15 PM, Oliver Keyes okeyes@wikimedia.org wrote:
You should also note that donate-wiki pageviews are making it into the counts (again, the definition was designed to exclude these).
Whose job is it to review pageviews and update the definition when issues are found?
On 17 August 2015 at 10:32, Oliver Keyes okeyes@wikimedia.org wrote:
Just to clarify; there is no need to ask me before making changes (obviously I find my approval for pageviews changes being sought incredibly flattering, but I am not the only person involved in this project ;p). What I'm more driving towards is directly informing customers when the definition is adapted.
On 17 August 2015 at 10:31, Oliver Keyes okeyes@wikimedia.org wrote:
Excellent; thank you.
On 17 August 2015 at 04:42, Joseph Allemandou jallemandou@wikimedia.org wrote:
Oliver,
It was a mistake from me to add the 'outreach' subdomain without asking you.
From a documentation perspective, the analytics team uses that place to document changes: https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest and I didn't know about up-to-date documentation you sent.
Tickets have been created to both correct the bug and update the documentation pages.
Joseph
Tilman, to answer your question, the presentation of analytics at Monthly Metrics Meetings will change month to month. Next month I am on vacation so I have asked Jon to present something. I'm assuming it will have Pageviews and be readership focused - it's up to Jon.
To add a bit:
First, regarding the initial technical discussion about the pageview definition used for pageview_hourly:
It now seems that apart from Outreach wiki, it differs from the earlier Cube v0.5 data also regarding the inclusion of mediawiki.org and wikimediafoundation.org, see https://phabricator.wikimedia.org/T108925 (That bug is about projectview_hourly instead of pageview_hourly, but I understand those two only differ by aggregation level.)
Those discrepancies (still not resolved by the three differing domains) came up while preparing the metrics scorecard for the WMF quarterly report last month. They prevented the inclusion of the usual quarter-over-quarter / year-over-year trend there, and I worry that the upcoming reports for Reading might need to do without that too.
Which brings me to the ownership/QA issue raised by Oliver, where I'd just like to underline the potential value of this kind of proactive regression testing. In the last few years I have filed many error reports regarding Wikistats and reportcard.wmflabs.org, as my work on the quarterly (formerly monthly) WMF reports meant that I was one of the apparently very few people who looked at the core metrics there regularly. An example - regarding the total active editors number instead of pageviews - is https://phabricator.wikimedia.org/T87738 which was filed in January and is still not fully resolved despite significant investigation work. That issue was about historical numbers dropping retroactively by implausibly large amounts, which could totally have been flagged earlier by an automated guard. And I know that other issues were caught by ErikZ's proactive vigilance, which will need to find an equivalent in the upcoming replacement for Wikistats. Recall we are talking about the core metrics of an organization that spends about a million dollars a week.
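As an illustration of the kind of automated guard described above, here is a minimal sketch that compares two successive runs of a report and flags any historical period whose value dropped retroactively. The function name, the dictionary layout, and the 5% threshold are all assumptions made for this example; nothing here is taken from Wikistats or the refinery code.

```python
def retroactive_drops(previous, current, max_drop=0.05):
    """Find periods whose historical value fell by more than `max_drop`.

    `previous` and `current` map a period label (e.g. "2014-06") to the
    value reported for that period in two successive runs of a report.
    Both the layout and the default threshold are hypothetical.
    """
    flagged = {}
    for period, old in previous.items():
        new = current.get(period)
        if new is None or old <= 0:
            continue  # period missing from the new run, or nothing to compare
        drop = (old - new) / old
        if drop > max_drop:
            flagged[period] = round(drop, 4)
    return flagged
```

Such a check only compares numbers that should no longer change, so anything it flags is worth a look before the report is published.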
Back to the initial question on where to find information about the definition of pageview_hourly: I happened to notice that the team had actually already started some quite nice documentation pages at https://wikitech.wikimedia.org/wiki/Analytics/Data/Pageview_hourly etc. I just added some information from this thread there and look forward to the fuller integration of the pageview definition documentation.
On Mon, Aug 17, 2015 at 5:18 PM, Kevin Leduc kevin@wikimedia.org wrote:
Tilman, to answer your question, the presentation of analytics at Monthly Metrics Meetings will change month to month. Next month I am on vacation so I have asked Jon to present something. I'm assuming it will have Pageviews and be readership focused - it's up to Jon.
(in case this caused confusion for other readers: this question had been asked offlist)
Thanks! The reason I asked was to know whether there will be another regular consumer of this pageview data again. Elderly citizens among us will recall that the Metrics meeting used to include a look at core reader and editor metrics each month. But I understand we are not going back to that format.
Indeed. For transparency, Joseph, Andrew and I had a meeting late last week to talk about how we handle these issues. The resolution was to go for positive, as well as negative, checking, probably using Christian's "guard" framework.
So, for example, suppose we want to make sure projects are what we want; one way is to have unit tests that contain things we do and don't want and to make sure they all pass on example data. But in addition we can build a list of /all/ the projects we want and have the pageviews_hourly table run through that list once every N, issuing an error if there are projects that appear that aren't in the list. Sometimes they will be false positives, but that is the advantage of positive checks - when it is wrong it tells you. When unit tests are wrong they don't always ;)
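The positive check described above could be sketched roughly as follows. This is a minimal illustration, not the "guard" framework itself: the allow-list contents, the function names, and the (project, view_count) row shape are all hypothetical stand-ins for whatever the real table and project list look like.

```python
from collections import Counter

# Hypothetical allow-list; the real list of wanted projects is not given here.
ALLOWED_PROJECTS = frozenset({"en.wikipedia", "de.wikipedia", "commons.wikimedia"})


def unexpected_projects(rows, allowed=ALLOWED_PROJECTS):
    """Return total views per project for projects not on the allow-list.

    `rows` is an iterable of (project, view_count) pairs, standing in for
    an aggregate pulled from the pageview table.
    """
    unexpected = Counter()
    for project, views in rows:
        if project not in allowed:
            unexpected[project] += views
    return dict(unexpected)


def check_projects(rows, allowed=ALLOWED_PROJECTS):
    """Raise an error if any project outside the allow-list appears."""
    unexpected = unexpected_projects(rows, allowed)
    if unexpected:
        raise ValueError(f"projects not in allow-list: {sorted(unexpected)}")
```

Run on a schedule, a check like this fails loudly when a new project starts appearing in the data, whether or not anyone thought to write a unit test for it.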
On 23 August 2015 at 04:34, Federico Leva (Nemo) nemowiki@gmail.com wrote:
Tilman Bayer, 22/08/2015 19:33:
And I know that other issues were caught by ErikZ's proactive vigilance, which will need to find an equivalent in the upcoming replacement for Wikistats.
+1
Nemo
Whose job is it to review pageviews and update the definition when issues are found?
I see the thread evolved a bit today. But I'll note this for people going through the archives:
There seem to be a few levels of review of pageviews. There's been stuff for the monthly metrics meetings (e.g., earlier this month Kevin Leduc ran some reporting). Tilman Bayer is also working on some regular reports for Reading; he has generated quarterly scorecards around this sort of data in the past, too. Reading is a customer of the data so to speak. I think a lot of us are doing ad hoc lookups from time to time.
For reporting issues, my working understanding is that if someone notices an issue we should submit a bug against #analytics in Phabricator, with Analytics implementing updates as needed (as Oliver noted in a later report, how to systematize review is a question he and Joseph will look to answer).
On 17 August 2015 at 16:20, Adam Baso abaso@wikimedia.org wrote:
Whose job is it to review pageviews and update the definition when issues are found?
I see the thread evolved a bit today. But I'll note this for people going through the archives:
There seem to be a few levels of review of pageviews. There's been stuff for the monthly metrics meetings (e.g., earlier this month Kevin Leduc ran some reporting). Tilman Bayer is also working on some regular reports for Reading; he has generated quarterly scorecards around this sort of data in the past, too. Reading is a customer of the data so to speak. I think a lot of us are doing ad hoc lookups from time to time.
Yeah, I wasn't talking about review in the sense of using it, I was talking about review in the sense of actively looking for issues.