Hey all,
After the patches to the definition following the previous hand-coding run (see older threads) I've run a second set of tests. These can be seen at https://commons.wikimedia.org/wiki/File:Pageviews_QA_2.png and https://commons.wikimedia.org/wiki/File:Pageviews_QA_jittered_2.png
There's nothing particularly shocking in the new definition; it follows the seasonal pattern that we're used to. I think we can call the new definition done, with these tweaks! It's also not as unstable as the legacy definition (good luck to whoever now has the responsibility of explaining why pageviews abruptly halved in the middle of February).
Have fun,
I'd rather see you explain this, Oliver, as our incumbent page views expert. Your concoction of legacy PV seems to suggest 'Old definition, UDF' was about 1.1B per day.
Yet http://stats.wikimedia.org/EN/TablesPageViewsMonthlyAllProjects.htm shows 20B per month, 0.75B per day
Erik
-----Original Message----- From: analytics-bounces@lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Oliver Keyes Sent: Thursday, March 12, 2015 19:38 To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics. Subject: [Analytics] [Technical] final pageviews QA
Hey all,
After the patches to the definition following the previous hand-coding run (see older threads) I've run a second set of tests. These can be seen at https://commons.wikimedia.org/wiki/File:Pageviews_QA_2.png and https://commons.wikimedia.org/wiki/File:Pageviews_QA_jittered_2.png
There's nothing particularly shocking in the new definition; it follows the seasonal pattern that we're used to. I think we can call the new definition done, with these tweaks! It's also not as unstable as the legacy definition (good luck to whoever now has the responsibility of explaining why pageviews abruptly halved in the middle of February).
Have fun, -- Oliver Keyes Research Analyst Wikimedia Foundation
_______________________________________________ Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Well, I'm no longer our resident anything expert, merely /a/ anything expert :).
The "concoction", as you put it, comes from the webrequest_all_sites data that is consumed by stats.wikimedia.org's primary report - I can't speak for how the dashboard you're linking to is constructed. Perhaps you could? I doubt this is a "concoction" problem given that, as you will note if you've studied the visualisations, both the UDF and the hive query implementation (which were written by two different people, and code reviewed by two /more/ people) agree that this dramatic, unexplained and untracked drop happened. And, since we've been using the hive query implementation for all our high-level numbers for about six months, a bug of this magnitude in the /implementation/ of the definition would be....worrying.
Indeed, your report says 20B per month (again, is it drawing from the same data source as the aggregate, high-level number?) - I never claimed 1.1B a day, you did. Instead, it started off as approximately 1.1-1.2Bn, before dropping down to between 600m and 700m, where it has resided ever since. That sounds, averaged, like approximately 0.75B, no? The disadvantage of comparing a single monthly number against a more granular dataset.
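As a rough check on the averaging argument above, here is a minimal sketch of how a step change in daily pageviews rolls up into a monthly total and a daily mean. The 28-day month, the changeover days, and the midpoints 1.15 and 0.65 (billions per day, taken from the middle of the ranges quoted above) are illustrative assumptions, not measured values.

```python
def monthly_summary(days_in_month, changeover_day, level_before, level_after):
    """Return (monthly_total, daily_mean) for a series that sits at
    level_before for the first changeover_day days and at level_after
    for the remaining days of the month."""
    days_after = days_in_month - changeover_day
    total = changeover_day * level_before + days_after * level_after
    return total, total / days_in_month


# Assumed levels in billions of pageviews per day (midpoints of the quoted ranges).
HIGH = 1.15
LOW = 0.65

# Try several hypothetical changeover days within a 28-day month.
for changeover in (0, 7, 14, 28):
    total, mean = monthly_summary(28, changeover, HIGH, LOW)
    print(f"drop after day {changeover:2d}: total {total:5.1f}B, mean {mean:.2f}B/day")
```

The daily mean depends heavily on when in the month the drop happens, so a single monthly figure cannot by itself confirm or rule out a mid-month step.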
On 12 March 2015 at 17:55, Erik Zachte ezachte@wikimedia.org wrote:
I'd rather see you explain this, Oliver, as our incumbent page views expert. Your concoction of legacy PV seems to suggest 'Old definition, UDF' was about 1.1B per day.
Yet http://stats.wikimedia.org/EN/TablesPageViewsMonthlyAllProjects.htm shows 20B per month, 0.75B per day
Erik
I'm also confused. As I understand it, stats.wikimedia.org is consuming the data that is represented by the green line in your graph. Therefore we would see this drop in the wikistats data that Erik referred to, but we don't. I think we need to understand why this is so.
-Toby
On Thu, Mar 12, 2015 at 3:10 PM, Oliver Keyes okeyes@wikimedia.org wrote:
Well, I'm no longer our resident anything expert, merely /a/ anything expert :).
The "concoction", as you put it, comes from the webrequest_all_sites data that is consumed by stats.wikimedia.org's primary report - I can't speak for how the dashboard you're linking to is constructed. Perhaps you could? I doubt this is a "concoction" problem given that, as you will note if you've studied the visualisations, both the UDF and the hive query implementation (which were written by two different people, and code reviewed by two /more/ people) agree that this dramatic, unexplained and untracked drop happened. And, since we've been using the hive query implementation for all our high-level numbers for about six months, a bug of this magnitude in the /implementation/ of the definition would be....worrying.
Indeed, your report says 20B per month (again, is it drawing from the same data source as the aggregate, high-level number?) - I never claimed 1.1B a day, you did. Instead, it started off as approximately 1.1-1.2Bn, before dropping down to between 600m and 700m, where it has resided ever since. That sounds, averaged, like approximately 0.75B, no? The disadvantage of comparing a single monthly number against a more granular dataset.
Well, again; the wikistats data that Erik refers to doesn't have any granularity within the period this dataset covers. Monthly data misses sub-monthly noise - like a massive transition that only kicks in on the day-by-day.
On 12 March 2015 at 18:21, Toby Negrin tnegrin@wikimedia.org wrote:
I'm also confused. As I understand it, stats.wikimedia.org is consuming the data that is represented by the green line in your graph. Therefore we would see this drop in the wikistats data that Erik referred to, but we don't. I think we need to understand why this is so.
-Toby
Can we compare the monthly totals?
On Thu, Mar 12, 2015 at 3:29 PM, Oliver Keyes okeyes@wikimedia.org wrote:
Well, again; the wikistats data that Erik refers to doesn't have any granularity within the period this dataset covers. Monthly data misses sub-monthly noise - like a massive transition that only kicks in on the day-by-day.
Certainly; running now.
On 12 March 2015 at 18:33, Toby Negrin tnegrin@wikimedia.org wrote:
Can we compare the monthly totals?
Hmn. And now the UDF, the hive query, and the monthly aggregate of the hive query, all disagree with http://stats.wikimedia.org/EN/TablesPageViewsMonthlyAllProjects.htm . All of the aforementioned sources come up with 24bn, not 20.38bn. Erik, how is your data constructed from the pagecounts files, exactly? It's not made clear.
I'd find it easier to believe it was an implementation problem if the UDF and hive query didn't agree. Could it be some distinction in how the subsidiary hive table is turned into stats.wikimedia.org numbers, from the "raw" count of pageviews?
In any case, this is now going somewhat beyond "Oliver, please run a quick final check on the final definition"; that check has been run and shows a pretty stable definition, without any odd day-to-day yo-yoing and a clear week/weekend pattern, which is what we expect. For additional analysis, I'd suggest either assigning someone to this task (presumably whoever is maintaining the definition now) or, of course, asking Erik if you could borrow me. I'm always happy to help out when I have the time :).
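To put the size of the disagreement in perspective, a minimal sketch of the relative difference between the two monthly totals quoted above. The figures are the rounded ones from this message (24bn and 20.38bn); the function name is only illustrative.

```python
def relative_difference(value, reference):
    """Difference of value from reference, as a fraction of reference."""
    return (value - reference) / reference


# Monthly totals in billions, as quoted in the thread (rounded).
hive_total = 24.0        # UDF, hive query, and its monthly aggregate
wikistats_total = 20.38  # stats.wikimedia.org monthly table

gap = relative_difference(hive_total, wikistats_total)
print(f"Hive-derived total is {gap:.1%} above the Wikistats figure")
```

A gap of this size is well beyond rounding, which is consistent with the suggestion that the two pipelines differ in what they count rather than in how one definition was implemented.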
On 12 March 2015 at 18:43, Oliver Keyes okeyes@wikimedia.org wrote:
Certainly; running now.
Well, again; the wikistats data that Erik refers to doesn't have any granularity within the period this dataset covers.
So I just uploaded https://commons.wikimedia.org/wiki/File:PageViewsWikipedia2015.png which shows daily page views as collected by webstatscollector since 2008. These are published in hourly projectcounts files in https://dumps.wikimedia.org/other/pagecounts-raw/ and aggregated by Wikistats per project (by week, month, day of week), then published in e.g. http://stats.wikimedia.org/EN/TablesPageViewsMonthlyOriginalCombined.htm (Wikipedia only, but webstatscollector doesn't report on any huge PV increase for other projects)
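The roll-up from hourly counts to per-project daily and monthly totals can be sketched as below. This is only an illustration under stated assumptions: the records are hypothetical in-memory tuples with made-up counts, and the sketch neither parses the real projectcounts file format nor reproduces how Wikistats itself aggregates.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical hourly records: (hour, project, views). The values are invented.
hourly = [
    (datetime(2015, 2, 1, 0), "en", 100),
    (datetime(2015, 2, 1, 1), "en", 120),
    (datetime(2015, 2, 2, 0), "en", 90),
    (datetime(2015, 3, 1, 0), "en", 80),
    (datetime(2015, 2, 1, 0), "de", 40),
]


def aggregate(records, period_key):
    """Sum views per (project, period), where period_key maps an hour
    timestamp to the period it belongs to (a date, a month, ...)."""
    totals = defaultdict(int)
    for hour, project, views in records:
        totals[(project, period_key(hour))] += views
    return dict(totals)


daily = aggregate(hourly, lambda h: h.date())
monthly = aggregate(hourly, lambda h: (h.year, h.month))

for (project, period), views in sorted(monthly.items()):
    print(project, period, views)
```

Keeping the daily totals next to the monthly ones makes it possible to compare either granularity against another source.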
My initial comment in this thread (again) is that you defined a 'legacy' definition yourself, and built a script to implement your legacy definition. Which is fine with me, the more data points the better, but it should not be confused with vetting new vs old stats. We published the old stats for many years, using what I will dub from now on the 'real legacy definition'. That real legacy definition, with all of its known deficiencies, is what will matter for our veteran users, and any discrepancy from it needs explaining.
Since it's all in your head now, and you spent a long time getting it there, I'd still recommend you finish this off and explain what has changed rather than looking to a new person to do this.
-----Original Message----- From: analytics-bounces@lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Oliver Keyes Sent: Friday, March 13, 2015 0:00 To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics. Subject: Re: [Analytics] [Technical] final pageviews QA
Hmn. And now the UDF, the hive query, and the monthly aggregate of the hive query, all disagree with http://stats.wikimedia.org/EN/TablesPageViewsMonthlyAllProjects.htm . All of the aforementioned sources come up with 24bn, not 20.38bn. Erik, how is your data constructed from the pagecounts files, exactly? It's not made clear.
I'd find it easier to believe it was an implementation problem if the UDF and hive query didn't agree. Could it be some distinction in how the subsidiary hive table is turned into stats.wikimedia.org numbers, from the "raw" count of pageviews?
In any case, this is now going somewhat beyond "Oliver, please run a quick final check on the final definition"; that check has been run and shows a pretty stable definition, without any odd day-to-day yo-yoing and a clear week/weekend pattern, which is what we expect. For additional analysis, I'd suggest either assigning someone to this task (presumably whoever is maintaining the definition now) or, of course, asking Erik if you could borrow me. I'm always happy to help out when I have the time :).
On 12 March 2015 at 18:43, Oliver Keyes okeyes@wikimedia.org wrote:
Certainly; running now.
On 12 March 2015 at 18:33, Toby Negrin tnegrin@wikimedia.org wrote:
Can we compare the monthly totals?
On Thu, Mar 12, 2015 at 3:29 PM, Oliver Keyes okeyes@wikimedia.org wrote:
Well, again; the wikistats data that Erik refers to doesn't have any granularity within the period this dataset covers. Monthly data misses sub-monthly noise - like a massive transition that only kicks in on the day-by-day.
On 12 March 2015 at 18:21, Toby Negrin tnegrin@wikimedia.org wrote:
I'm also confused. As I understand it, stats.wikimedia.org is consuming the data that is represented by the green line in your graph. Therefore we would see this drop in the wikistats data that Erik referred to, but we don't. I think we need to understand why this is so.
-Toby
On Thu, Mar 12, 2015 at 3:10 PM, Oliver Keyes okeyes@wikimedia.org wrote:
Well, I'm no longer our resident anything expert, merely /a/ anything expert :).
The "concoction", as you put it, comes from the webrequest_all_sites data that is consumed by stats.wikimedia.org's primary report - I can't speak for how the dashboard you're linking to is constructed. Perhaps you could? I doubt this is a "concoction" problem given that, as you will note if you've studied the visualisations, both the UDF and the hive query implementation (which were written by two different people, and code reviewed by two /more/ people) agree that this dramatic, unexplained and untracked drop happened. And, since we've been using the hive query implementation for all our high-level numbers for about six months, a bug of this magnitude in the /implementation/ of the definition would be....worrying.
Indeed, your report says 20B per month (again, is it drawing from the same data source as the aggregate, high-level number?) - I never claimed 1.1B a day, you did. Instead, it started off as approximately 1.1-1.2Bn, before dropping down to between 600m and 700m, where it has resided ever since. That sounds, averaged, like approximately 0.75B, no? The disadvantage of comparing a single monthly number against a more granular dataset.
On 12 March 2015 at 17:55, Erik Zachte ezachte@wikimedia.org wrote:
I'd rather see you explain this, Oliver, as our incumbent page views expert. Your concoction of legacy PV seems to suggest 'Old definition, UDF' was about 1.1B per day.
Yet http://stats.wikimedia.org/EN/TablesPageViewsMonthlyAllProjects .htm shows 20B per month, 0.75B per day
Erik
-----Original Message----- From: analytics-bounces@lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Oliver Keyes Sent: Thursday, March 12, 2015 19:38 To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics. Subject: [Analytics] [Technical] final pageviews QA
Hey all,
After the patches to the definition following the previous hand-coding run (see older threads) I've run a second set of tests. These can be seen at https://commons.wikimedia.org/wiki/File:Pageviews_QA_2.png and https://commons.wikimedia.org/wiki/File:Pageviews_QA_jittered_2 .png
There's nothing particularly shocking in the new definition; it follows the seasonal pattern that we're used to. I think we can call the new definition done, with these tweaks! It's also not as unstable as the legacy definition (good luck to whoever now has the responsibility of explaining why pageviews abruptly halved in the middle of February).
Have fun,
Oliver Keyes Research Analyst Wikimedia Foundation
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
On 12 March 2015 at 19:41, Erik Zachte ezachte@wikimedia.org wrote:
Well, again; the wikistats data that Erik refers to doesn't have any granularity within the period this dataset covers.
So I just uploaded https://commons.wikimedia.org/wiki/File:PageViewsWikipedia2015.png which shows daily page views as collected by webstatscollector since 2008 and published in hourly projectcounts files in https://dumps.wikimedia.org/other/pagecounts-raw/ and aggregated by Wikistats per project (by week, month, day of week) and published in e.g. http://stats.wikimedia.org/EN/TablesPageViewsMonthlyOriginalCombined.htm (Wikipedia only, but webstatscollector doesn't report on any huge PV increase for other projects)
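A minimal sketch of the kind of roll-up described above: summing hourly projectcounts lines into per-project totals for one day. It is not Wikistats' actual code. It assumes each line has the layout `project - count bytes`, which is an assumption about the file format rather than something confirmed in this thread, and the sample lines are made up.

```python
from collections import defaultdict

def sum_projectcounts(hourly_files):
    """Sum per-project view counts across the hourly files of one day.

    hourly_files: iterable of iterables of text lines, one per hourly file.
    Assumed line layout (hypothetical): "<project> - <count> <bytes>".
    """
    totals = defaultdict(int)
    for lines in hourly_files:
        for line in lines:
            parts = line.split()
            if len(parts) != 4:
                continue  # skip blank or malformed lines
            project, _, count, _ = parts
            totals[project] += int(count)
    return dict(totals)

# Made-up sample data for two hours.
hour_00 = ["en - 100 5000", "de - 40 2000"]
hour_01 = ["en - 120 6000", "de - 30 1500", ""]
print(sum_projectcounts([hour_00, hour_01]))  # {'en': 220, 'de': 70}
```

Weekly and monthly figures would then be sums over these daily totals, which is where any filtering differences between pipelines would carry through.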
My initial comment in this thread (again) is that you defined a 'legacy' definition yourself, and built a script to implement your legacy definition.
Actually, no; the UDF is a replica of the Hive implementation of your definition, which Christian wrote.
Which is fine with me, the more data points the better, but it should not be confused with vetting new vs old stats. The old stats we published for many years used what I will dub from now on the 'real legacy definition'. That real legacy definition, with all of its known deficiencies, is what will matter for our veteran users, and any discrepancy from it needs explaining.
Since it's all in your head now, and you spent a long time to get it there, I'd still recommend you finish this off and explain what has changed rather than looking to a new person to do this.
Unfortunately I've been moved from R&D, and don't have the time to answer endless "just one more thing..." questions. Again, if Toby wishes to ask Erik if he can borrow me, that's fine too.
-----Original Message-----
From: analytics-bounces@lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Oliver Keyes
Sent: Friday, March 13, 2015 0:00
To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
Subject: Re: [Analytics] [Technical] final pageviews QA
Hmn. And now the UDF, the hive query, and the monthly aggregate of the hive query, all disagree with http://stats.wikimedia.org/EN/TablesPageViewsMonthlyAllProjects.htm . All of the aforementioned sources come up with 24bn, not 20.38. Erik, how is your data constructed from the pagecounts files, exactly? It's not made clear.
I'd find it easier to believe it was an implementation problem if the UDF and hive query didn't agree. Could it be some distinction in how the subsidiary hive table is turned into stats.wikimedia.org numbers, from the "raw" count of pageviews?
In any case, this is now going somewhat beyond "Oliver, please run a quick final check on the final definition"; that check has been run and shows a pretty stable definition, without any odd day-to-day yo-yoing and a clear week/weekend pattern, which is what we expect. For additional analysis, I'd suggest either assigning someone to this task (presumably whoever is maintaining the definition now) or, of course, asking Erik if you could borrow me. I'm always happy to help out when I have the time :).
On 12 March 2015 at 18:43, Oliver Keyes okeyes@wikimedia.org wrote:
Certainly; running now.
On 12 March 2015 at 18:33, Toby Negrin tnegrin@wikimedia.org wrote:
Can we compare the monthly totals?
On Thu, Mar 12, 2015 at 3:29 PM, Oliver Keyes okeyes@wikimedia.org wrote:
Well, again; the wikistats data that Erik refers to doesn't have any granularity within the period this dataset covers. Monthly data misses sub-monthly noise - like a massive transition that only kicks in on the day-by-day.
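To illustrate the point about granularity, here is a small sketch with made-up numbers (in millions of views per day, not data from this thread). Two hypothetical 28-day series have the same monthly total, yet only the daily view reveals that one of them halves mid-month.

```python
def monthly_total(daily_counts):
    """Sum of daily counts, i.e. what a monthly report would show."""
    return sum(daily_counts)

def largest_day_to_day_drop(daily_counts):
    """Largest relative drop between consecutive days (0.0 if none)."""
    drops = [
        (prev - cur) / prev
        for prev, cur in zip(daily_counts, daily_counts[1:])
        if prev > 0
    ]
    return max([0.0] + drops)

# Hypothetical series, in millions of views per day.
steady = [750] * 28                 # flat all month
step = [1000] * 14 + [500] * 14     # halves after day 14

print(monthly_total(steady), monthly_total(step))   # 21000 21000
print(largest_day_to_day_drop(steady))              # 0.0
print(largest_day_to_day_drop(step))                # 0.5
```

A comparison of monthly totals alone cannot distinguish these two cases, which is why a daily series from the same source is needed to confirm or rule out the drop.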
On 12 March 2015 at 18:21, Toby Negrin tnegrin@wikimedia.org wrote:
I'm also confused. As I understand it, stats.wikimedia.org is consuming the data that is represented by the green line in your graph. Therefore we would see this drop in the wikistats data that Erik referred to, but we don't. I think we need to understand why this is so.
-Toby
On Thu, Mar 12, 2015 at 3:10 PM, Oliver Keyes okeyes@wikimedia.org wrote:
Well, I'm no longer our resident anything expert, merely /a/ anything expert :).
The "concoction", as you put it, comes from the webrequest_all_sites data that is consumed by stats.wikimedia.org's primary report - I can't speak for how the dashboard you're linking to is constructed. Perhaps you could? I doubt this is a "concoction" problem given that, as you will note if you've studied the visualisations, both the UDF and the hive query implementation (which were written by two different people, and code reviewed by two /more/ people) agree that this dramatic, unexplained and untracked drop happened. And, since we've been using the hive query implementation for all our high-level numbers for about six months, a bug of this magnitude in the /implementation/ of the definition would be....worrying.
Indeed, your report says 20B per month (again, is it drawing from the same data source as the aggregate, high-level number?) - I never claimed 1.1B a day, you did. Instead, it started off as approximately 1.1-1.2Bn, before dropping down to between 600m and 700m, where it has resided ever since. That sounds, averaged, like approximately 0.75B, no? The disadvantage of comparing a single monthly number against a more granular dataset.
Actually, no; the UDF is a replica of the Hive implementation of your definition, which Christian wrote.
Interesting. BTW I never made any page view definition, just spent time over the years to understand the real legacy definition and its deficiencies.
Then, a UDF replica of the Hive implementation of the legacy definition :)
On 12 March 2015 at 19:53, Erik Zachte ezachte@wikimedia.org wrote:
Actually, no; the UDF is a replica of the Hive implementation of your definition, which Christian wrote.
Interesting. BTW I never made any page view definition, just spent time over the years to understand the real legacy definition and its deficiencies.
Hi Oliver,
On Thu, Mar 12, 2015 at 07:44:14PM -0400, Oliver Keyes wrote:
On 12 March 2015 at 19:41, Erik Zachte ezachte@wikimedia.org wrote:
Well, again; the wikistats data that Erik refers to doesn't have any granularity within the period this dataset covers.
So I just uploaded https://commons.wikimedia.org/wiki/File:PageViewsWikipedia2015.png which shows daily page views as collected by webstatscollector since 2008 and published in hourly projectcounts files in https://dumps.wikimedia.org/other/pagecounts-raw/ and aggregated by Wikistats per project (by week, month, day of week) and published in e.g. http://stats.wikimedia.org/EN/TablesPageViewsMonthlyOriginalCombined.htm (Wikipedia only, but webstatscollector doesn't report on any huge PV increase for other projects)
My initial comment in this thread (again) is that you defined a 'legacy' definition yourself, and built a script to implement your legacy definition.
Actually, no; the UDF is a replica of the Hive implementation of your definition, which Christian wrote.
I am with Erik when he refutes it being “his” definition.
It is webstatscollector's definition, which originates (as far as git logs tell) from Domas in 2008 [1], and has seen some updates since from other people like Hampton and Diederik. I think all of them did great work.
Almost 7 years after its implementation, it still is the yardstick at wmf to measure page views by. That's a great achievement. Kudos!
Erik's wonderful reports /use/ data that is based on those definitions. And Christian only ported the webstatscollector C-implementation to Hive.
---------------------
Despite the efforts to update the webstatscollector pageview definition, I heard that technical limitations got in the way back then. In effect, MediaWiki, the WMF-hosted wikis and the shape of the corresponding request stream changed more often and more heavily than the webstatscollector definition was updated. Hence, now that the technical limitations are gone, there is a need to overhaul the pageview definition.
From my point of view, the numbers computed by the webstatscollector pageview definition and those computed by the overhauled pageview definition need not agree.
But with the webstatscollector pageview definition being the yardstick ... having an understanding within the organization where/why/how those numbers differ would not hurt.
YMMV.
Unfortunately I've been moved from R&D, and don't have the time to answer endless "just one more thing..." questions.
I have to admit that if you're not interested in doing QA, then the thread's subject of “final pageviews QA” misled me. I adjusted accordingly.
Have fun, Christian
[1] https://git.wikimedia.org/commit/analytics%2Fwebstatscollector.git/7617da88b...
Unfortunately I've been moved from R&D, and don't have the time to answer endless "just one more thing..." questions.
I have to admit that if you're not interested in doing QA, then the thread's subject of “final pageviews QA” misled me. I adjusted accordingly.
Gentlemen, gentlemen :) I love you both, I think this is a syntax confusion. I believe what was meant was "final QA" as in Oliver's last QA of this data before he can focus more fully on his new role. Not "final QA" as in the last time the pageview definition needs to be QA-ed. I think the pageview definition will cease needing QA when our site stops changing, at which point we'll have more problems than the pageview definition :)
So the new title is more correct regardless, hear hear for the ever more correct pageview definition! May it yardstick-ize many yards.
Hi Oliver,
On Fri, Mar 13, 2015 at 09:28:30AM -0400, Dan Andreescu wrote:
Unfortunately I've been moved from R&D, and don't have the time to answer endless "just one more thing..." questions.
I have to admit that if you're not interested in doing QA, then the thread's subject of “final pageviews QA” misled me. I adjusted accordingly.
Gentlemen, gentlemen :) I love you both, I think this is [...]
I guess Dan's response to my response above means that the tongue-in-cheek connotation, obvious to me, did not come across very well :-(
Oliver, if you felt attacked, I am sorry. I did not mean to.
Have fun, Christian
On February 11, https://www.mediawiki.org/wiki/MediaWiki_1.25/wmf16#CentralNotice was deployed. The deprecation/decrease of Special:BannerRandom and Special:RecordImpression can easily justify a decrease of hundreds of millions of requests. Probably wikistats was already filtering them.
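For illustration only, a filter of the kind suggested above might look like the sketch below. It is not the wikistats or webstatscollector logic; the helper names and sample URLs are hypothetical, and it assumes the two special pages are requested either as /wiki/<title> or via a title= query parameter.

```python
from urllib.parse import urlsplit, parse_qs, unquote

# Titles named in the message above; treating them as non-pageviews is the assumption.
EXCLUDED_TITLES = {"Special:BannerRandom", "Special:RecordImpression"}

def page_title(url):
    """Extract the page title from a /wiki/ path or a title= query parameter."""
    parts = urlsplit(url)
    if parts.path.startswith("/wiki/"):
        return unquote(parts.path[len("/wiki/"):])
    titles = parse_qs(parts.query).get("title")
    return titles[0] if titles else None

def is_banner_request(url):
    """True if the request targets one of the CentralNotice special pages."""
    return page_title(url) in EXCLUDED_TITLES

# Hypothetical sample requests.
requests = [
    "https://en.wikipedia.org/wiki/Main_Page",
    "https://en.wikipedia.org/wiki/Special:BannerRandom?uselang=en",
    "https://meta.wikimedia.org/w/index.php?title=Special:RecordImpression&result=hide",
]
pageviews = [u for u in requests if not is_banner_request(u)]
print(len(pageviews))  # 1
```

If one pipeline applies such a filter and another does not, a change in banner traffic would show up as a drop in only one of the two series.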
Erik Zachte, 13/03/2015 00:41:
That real legacy definition, with all of its known deficiencies, is what will matter for our veteran users and any discrepancy from there needs explaining.
I agree that what's needed here is a graph comparing current wikistats (and reportcard?) to future output with the new figures. The new definition is not final until it's "live". :)
I have trouble understanding the current and expected distribution pipeline of pageviews data and figures. It would be nice to have an overview somewhere, even just a list of links to the different steps. https://www.mediawiki.org/wiki/Analytics/Pageviews/Webstatscollector
Nemo
I have trouble understanding the current and expected distribution pipeline of pageviews data and figures. It would be nice to have an overview somewhere, even just a list of links to the different steps. https://www.mediawiki.org/wiki/Analytics/Pageviews/Webstatscollector
This might help a little bit: https://wikitech.wikimedia.org/wiki/Category:Data_stream
Hi Oliver,
On Thu, Mar 12, 2015 at 07:00:07PM -0400, Oliver Keyes wrote:
And now the UDF, the hive query, and the monthly aggregate of the hive query, all disagree with http://stats.wikimedia.org/EN/TablesPageViewsMonthlyAllProjects.htm . All of the aforementioned sources come up with 24bn, not 20.38.
(Assuming "hive query" means the Hive implementation of the webstatscollector pageview definition.)
I doubt that you're comparing apples to apples.
Taking a quick (and known to not necessarily be fully exact) shot at reproducing numbers, I can basically verify Erik's numbers [1][2] from the above URL.
So in contrast to what you claim, Erik's reports and the Hive-implementation of webstatscollector agree.
Since you said “monthly aggregates”, but referenced a “normalized” report of Erik's, I somewhat get the feeling you're comparing apples to oranges.
So (I know I am sounding like a broken record), please, Oliver, instead of claiming things without giving us a chance to reproduce them, show how you ended up with your numbers and conclusions.
Have fun, Christian
P.S.:
Erik, how is your data constructed from the pagecounts files, exactly? It's not made clear.
Oh, it is made clear. Search for “Archived input files” on that page that you linked :-) (And manually filter to the projects you care about and aggregate to the period of interest)
If you also challenge the way Erik arrives at those input files, you can look at the projectcounts files of pagecounts-raw or pagecounts-all-sites and aggregate from them directly.
The bash pipelines in the footnotes below are a rough, "order of magnitude" shot at that.
[1] For example running on stat1002:
Wikibooks for February 2015:
_________________________________________________________________
qchris@stat1002 // jobs: 0 // time: 12:34:55 // exit code: 0
cwd: ~
echo $(( ( $(grep '^[a-zA-Z_-]*.b ' /mnt/hdfs/wmf/data/archive/pagecounts-raw/2015/2015-02/projectcounts-* | cut -f 3 -d ' ' | tr '\n' +)0 ) * 30 / 28 / 1000000 ))
43

Wiktionary for February 2015:
_________________________________________________________________
qchris@stat1002 // jobs: 0 // time: 12:35:51 // exit code: 0
cwd: ~
echo $(( ( $(grep '^[a-zA-Z_-]*.d ' /mnt/hdfs/wmf/data/archive/pagecounts-raw/2015/2015-02/projectcounts-* | cut -f 3 -d ' ' | tr '\n' +)0 ) * 30 / 28 / 1000000 ))
244

[...]

Commons for February 2015:
_________________________________________________________________
qchris@stat1002 // jobs: 0 // time: 12:36:26 // exit code: 0
cwd: ~
echo $(( ( $(grep '^commons.m ' /mnt/hdfs/wmf/data/archive/pagecounts-raw/2015/2015-02/projectcounts-* | cut -f 3 -d ' ' | tr '\n' +)0 ) * 30 / 28 / 1000000 ))
329

[2] Note that the reports suffer the usual pagecounts-raw confusion around “.mw” being across projects. But that is just a column header being wrong. The numbers themselves are fine:
_________________________________________________________________
qchris@stat1002 // jobs: 0 // time: 12:40:22 // exit code: 0
cwd: ~
echo $(( ( $(grep '^[a-zA-Z_-]*.mw ' /mnt/hdfs/wmf/data/archive/pagecounts-raw/2015/2015-02/projectcounts-* | grep -v '(commons|meta|incubator|species|strategy|outreach|usability|quality)' | cut -f 3 -d ' ' | tr '\n' +)0 ) * 30 / 28 / 1000000 ))
6972
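
For anyone who wants to redo this without squinting at the one-liners: they all sum the third space-separated field of the matching projectcounts lines, scale the total from the days in the month to a 30-day month, and print millions. Below is a minimal sketch of that same pattern. The function name `normalized_millions` and its argument order are made up for illustration, it assumes the request count sits in the third field (as the `cut -f 3` above implies), and it is not the script behind the wikistats reports.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original thread): sum the request counts
# of all lines whose first field matches a project regex, scale the total to
# a 30-day month, and print the result in millions (truncated).
#
# usage: normalized_millions PROJECT_REGEX DAYS_IN_MONTH FILE...
normalized_millions() {
    local project_regex=$1 days_in_month=$2
    shift 2
    # The regex is passed through the environment so that awk does not
    # apply escape-sequence processing to it, as it would with -v.
    PROJECT_RE=$project_regex awk -v days="$days_in_month" '
        # Assumption: field 1 is the project code, field 3 the request count.
        $1 ~ ENVIRON["PROJECT_RE"] { total += $3 }
        END { printf "%d\n", int(total * 30 / days / 1000000) }
    ' "$@"
}
```

Something like `normalized_millions '^[a-zA-Z_-]+[.]b$' 28 /mnt/hdfs/wmf/data/archive/pagecounts-raw/2015/2015-02/projectcounts-*` would roughly correspond to the Wikibooks line in [1]. Note that this regex escapes the dot, whereas the greps above leave it unescaped (so it matches any character), so the results could differ slightly.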