On 12 March 2015 at 19:41, Erik Zachte ezachte@wikimedia.org wrote:
Well, again; the wikistats data that Erik refers to doesn't have any granularity within the period this dataset covers.
So I just uploaded https://commons.wikimedia.org/wiki/File:PageViewsWikipedia2015.png, which shows daily page views as collected by webstatscollector since 2008. These are published in hourly projectcounts files at https://dumps.wikimedia.org/other/pagecounts-raw/, aggregated by Wikistats per project (by week, month, day of week), and published in e.g. http://stats.wikimedia.org/EN/TablesPageViewsMonthlyOriginalCombined.htm (Wikipedia only, but webstatscollector doesn't report any huge PV increase for other projects).
My initial comment in this thread (again) is that you defined a 'legacy' definition yourself, and built a script to implement your legacy definition.
Actually, no; the UDF is a replica of the Hive implementation of your definition, which Christian wrote.
Which is fine with me, the more data points the better, but it should not be confused with vetting new vs old stats. The old stats, which we published for many years, used what I will dub from now on the 'real legacy definition'. That real legacy definition, with all of its known deficiencies, is what will matter for our veteran users, and any discrepancy from it needs explaining.
Since it's all in your head now, and you spent a long time to get it there, I'd still recommend you finish this off and explain what has changed rather than looking to a new person to do this.
Unfortunately I've been moved from R&D, and don't have the time to answer endless "just one more thing..." questions. Again, if Toby wishes to ask Erik if he can borrow me, that's fine too.
-----Original Message-----
From: analytics-bounces@lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Oliver Keyes
Sent: Friday, March 13, 2015 0:00
To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
Subject: Re: [Analytics] [Technical] final pageviews QA
Hmm. And now the UDF, the hive query, and the monthly aggregate of the hive query all disagree with http://stats.wikimedia.org/EN/TablesPageViewsMonthlyAllProjects.htm . All of the aforementioned sources come up with 24bn, not 20.38bn. Erik, how is your data constructed from the pagecounts files, exactly? It's not made clear.
I'd find it easier to believe it was an implementation problem if the UDF and hive query didn't agree. Could it be some distinction in how the subsidiary hive table is turned into stats.wikimedia.org numbers, from the "raw" count of pageviews?
In any case, this is now going somewhat beyond "Oliver, please run a quick final check on the final definition"; that check has been run and shows a pretty stable definition, without any odd day-to-day yo-yoing and a clear week/weekend pattern, which is what we expect. For additional analysis, I'd suggest either assigning someone to this task (presumably whoever is maintaining the definition now) or, of course, asking Erik if you could borrow me. I'm always happy to help out when I have the time :).
On 12 March 2015 at 18:43, Oliver Keyes okeyes@wikimedia.org wrote:
Certainly; running now.
On 12 March 2015 at 18:33, Toby Negrin tnegrin@wikimedia.org wrote:
Can we compare the monthly totals?
On Thu, Mar 12, 2015 at 3:29 PM, Oliver Keyes okeyes@wikimedia.org wrote:
Well, again; the wikistats data that Erik refers to doesn't have any granularity within the period this dataset covers. Monthly data misses sub-monthly noise - like a massive transition that only kicks in on the day-by-day.
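To illustrate the granularity point, here is a minimal Python sketch. The numbers are made up: the 1150M and 650M daily levels, the flat 900M level and the mid-month switch are assumptions chosen for illustration, not values from the dataset, and `step_series` and `monthly_summary` are hypothetical helper names. It shows that a series with an abrupt mid-month drop and a completely flat series can produce the same monthly total.

```python
def step_series(days, switch_day, before, after):
    """Daily counts at `before` for days 0..switch_day-1, then at `after`."""
    return [before if day < switch_day else after for day in range(days)]


def monthly_summary(daily_views):
    """Return (monthly_total, mean_per_day) for a list of daily counts."""
    total = sum(daily_views)
    return total, total / len(daily_views)


# Illustrative values only, in millions of page views per day.
dropping = step_series(days=28, switch_day=14, before=1150, after=650)
flat = [900] * 28

# Both series have the same monthly total and the same daily mean...
print(monthly_summary(dropping))
print(monthly_summary(flat))

# ...but only the daily data reveals the drop.
print(max(dropping) - min(dropping), max(flat) - min(flat))
```

A single monthly figure is the same for both series, so it cannot confirm or rule out a drop of this kind; only the daily range differs.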
On 12 March 2015 at 18:21, Toby Negrin tnegrin@wikimedia.org wrote:
I'm also confused. As I understand it, stats.wikimedia.org is consuming the data that is represented by the green line in your graph. Therefore we would see this drop in the wikistats data that Erik referred to, but we don't. I think we need to understand why this is so.
-Toby
On Thu, Mar 12, 2015 at 3:10 PM, Oliver Keyes okeyes@wikimedia.org wrote:
Well, I'm no longer our resident anything expert, merely /a/ anything expert :).
The "concoction", as you put it, comes from the webrequest_all_sites data that is consumed by stats.wikimedia.org's primary report - I can't speak for how the dashboard you're linking to is constructed. Perhaps you could? I doubt this is a "concoction" problem given that, as you will note if you've studied the visualisations, both the UDF and the hive query implementation (which were written by two different people, and code reviewed by two /more/ people) agree that this dramatic, unexplained and untracked drop happened. And, since we've been using the hive query implementation for all our high-level numbers for about six months, a bug of this magnitude in the /implementation/ of the definition would be....worrying.
Indeed, your report says 20B per month (again, is it drawing from the same data source as the aggregate, high-level number?) - I never claimed 1.1B a day, you did. Instead, it started off as approximately 1.1-1.2Bn, before dropping down to between 600m and 700m, where it has resided ever since. That sounds, averaged, like approximately 0.75B, no? The disadvantage of comparing a single monthly number against a more granular dataset.
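For reference, converting a monthly total to a per-day mean is simple arithmetic, sketched below in Python. The thread does not say which month the 20B figure refers to, so February 2015 is an assumption here, and `mean_daily_views` is a hypothetical helper name. The result depends on the month length, so any per-day figure derived this way is only a rough average.

```python
import calendar


def mean_daily_views(monthly_total, year, month):
    """Divide a monthly total by the number of days in that calendar month."""
    days = calendar.monthrange(year, month)[1]  # (weekday of day 1, day count)
    return monthly_total / days


# Totals in billions; the month is an assumption, not stated in the thread.
print(round(mean_daily_views(20.0, 2015, 2), 3))  # wikistats-style total
print(round(mean_daily_views(24.0, 2015, 2), 3))  # UDF / hive-style total
```

Neither value says anything about how the views were distributed within the month.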
On 12 March 2015 at 17:55, Erik Zachte ezachte@wikimedia.org wrote:
> I'd rather see you explain this, Oliver, as our incumbent page views expert.
> Your concoction of legacy PV seems to suggest 'Old definition, UDF' was about 1.1B per day.
>
> Yet http://stats.wikimedia.org/EN/TablesPageViewsMonthlyAllProjects.htm shows 20B per month, 0.75B per day
>
> Erik
>
> -----Original Message-----
> From: analytics-bounces@lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Oliver Keyes
> Sent: Thursday, March 12, 2015 19:38
> To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
> Subject: [Analytics] [Technical] final pageviews QA
>
> Hey all,
>
> After the patches to the definition following the previous hand-coding run (see older threads) I've run a second set of tests. These can be seen at https://commons.wikimedia.org/wiki/File:Pageviews_QA_2.png and https://commons.wikimedia.org/wiki/File:Pageviews_QA_jittered_2.png
>
> There's nothing particularly shocking in the new definition; it follows the seasonal pattern that we're used to. I think we can call the new definition done, with these tweaks! It's also not as unstable as the legacy definition (good luck to whoever now has the responsibility of explaining why pageviews abruptly halved in the middle of February).
>
> Have fun,
> --
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation
>
> _______________________________________________
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics