Analytics April 2015

analytics@lists.wikimedia.org

51 participants
46 discussions

X-Analytics is NULL on part of the data in webrequest
by Madhumitha Viswanathan 30 Apr '15

When I was working on related stuff, I found that the value of x_analytics_map is null on the wmf.webrequest table in stat1002 when is_pageview is filtered for true and agent_type is user. I'm wondering why that would be.

These are the things I found: for 28th April 2015, of 741,858,511 requests, 28,827,374 have x_analytics set to '-' and x_analytics_map set to null. That's about 3.9% of all requests that day. You can find these counts by running something like this in Hive on the server:

SELECT count(*) FROM webrequest WHERE x_analytics_map IS NULL AND agent_type = 'user' AND is_pageview = TRUE AND YEAR = 2015 AND MONTH = 4 AND DAY = 28;

Does anyone have ideas on why this might be and if something underlying is broken?

--Madhu :)
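As a quick sanity check of the percentage quoted above, here is a minimal Python sketch that recomputes the share from the two counts in the message. The function name null_share is made up for illustration; only the two counts come from the message.

```python
def null_share(null_count: int, total_count: int) -> float:
    """Return the fraction of requests whose x_analytics_map is null."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    return null_count / total_count


# Counts reported for 2015-04-28 in the message above.
share = null_share(28_827_374, 741_858_511)
print(f"{share:.1%}")  # prints 3.9%
```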

Daily Pageviews in Dashiki showing no recent data
by Christian Aistleitner 30 Apr '15

Hi Analytics dev team,

just a heads up that it has been a week since the 20150409-160000 file failed to get generated for pagecounts-all-sites (and pagecounts-raw) [1].

To ease data quality assurances and avoid faulty aggregates, the pageview aggregator scripts that do the aggregation for dashiki's “Reader / Daily Pageviews” block for a week on missing data (unless they are told that, for a given day, missing data is ok).

For the above hourly pagecounts-all-sites file, this week of blocking has now passed without action. Hence, the aggregator scripts will start aggregating again (to some degree), but the undeclared hole for 2015-04-09 in the data will naturally start to bubble up.

If that hour's file cannot get generated, adding this date to the BAD_DATES.csv of the aggregator data repository will unblock the aggregator cron job and make the weekly and monthly aggregates consider 2015-04-09 as a day without data.

If that hour's file gets generated, be aware that the aggregator by default only automatically backfills for a week. So from today on, you need to explicitly run the script to backfill for 2015-04-09.

Have fun,
Christian

P.S.: Since I guess the question of monitoring will arise ... the missing pagecounts file has alerted people at least twice by email. The subsequent aggregator blocking has been logged. But you can add yourself in the MAILTO of the aggregator cron at modules/statistics/manifests/aggregator.pp in puppet, if you want an additional notification for that.

[1] http://dumps.wikimedia.org/other/pagecounts-all-sites/2015/2015-04/
    http://dumps.wikimedia.org/other/pagecounts-raw/2015/2015-04/

--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstrasze 6a/3     Email:  christian(a)quelltextlich.at
4293 Gutau, Austria          Phone:  +43 7946 / 20 5 81
                             Fax:    +43 7946 / 20 5 81
                             Homepage: http://quelltextlich.at/
---------------------------------------------------------------
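The blocking and backfill rules described in this message can be sketched roughly as follows. This is not the aggregator's actual code: the function names, the exact boundary handling of the one-week window, and the representation of BAD_DATES.csv as a set of dates are all assumptions made for illustration.

```python
from datetime import date, timedelta

# Assumption: both the blocking period and the automatic backfill window
# are exactly seven days, as the message says "a week" for each.
ONE_WEEK = timedelta(days=7)


def is_blocked(missing_day: date, today: date, bad_dates: set) -> bool:
    """Return True while aggregation should still wait for missing data."""
    if missing_day in bad_dates:
        # A day declared bad (e.g. listed in BAD_DATES.csv) never blocks.
        return False
    return today - missing_day < ONE_WEEK


def needs_manual_backfill(day: date, today: date) -> bool:
    """Return True once a day has left the automatic backfill window."""
    return today - day >= ONE_WEEK


hole = date(2015, 4, 9)
print(is_blocked(hole, date(2015, 4, 12), set()))      # still inside the week
print(is_blocked(hole, date(2015, 4, 12), {hole}))     # declared bad, so unblocked
print(needs_manual_backfill(hole, date(2015, 4, 16)))  # week has passed
```

Declaring the date as bad and regenerating the file are alternatives: the first accepts the hole, the second fills it but requires an explicit backfill run once the week is over.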

Predicting movie success using Wikipedia traffic
by Alex Harrison 30 Apr '15

Hi Wikipedia analysts, I work in product development for the Insights team at Way To Blue. We provide reports to major film studios to provide them with information about how their films are doing on social platforms over the course of their marketing campaign, up to a film's release in cinemas. We typically focus on social media volumes (and traditionally split by Twitter, Forums, Blogs and News), however I just stumbled across the free Wikipedia article traffic statistics site (http://stats.grok.se/ ) and am thinking that daily Wikipedia traffic could be an interesting additional metric for us to report. I have a few questions about daily Wikipedia traffic data: - Are you affiliated with the site above, or do you independently cover traffic volume? - Do you know if it is possible to attribute a source country to traffic count? E.g. I can see that the daily traffic on the Tomorrowland (film) page was 5773 yesterday, however it would be interesting what proportion of that is US vs. UK vs. etc. We have never used Wikipedia traffic as a source, so if you are the wrong team to be asking about this please let me know! Best regards Alex

Hadoop Cluster Downtime
by Andrew Otto 29 Apr '15

Hi all!

CDH 5.4 is out [1] and we’d like to upgrade. We are doing this now, rather than later, because there is an important Parquet/Hive related bug that has been fixed in this version [2]. This upgrade will include Spark 1.3, which should at least make one researcher happy.

To do this upgrade, I need to schedule some downtime for Hadoop. I’d like to do this on Monday May 4th. I expect the upgrade to take me no more than an hour or two, but just to be safe I’d like to schedule the downtime for the whole day.

If anyone has critical things that they absolutely have to run on Monday, let me know now and I will find another day.

Thanks!
-Ao

[1] http://blog.cloudera.com/blog/2015/04/cloudera-enterprise-5-4-is-released/
[2] https://issues.apache.org/jira/browse/HIVE-9482

[Technical] WMF-Last-Access
by Dan Andreescu 29 Apr '15

Brandon,

I had a "monday morning quarterback" moment (don't worry, it's not too bad).

The key we chose is "WMF-Last-Access" and it seems to me that's using a lot of unnecessary network bandwidth with its verbosity. We could come up with something shorter (I cc-ed Analytics in case anyone has an opinion) and save our network. My proposal: simply "last".

For those unfamiliar, we're talking about this change:
https://gerrit.wikimedia.org/r/#/c/196009/14/templates/varnish/last-access.…
to this header:
https://wikitech.wikimedia.org/wiki/X-Analytics

s1-analytics-slave under the weather
by Sean Pringle 28 Apr '15

Hi! s1-analytics-slave has been struggling recently (SUL finalization load, plus some other stuff). I've had to pause EventLogging replication there tonight in order to let S1 catch up, as well as do some table maintenance. I estimate 24 hours impact. analytics-store is not affected. BR Sean

udp2log shutdown (for analytics instances) next week
by Andrew Otto 27 Apr '15

Hi all!

Now that all data that is generated by udp2log is also being generated by the Analytics Cluster, we are finally ready to turn off analytics udp2log instances.

I will start with the ones that are used to generate the logs on stat1002 at /a/squid/archive. The (identical) cluster generated logs can be found on stat1002 at /a/log/webrequest/archive. I will paste the contents of the README file in /a/squid/archive describing the differences at the bottom of this email.

If you use any of the logs in /a/squid/archive for regular statistics, you will need to switch your code to use files in /a/log/webrequest/archive instead.

I plan to start turning off udp2log instances on Monday April 27th (that’s next week!).

From the README:

[@stat1002:/a/squid/archive] $ cat README.migrate-to-hive.2015-02-17
***********************************************************************
*                                                                     *
* This directory will run stale once udp2log will get turned off.     *
* Please use the corresponding TSVs from /a/log/webrequest/archive/   *
* instead.                                                            *
*                                                                     *
***********************************************************************

The TSV files in this directory underneath /a/squid/archive get generated by udp2log and suffer from

* Sub-par data quality (E.g.: udp2log had an inherent loss).
* Lack of a way to backfill/fix data.
* Some files consuming https requests twice, which made filtering necessary.
* Confusing naming scheme, where each file covered 24 hours, but not midnight to midnight, but ~06:30 previous day to ~06:30 current day.

The new TSVs at /a/log/webrequest/archive/ contain the same information but get generated by Hive, and address the above four issues:

* By using Hive's webrequest table as input, the inherent loss is gone. Also statistics on the hour's data quality are available.
* Hive data allows to backfill/fix data.
* Only data from the varnishes gets picked up. So https traffic no longer gets duplicated.
* The files now cover 24 hours from midnight to midnight. No more stitching/cutting is needed to get the logs for a given day.

Please migrate to using the Hive-generated TSVs from /a/log/webrequest/archive/

Thanks! I’ll keep you updated as this happens.

-Andrew Otto
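To illustrate why the old naming scheme required stitching, here is a small Python sketch. It assumes each udp2log file is labelled with the day its window ends on and covers exactly 06:30 of the previous day to 06:30 of that day; the README only gives approximate times, and the function names are hypothetical.

```python
from datetime import date, datetime, time, timedelta
from typing import List, Tuple

# Assumption: the README says "~06:30", so the exact rollover time is a guess.
ROLLOVER = time(6, 30)


def udp2log_window(file_day: date) -> Tuple[datetime, datetime]:
    """Return the (start, end) span covered by the old file for file_day."""
    end = datetime.combine(file_day, ROLLOVER)
    return end - timedelta(days=1), end


def udp2log_files_for(day: date) -> List[date]:
    """Return the old files needed to cover `day` from midnight to midnight."""
    # The file for `day` holds 00:00 to 06:30, the next day's file the rest.
    return [day, day + timedelta(days=1)]


for file_day in udp2log_files_for(date(2015, 4, 9)):
    start, end = udp2log_window(file_day)
    print(file_day, start, end)
```

Under the same assumption, the Hive-generated files need no such lookup, since one file already spans midnight to midnight.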

Re: [Analytics] Wikimedia Foundation quarterly reviews
by Tilman Bayer 27 Apr '15

And here are the minutes and slides from the remaining two quarterly review meetings from this round:

Legal, Finance, Talent & Culture (HR), Communications:
https://meta.wikimedia.org/wiki/WMF_Metrics_and_activities_meetings/Quarter…

Analytics, User Experience, Team Practices, Product Management
https://meta.wikimedia.org/wiki/WMF_Metrics_and_activities_meetings/Quarter…

Much of the content of the slides will also (in somewhat more polished form) feature in the WMF quarterly report, which is planned to be published by May 15.

On Tue, Apr 21, 2015 at 9:53 PM, Tilman Bayer <tbayer(a)wikimedia.org> wrote:
> Hi all,
>
> the quarterly reviews for the past quarter (January-March 2015) took
> place last week. Minutes and slides are now available for the
> following meetings:
>
> Community Engagement, Advancement (Fundraising and Fundraising Tech):
> https://meta.wikimedia.org/wiki/WMF_Metrics_and_activities_meetings/Quarter…
>
> Mobile Web, Mobile Apps, Wikipedia Zero:
> https://meta.wikimedia.org/wiki/WMF_Metrics_and_activities_meetings/Quarter…
>
> Parsoid, Services, MediaWiki Core, Tech Ops, Release Engineering,
> Multimedia, Labs, Engineering Community:
> https://meta.wikimedia.org/wiki/WMF_Metrics_and_activities_meetings/Quarter…
>
> Editing (covering VisualEditor), Collaboration (covering Flow),
> Language Engineering:
> https://meta.wikimedia.org/wiki/WMF_Metrics_and_activities_meetings/Quarter…
>
>
> As mentioned in February [1], the quarterly review process has been
> extended to basically all groups in the Foundation since Lila took the
> helm last year, and it was further refined this quarter, reducing the
> number of meetings to six overall, each combining several areas.
> Minutes and slides from the remaining two meetings should come out
> soon, too. (And naturally, all the engineering team names above refer
> to the structure before the reorganization that has just been
> announced.)
>
> [1] https://lists.wikimedia.org/pipermail/wikimedia-l/2015-February/076835.html
>
> On Wed, Dec 19, 2012 at 6:49 PM, Erik Moeller <erik(a)wikimedia.org> wrote:
>>
>> Hi folks,
>>
>> to increase accountability and create more opportunities for course
>> corrections and resourcing adjustments as necessary, Sue's asked me
>> and Howie Fung to set up a quarterly project evaluation process,
>> starting with our highest priority initiatives. These are, according
>> to Sue's narrowing focus recommendations which were approved by the
>> Board [1]:
>>
>> - Visual Editor
>> - Mobile (mobile contributions + Wikipedia Zero)
>> - Editor Engagement (also known as the E2 and E3 teams)
>> - Funds Dissemination Committe and expanded grant-making capacity
>>
>> I'm proposing the following initial schedule:
>>
>> January:
>> - Editor Engagement Experiments
>>
>> February:
>> - Visual Editor
>> - Mobile (Contribs + Zero)
>>
>> March:
>> - Editor Engagement Features (Echo, Flow projects)
>> - Funds Dissemination Committee
>>
>> We’ll try doing this on the same day or adjacent to the monthly
>> metrics meetings [2], since the team(s) will give a presentation on
>> their recent progress, which will help set some context that would
>> otherwise need to be covered in the quarterly review itself. This will
>> also create open opportunities for feedback and questions.
>>
>> My goal is to do this in a manner where even though the quarterly
>> review meetings themselves are internal, the outcomes are captured as
>> meeting minutes and shared publicly, which is why I'm starting this
>> discussion on a public list as well. I've created a wiki page here
>> which we can use to discuss the concept further:
>>
>> https://meta.wikimedia.org/wiki/Metrics_and_activities_meetings/Quarterly_r…
>>
>> The internal review will, at minimum, include:
>>
>> Sue Gardner
>> myself
>> Howie Fung
>> Team members and relevant director(s)
>> Designated minute-taker
>>
>> So for example, for Visual Editor, the review team would be the Visual
>> Editor / Parsoid teams, Sue, me, Howie, Terry, and a minute-taker.
>>
>> I imagine the structure of the review roughly as follows, with a
>> duration of about 2 1/2 hours divided into 25-30 minute blocks:
>>
>> - Brief team intro and recap of team's activities through the quarter,
>> compared with goals
>> - Drill into goals and targets: Did we achieve what we said we would?
>> - Review of challenges, blockers and successes
>> - Discussion of proposed changes (e.g. resourcing, targets) and other
>> action items
>> - Buffer time, debriefing
>>
>> Once again, the primary purpose of these reviews is to create improved
>> structures for internal accountability, escalation points in cases
>> where serious changes are necessary, and transparency to the world.
>>
>> In addition to these priority initiatives, my recommendation would be
>> to conduct quarterly reviews for any activity that requires more than
>> a set amount of resources (people/dollars). These additional reviews
>> may however be conducted in a more lightweight manner and internally
>> to the departments. We’re slowly getting into that habit in
>> engineering.
>>
>> As we pilot this process, the format of the high priority reviews can
>> help inform and support reviews across the organization.
>>
>> Feedback and questions are appreciated.
>>
>> All best,
>> Erik
>>
>> [1] https://wikimediafoundation.org/wiki/Vote:Narrowing_Focus
>> [2] https://meta.wikimedia.org/wiki/Metrics_and_activities_meetings
>> --
>> Erik Möller
>> VP of Engineering and Product Development, Wikimedia Foundation
>>
>> Support Free Knowledge: https://wikimediafoundation.org/wiki/Donate
>>
>> _______________________________________________
>> Wikimedia-l mailing list
>> Wikimedia-l(a)lists.wikimedia.org
>> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l
>
>
>
>
> --
> Tilman Bayer
> Senior Analyst
> Wikimedia Foundation
> IRC (Freenode): HaeB

--
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB

[Announce] a new release of Pageviews data
by Oliver Keyes 27 Apr '15

Hey all, We've just released a count of pageviews to the English-language Wikipedia from 2015-03-16T00:00:00 to 2015-04-25T15:59:59, grouped by timestamp (down to a one-second resolution level) and site (mobile or desktop). The smallest number of events in a group is 645; because of this, we are confident there should not be privacy implications of releasing this data. We checked with legal first ;p. If you're interested in getting your mitts on it, you can find it at DataHub (http://datahub.io/dataset/english-wikipedia-pageviews-by-second) or FigShare (http://figshare.com/articles/English_Wikipedia_pageviews_by_second/1394684) -- Oliver Keyes Research Analyst Wikimedia Foundation
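For anyone wanting to work with per-second counts like these, a minimal Python sketch of rolling them up to hourly totals per site might look as follows. The column names (timestamp, site, pageviews) and the sample rows are invented for illustration and may not match the layout of the released file.

```python
import csv
import io
from collections import Counter

# Invented sample rows; the real file's layout may differ.
SAMPLE = """timestamp,site,pageviews
2015-03-16T00:00:00,desktop,1500
2015-03-16T00:00:00,mobile,900
2015-03-16T00:00:01,desktop,1450
2015-03-16T01:00:00,mobile,700
"""


def hourly_totals(rows) -> Counter:
    """Sum per-second pageview counts into (hour, site) buckets."""
    totals = Counter()
    for row in rows:
        # "2015-03-16T00:00:01"[:13] gives the hour prefix "2015-03-16T00".
        hour = row["timestamp"][:13]
        totals[(hour, row["site"])] += int(row["pageviews"])
    return totals


totals = hourly_totals(csv.DictReader(io.StringIO(SAMPLE)))
for (hour, site), views in sorted(totals.items()):
    print(hour, site, views)
```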

[Technical] X-analytics header mobile apps items
by Nuria Ruiz 23 Apr '15

Team: Would you be so kind as to document the mobile apps info that should be present on X-analytics header for apps requests? https://wikitech.wikimedia.org/wiki/X-Analytics I thought the uuid that identifies a unique user of the app was "uuid" but I also see a "wmfuuid" and I am not sure if these two are the same. Need to clarify this to be able to calculate mobile sessions. Many thanks, Nuria
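To see which keys actually show up on app requests, a small parser for the header value can help. The sketch below assumes the header is a semicolon-separated list of key=value pairs, with '-' standing for an empty header; the sample value is invented, and the code does not settle whether "uuid" and "wmfuuid" mean the same thing.

```python
from typing import Dict


def parse_x_analytics(value: str) -> Dict[str, str]:
    """Parse an X-Analytics value such as 'k1=v1;k2=v2' into a dict."""
    if not value or value == "-":
        # Assumption: '-' marks an absent header, so there is nothing to parse.
        return {}
    pairs = {}
    for item in value.split(";"):
        key, sep, val = item.partition("=")
        if sep and key:
            pairs[key.strip()] = val.strip()
    return pairs


# Invented sample value, only to show both keys side by side.
sample = parse_x_analytics("uuid=abc123;wmfuuid=def456")
print(sorted(sample))           # which identifier keys are present
print(parse_x_analytics("-"))   # empty header yields an empty dict
```

Counting how often each key appears across app requests would show whether both identifiers are in use, which is the open question in this message.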
