Hi,
just a quick heads-up that Ops are about to add a “php” key to the
X-Analytics header (i.e., for sampled-1000 logs, Hive, ...):
https://gerrit.wikimedia.org/r/#/c/156793/
This key will indicate which PHP implementation served the request [1].
Planned deployment is between 2014-09-01 and 2014-09-02.
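For consumers of these logs: X-Analytics is a semicolon-separated list of key=value pairs, so a log-processing script could split it out roughly like this (a sketch; the sample header value below is illustrative, not real data):

```python
# Parse an X-Analytics header value ("key1=val1;key2=val2;...") into a
# dict. The sample value is made up for illustration.
def parse_x_analytics(value):
    fields = {}
    for part in value.split(";"):
        part = part.strip()
        if not part:
            continue
        key, _, val = part.partition("=")
        fields[key] = val
    return fields

print(parse_x_analytics("php=hhvm;https=1"))  # {'php': 'hhvm', 'https': '1'}
```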
Have fun,
Christian
[1] https://wikitech.wikimedia.org/wiki/X-Analytics#Keys
--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstrasze 6a/3 Email: christian(a)quelltextlich.at
4293 Gutau, Austria Phone: +43 7946 / 20 5 81
Fax: +43 7946 / 20 5 81
Homepage: http://quelltextlich.at/
---------------------------------------------------------------
Hello,
We kicked off our next sprint this morning, building on the release
planning done over the last two weeks. The sprint status is here:
http://sb.wmflabs.org/t/analytics-developers/2014-10-30/
The focus of this sprint is working on the backend in preparation to
display new data in Vital Signs.
Bug ID  Component     Summary                                                        Points
72740   Dashiki       Story: Vital Signs User selects the Daily Pageviews metrics        34
72741   EventLogging  List tables/schemas with data retention needs                       0
72642   EventLogging  Story: Identify and direct the purging of Event logging raw
                      logs older than 90 days in stat1002                                 0
67450   EventLogging  database consumer could batch inserts (sometimes)                  34
72746   Wikimetrics   Story: Wikimetrics User tags a cohort using a pre-defined tag       5
72635   Wikimetrics   report table performance, cleanup, and number of items             13
That’s 86 points in 4 stories.
The bugs with 0 points are tasks for the team to track and follow up on,
and the work mostly falls on other teams.
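For the curious, the batching idea behind bug 67450 ("database consumer could batch inserts") can be sketched as follows; the schema, table name, and batch size are made up for illustration and are not the actual EventLogging consumer code:

```python
import sqlite3

# Illustrative sketch of batching inserts instead of issuing one
# INSERT per event (the idea behind bug 67450). Uses an in-memory
# SQLite DB; the real consumer writes to MariaDB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ts TEXT, payload TEXT)")

BATCH_SIZE = 100  # arbitrary choice for the sketch
buffer = []

def consume(event):
    """Buffer an event; flush to the DB once the batch is full."""
    buffer.append(event)
    if len(buffer) >= BATCH_SIZE:
        flush()

def flush():
    """Write any buffered events in a single batched INSERT."""
    if buffer:
        conn.executemany("INSERT INTO events VALUES (?, ?)", buffer)
        conn.commit()
        buffer.clear()

for i in range(250):
    consume(("2014-10-30T00:00:00", "event-%d" % i))
flush()  # flush the final partial batch

print(conn.execute("SELECT COUNT(*) FROM events").fetchone()[0])  # 250
```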
Regards,
Kevin Leduc
Hello,
To comply with our privacy policy, we are going to purge logs on
stat1002 that are older than 90 days. Please let us know whether this
is an issue for you. We hope to have these changes done by the end of
next week.
A concrete example: logs in the eventlogging archive directory
(stat1002:/a/eventlogging/archive) will be restricted to the last 90
days.
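A retention sweep like this boils down to comparing file mtimes against a cutoff; a rough sketch (this is an illustration, not the actual purge script, and the commented-out path is just the example directory above):

```python
import os
import time

# Sketch of a 90-day retention sweep: list files whose modification
# time is past the retention window. Illustrative only.
RETENTION_DAYS = 90

def is_expired(mtime, now):
    """True if a file modified at `mtime` is past the retention window."""
    return mtime < now - RETENTION_DAYS * 86400

def files_to_purge(root):
    """Walk `root` and return paths of files older than the window."""
    now = time.time()
    return [
        os.path.join(dirpath, name)
        for dirpath, _dirs, names in os.walk(root)
        for name in names
        if is_expired(os.path.getmtime(os.path.join(dirpath, name)), now)
    ]

# Dry run first: print candidates instead of deleting them.
# for path in files_to_purge("/a/eventlogging/archive"):
#     print("would remove", path)
```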
Thanks,
Nuria
Hi,
just a quick heads up that the replication lag on
analytics-store.eqiad.wmnet (i.e., s5-analytics-slave, dbstore1002)
is currently at 17 hours and increasing.
I filed RT ticket 8788:
https://rt.wikimedia.org/Ticket/Display.html?id=8788
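If you want to keep an eye on the lag yourself, the number comes from the Seconds_Behind_Master field of MySQL's SHOW SLAVE STATUS; a small helper to make it readable (a sketch; running the query itself requires a normal MySQL connection to the replica):

```python
# Format MySQL's Seconds_Behind_Master (from SHOW SLAVE STATUS) as a
# human-readable lag. Seconds_Behind_Master is NULL when replication
# is not running.
def format_lag(seconds_behind_master):
    if seconds_behind_master is None:
        return "replication stopped"
    hours, rem = divmod(int(seconds_behind_master), 3600)
    minutes = rem // 60
    return "%dh %02dm" % (hours, minutes)

print(format_lag(17 * 3600))  # 17h 00m
```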
Best regards,
Christian
Hi,
in the week of 2014-10-20 to 2014-10-26, Andrew, Jeff, and I worked on
the following items around the Analytics Cluster and Analytics-related
Ops:
* Research on columnar storage in the cluster
* Research on how to count accesses to media files
* Rolling out ACK tuning for varnishkafka
* More work towards getting application id into logstash
(details below)
Have fun,
Christian
* Research on columnar storage in the cluster
Columnar storage engines can help speed up some of the queries we're
running and plan to run. So we did some more research around Parquet
and Avro, and how xmldumps imports could benefit from them.
* Research on how to count accesses to media files
We have had many requests to make access counts for media files
public. Since the basic infrastructural ingredients are within reach,
we started to explore what would be doable towards getting such data
public.
* Rolling out ACK tuning for varnishkafka
As reported for the previous week, the ACK tuning for varnishkafka
proved to avoid message loss during leader elections. So we are
incrementally deploying the new ACK parameter to the caches; 3 out of
4 clusters are already using it, and the deployment to the fourth
cluster is still pending.
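For reference, the knob in question is librdkafka's request.required.acks, which varnishkafka passes through from its configuration; the fragment below is an assumption about the deployed value, not a quote from the actual config:

```
# varnishkafka.conf fragment (illustrative):
# -1 = wait for all in-sync replicas to acknowledge a message
# before considering it delivered, so a leader election does not
# drop messages that only the old leader had seen.
kafka.request.required.acks = -1
```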
* More work towards getting application id into logstash
Repackaging jars to inject the log4j configurations allowed us to get
more logs into logstash. We are also starting to extract application
ids from log messages, which will finally make it possible to use
logstash to fetch and filter the logs for the applications (like Hive
queries) one is running on the cluster.
Hi,
in the week of 2014-10-13 to 2014-10-19, Andrew, Jeff, and I worked on
the following items around the Analytics Cluster and Analytics-related
Ops:
* Webstatscollector deployment (Bug 66352, Bug 71790)
* Testing potential kafkatee fix
* Analytics1021, its partition leader role, and missing data
* gp.wmflabs.org showing empty graphs
* Database lags
* Obtaining HTTPS numbers to assist around POODLE vulnerability
* Redeployment of some Hive scripts
* Preparations for ua_parser Hive UDF
(details below)
Have fun,
Christian
* Webstatscollector deployment (Bug 66352, Bug 71790)
As reported in previous weeks, new webstatscollector builds had been
prepared to stop counting requests to the “Undefined” page (Bug
66352) and to stop counting redirects twice (Bug 71790). Those new
builds have now been deployed to both webstatscollector pipelines.
* Testing potential kafkatee fix
From time to time, kafkatee did not consume from all relevant Kafka
partitions. The kafkatee maintainer provided a potential fix that has
been running on analytics1003 since then. The kafkatee-generated files
look good so far, but since the issue previously took some time to
manifest, the tests need to run a bit longer.
* Analytics1021, its partition leader role, and missing data
Analytics1021 again dropped out of its partition leader role. This is
the first time this has happened since the ACK parameters were tuned
on some machines, and the tuning proved worthwhile: the caches with
tuned ACK parameters did not see message loss.
Since the issue happened again later, and again exactly the machines
with tuned ACK parameters saw no message loss, we can prepare to roll
out the tuned ACK parameters more widely.
* gp.wmflabs.org showing empty graphs
In 2013, some graphs on gp.wmflabs.org were taken offline due to
privacy concerns. However, the main dashboard still referenced some of
those graphs and rendered them as empty. This made the dashboard
/look/ broken, although the public graphs rendered as expected. We
updated the dashboard to no longer reference offline graphs, so it no
longer looks broken.
* Database lags
Due to different, unrelated causes, some databases lagged considerably
during this week. Ops got the databases back to normal again.
* Obtaining HTTPS numbers to assist around POODLE vulnerability
In order to decide on how to address the POODLE vulnerability, Ops
needed numbers on usage of HTTPS for old browsers. Since this data is
not prepared automatically, we extracted the numbers from the logs.
* Redeployment of some Hive scripts
It seems an unannounced Friday deployment during the SF hackathon
angered the deployment gods and caused some Oozie/Hive jobs to stop
running correctly. So we had to fix the setup, resubmit the jobs, and
backfill the missing data. No data was lost.
* Preparations for ua_parser UDF
There is a push from several sides for a Hive UDF that can parse
User-Agents. A good part of the week was spent implementing and
reviewing this UDF, but it's not yet merged and will require a bit
more work.
Both of the presentations at the October Wikimedia Research Showcase were
fascinating and I encourage everyone to watch them [1]. I would like to
continue to discuss the themes from the showcase about Wikipedia's
adaptability, viability, and diversity.
Aaron's discussion about Wikipedia's ongoing internal adaptations, and
the slowing of those adaptations, reminded me of this statement from MIT
Technology Review in 2013 (and I recommend reading the whole article [2]):
"The main source of those problems (with Wikipedia) is not mysterious. The
loose collective running the site today, estimated to be 90 percent male,
operates a crushing bureaucracy with an often abrasive atmosphere that
deters newcomers who might increase participation in Wikipedia and broaden
its coverage."
I would like to contrast that vision of Wikipedia with the vision presented
by User:CatherineMunro (formatting tweaks by me), which I re-read when I
need encouragement:
"THIS IS AN ENCYCLOPEDIA
One gateway
to the wide garden of knowledge,
where lies
The deep rock of our past,
in which we must delve
The well of our future,
The clear water
we must leave untainted
for those who come after us,
The fertile earth,
in which truth may grow
in bright places,
tended by many hands,
And the broad fall of sunshine,
warming our first steps
toward knowing
how much we do not know."
How can we align ourselves less with the former vision and more with the
latter? [3]
I hope that we can continue to discuss these themes on the Research mailing
list. Please contribute your thoughts and questions there.
Regards,
Pine
[1] youtube.com/watch?v=-We4GZbH3Iw
[2]
http://www.technologyreview.com/featuredstory/520446/the-decline-of-wikiped…
[3] Lest this at first seem impossible, I will borrow and tweak a
quote from George Bernard Shaw that was later used by John F. Kennedy:
"Some people see things as they are and say, 'Why?' Let us dream things
that never were and say, 'Why not?'"
Forwarding comments from Wikimedia-l that may be of interest to a number of
subscribers on other lists.
Pine
---------- Forwarded message ----------
From: "Erik Moeller" <erik(a)wikimedia.org>
Date: Oct 25, 2014 5:59 PM
Subject: Re: [Wikimedia-l] Chapters and GLAM tooling
To: "Wikimedia Mailing List" <wikimedia-l(a)lists.wikimedia.org>
Cc:
On Sat, Oct 25, 2014 at 7:16 AM, MZMcBride <z(a)mzmcbride.com> wrote:
> Labs is a playground and Galleries, Libraries, Archives, and Museums are
> serious enough to warrant a proper investment of resources, in my view.
> Magnus and many others develop magnificent tools, but my sense is that
> they're largely proofs of concept, not final implementations.
Far from being treated as mere proofs of concept, Magnus' GLAM tools
[1] have been used to measure and report success in the context of
project grant and annual plan proposals and reports, ongoing project
performance measurements, blog posts and press releases, etc. Daniel
Mietchen has, to my knowledge, been the main person doing any
systematic auditing or verification of the reports generated by these
tools, and results can be found in his tool testing reports, the last
one of which is unfortunately more than a year old. [2]
Integration with MediaWiki should IMO not be viewed as a runway that
all useful developments must be pushed towards. Rather, we should seek
to establish clearer criteria by which to decide that functionality
benefits from this level of integration, to such an extent that it
justifies the cost. Functionality that is not integrated in this
manner should, then, not be dismissed as "proofs of concept" but
rather judged on its own merits.
GWToolset [3] is a good example. It was built as a MediaWiki extension
to manage GLAM batch uploads, but we should not regard this decision
as sacrosanct, or the only correct way to develop this kind of
functionality. The functionality it provides is of highly specialized
interest, and indeed, the number of potential users to-date is 47
according to [4], most of whom have not performed significant uploads
yet. Its user interface is highly specialized and special permissions
+ detailed instructions are required to use it. At the same time, it
has been used to upload 322,911 files overall, an amazing number even
without going into the quality and value of the individual
collections.
So, why does it need to be a MediaWiki extension at all? When
development began in 2012, OAuth support in MediaWiki did not exist,
so it was impossible for an external tool (then running on toolserver)
to manage an upload on the user's behalf without asking for the user's
password, which would have been in violation of policy. But today, we
have other options. It's possible that storage requirements or other
specific desired integration points would make it impossible to create
this as a Tool Labs tool -- but if we created the same tool today, we
should carefully consider that.
Indeed, highly specialized tools for the cultural and education sector
_are_ being developed and hosted inside Tool Labs or externally.
Looking at the current OAuth consumer requests [5], there are
submissions for a metadata editor developed by librarians at the
University of Miami Libraries in Coral Gables, Florida, and an
assignment creation wizard developed by the Wiki Education Foundation.
There's nothing "improper" about that, as Marc-André pointed out.
As noted before, for tools like the ones used for GLAM reporting to
get better, WMF has its role to play in providing more datasets and
improved infrastructure. But there's nothing inherent in the
development of those tools that forces them to live in production
land, or that requires large development teams to move them forward.
Auditing of numbers, improved scheduling/queuing of database requests,
optimization of API calls and DB queries; all of this can be done by
individual contributors, making this suitable work for even chapters
with limited experience managing technical projects to take on.
On the analytics side, we're well aware that many users have asked for
better access to the pageview data, either through MariaDB, or through
a dedicated API. We have now said for some time that our focus is on
modernizing the infrastructure for log analysis and collection,
because the numbers collected by the old webstatscollector code were
incomplete, and the infrastructure subject to frequent packet loss
issues. In addition, our ability to meet additional requirements on
the basis of simple pageview aggregation code was inherently
constrained.
To this end, we have put into production use infrastructure to collect
and analyze site traffic using Kafka/Hadoop/Hive. At our scale, this
has been a tremendously complex infrastructure project which has
included custom development such as varnishkafka [6]. While it's taken
longer than we've wanted, this new infrastructure is being used to
generate a public page count dataset as of this month, including
article-level mobile traffic for the first time [7]. Using
Hadoop/Hive, we'll be able to compile many more specialized reports,
and this is only just beginning.
Giving community developers better access to this data needs to be
prioritized relative to other ongoing analytics work, including but
not limited to:
- Continued development and maintenance of the above infrastructure
foundations;
- Development of "Vital Signs": public reports on editor activity,
content contribution, sign-ups and other metrics. This tool gives us
more timely access to key measures than WikiStats [9] (or the
reportcard [10], which to-date still consumes Wikistats data). Rather
than having to wait 4-6 weeks to know what's happening with regard to
editor numbers, we can see continuous updates on a day-to-day basis.
- Development of Wikimetrics, which analyzes the editing activity of a
group of editors, and which is essential for measuring all movement
work that targets increased activity by a targeted group (e.g.
editathon), and is a key tool used for grants evaluation (was a funded
program worth the $$?). A lot of thought has gone into the development
of standardized global metrics [12] for program work, much of it
using this technology and dependent on its continued development.
- Measurement (instrumentation) of site actions and
development/maintenance of associated infrastructure. As an example,
in-depth data collection for features like Media Viewer (see
dashboards at [13] ) is only possible because of the EventLogging
extension developed by Ori Livneh, and the increasing use of this
technology by WMF developers. EventLogging requires significant
management, maintenance and teaching effort from the analytics team.
Lila is requesting visibility into all primary funnels on Wikimedia
sites (e.g. sign-ups, edits/saves through wikitext, edits/saves
through VisualEditor, etc.), and this will require lots of sustained
effort from lots of people to get done. What it will give us is a
better sense of where people succeed and fail to complete an action --
by way of example, see the initial UploadWizard funnel analysis here:
https://www.mediawiki.org/wiki/UploadWizard/Funnel_analysis
- Improved software and infrastructure support for A/B testing,
possibly including adoption of existing open source tooling such as
Facebook's PlanOut library/interpreter [14].
- Improved readership metrics, possibly including a privacy-sensitive
approach to estimating Unique Visitors, and better geographic
breakdowns for readers/editors.
These are all complex problems, most of which are dependent on the
small analytics team, and feedback on projects and priorities is very
much welcome on the analytics mailing list:
https://lists.wikimedia.org/mailman/listinfo/analytics
With regard to better embedding of graphs in wikis specifically, Yuri
Astrakhan has led the development of a new extension, inspired by work
by Dan Andreescu, to visualize data directly in wikis. This extension
has been deployed already to Meta and MediaWiki.org and can be used
for dynamic graphs where it's appropriate to not have a fallback to a
static image, for example in grant reports. See:
https://www.mediawiki.org/wiki/Extension:Graph
https://www.mediawiki.org/wiki/Extension:Graph/Demo
https://meta.wikimedia.org/wiki/Graph:User:Yurik_(WMF)/Obama
I agree this is the kind of functionality that should make its way
into Wikipedia. Again, we need to judge throwing a full team behind
that against the relative priority of other work. In the meantime,
Yuri and others will continue to push it along and may even be able to
get it all the way there in due time. The main blockers, from what I
can tell, are generation of static fallback images for users without
JavaScript, and a better way to manage the data sources.
In general, the point of my original message was this: All
organizations that seek to improve Wikipedia and the other Wikimedia
projects ultimately depend on technology to do so; to view WMF as the
sole "tech provider" does not scale. Larger, well-funded chapters can
take on big, hairy challenges like Wikidata; smaller, less-funded orgs
are better positioned to work on specialized technical support for
programmatic work.
I would caution against requesting WMF to work on highly specialized
solutions for highly specialized problems. If such solutions are
needed, I would caution against building them into MediaWiki unless
they can be generalized to benefit a larger number of users, at which
point it's appropriate to seek partnership with WMF, or to ask WMF for
the relative priority of such work. But often, it's perfectly fine
(and much faster) to build such tools and reports independently, and
to ask WMF for help in providing APIs/services/data/infrastructure to
get it done.
Cheers,
Erik
[1] http://tools.wmflabs.org/glamtools/
[2]
https://outreach.wikimedia.org/wiki/Category:This_Month_in_GLAM_Tool_testin…
[3] https://www.mediawiki.org/wiki/Extension:GWToolset
[4]
https://commons.wikimedia.org/w/index.php?title=Special%3AListUsers&usernam…
[5]
https://www.mediawiki.org/wiki/Special:OAuthListConsumers?name=&publisher=&…
[6] https://github.com/wikimedia/varnishkafka
[7] https://wikitech.wikimedia.org/wiki/Analytics/Pagecounts-all-sites
[8] https://metrics.wmflabs.org/static/public/dash/
[9] http://stats.wikimedia.org/
[10] http://reportcard.wmflabs.org/
[11] https://metrics.wmflabs.org/
[12]
https://meta.wikimedia.org/wiki/Grants:Learning_%26_Evaluation/Global_metri…
[13] http://multimedia-metrics.wmflabs.org/dashboards/mmv
[14] https://github.com/facebook/planout
--
Erik Möller
VP of Product & Strategy, Wikimedia Foundation
_______________________________________________
Wikimedia-l mailing list, guidelines at:
https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines
Wikimedia-l(a)lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l,
<mailto:wikimedia-l-request@lists.wikimedia.org?subject=unsubscribe>
Hi all,
We set up some data collection a few weeks back to look at the distribution
of actual screen size, viewport size and media viewer canvas size on
sampled users. This will ultimately be used to come up with a better choice
of thumbnail size buckets for Media Viewer.
I had some spare time and figured I'd try to generate a visualization of
that data, which we haven't analyzed yet.
Here are the results:
https://upload.wikimedia.org/wikipedia/commons/8/82/Screen_heatmap.png
https://upload.wikimedia.org/wikipedia/commons/2/2f/Viewport_heatmap.png
https://upload.wikimedia.org/wikipedia/commons/d/d3/Canvas_heatmap.png
(Media Viewer-specific)
I think it shows quite strikingly how screen size really doesn't matter
much compared to the actual available viewport. Hopefully this data will be
useful for other folks too, since I don't believe we tracked that
information before. And given media viewer's traffic, it should be pretty
representative of wikis in general.
The data used to generate those images is all the data we've collected so
far. I haven't looked at differences between wikis, etc. For people with
analytics access, the EL table I dug that data from is
MultimediaViewerDimensions_10014238
Note that this is mostly desktop data, since it's very unusual to run
Media Viewer on mobile devices: it wasn't made for them, and the
mobile site has its own MV-like lightbox.
The code used to generate these images is this quick-and-dirty
Processing script I hacked together: https://phabricator.wikimedia.org/P39
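For anyone who wants to play with the raw table, the core binning step behind a heatmap like these can be sketched in a few lines of Python (an illustration of the idea, not the actual script; the 100-pixel cell size and the sample dimensions are arbitrary):

```python
from collections import Counter

# Bin sampled (width, height) pairs into 100x100-pixel cells and
# count how many samples land in each cell; the counts are what a
# heatmap colors. Cell size and samples are arbitrary choices here.
BUCKET = 100

def heatmap(dimensions):
    cells = Counter()
    for width, height in dimensions:
        cells[(width // BUCKET, height // BUCKET)] += 1
    return cells

samples = [(1366, 768), (1920, 1080), (1344, 742)]
print(heatmap(samples).most_common(1))  # the hottest cell and its count
```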