Hi Analytics,
I'm looking for a way to do a cohort analysis for a presentation that I'm drafting.
I want a report that shows:
* A list of languages showing the users who speak each language as identified on their user pages
* A list of projects where users have made at least 5 edits in the past 12 months
* A list of group members who have administrator rights, and the wikis on which they hold them
* A list of public mailing lists where members have contributed in the past 12 months
* Number of public emails on those mailing lists in the past 12 months
* Total edits made by the cohort
* Total bytes changed by the cohort
* Total logged-in time for the cohort, if log-in time aggregation is being done
What automated tools could I use to create this report?
Are there any significant editor productivity metrics that are missing from this list?
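In case it helps frame the question: for the total edits and total bytes changed items, I could imagine a semi-manual fallback that queries the Labs database replicas wiki by wiki, along the lines of the sketch below. This is only a sketch; the cohort members, wiki list and connection details are placeholders, and it assumes the current revision table layout (rev_user_text, rev_parent_id, rev_len). I would much rather have an automated tool.

    # Sketch only: total edits and bytes changed for a cohort over the past
    # 12 months, queried from the Labs database replicas one wiki at a time.
    # Cohort members, wiki list and connection details are placeholders.
    import os
    import pymysql

    COHORT = ["Example User 1", "Example User 2"]
    WIKIS = ["metawiki", "enwiki", "dewiki"]
    SINCE = "20130521000000"  # MediaWiki timestamp for "12 months ago"

    placeholders = ",".join(["%s"] * len(COHORT))
    QUERY = (
        "SELECT COUNT(*), "
        "       COALESCE(SUM(ABS(r.rev_len - COALESCE(p.rev_len, 0))), 0) "
        "FROM revision r "
        "LEFT JOIN revision p ON p.rev_id = r.rev_parent_id "
        f"WHERE r.rev_user_text IN ({placeholders}) "
        "  AND r.rev_timestamp >= %s"
    )

    total_edits = total_bytes = 0
    for wiki in WIKIS:
        conn = pymysql.connect(host=f"{wiki}.labsdb", db=f"{wiki}_p",
                               read_default_file=os.path.expanduser("~/.my.cnf"))
        with conn.cursor() as cur:
            cur.execute(QUERY, COHORT + [SINCE])
            edits, bytes_changed = cur.fetchone()
            total_edits += edits
            total_bytes += int(bytes_changed)
        conn.close()

    print(total_edits, total_bytes)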
Thanks,
Pine
Hi everybody,
we’re preparing for the May 2014 research newsletter and looking for contributors. Please take a look at: https://etherpad.wikimedia.org/p/WRN201405 and add your name next to any paper you are interested in covering. As usual, short notes and one-paragraph reviews are most welcome.
Highlights from this month:
• Detecting epidemics using Wikipedia article views: A demonstration of feasibility with language as location proxy
• Wikipedia in the eyes of its beholders: A systematic review of scholarly research on Wikipedia readers and readership
• "The sum of all human knowledge": a systematic review of scholarly research on the content of Wikipedia
• Uneven Openness: Barriers to MENA Representation on Wikipedia
• Sex ratios in Wikidata
• Automatically Detecting Corresponding Edit-Turn-Pairs in Wikipedia
• A Novel Methodology Based on Formal Methods for Analysis and Verification of Wikis
• Okinawa in Japanese and English Wikipedia
• Bipartite Editing Prediction in Wikipedia
• Increasing the Discoverability of Digital Collections Using Wikipedia: The Pitt Experience
• Playscript Classification and Automatic Wikipedia Play Articles Generation
If you have any questions about the format or process, feel free to get in touch off-list.
Dario Taraborelli and Tilman Bayer
[1] http://meta.wikimedia.org/wiki/Research:Newsletter
Thanks everyone. It looks like the existing analysis tools for the most part aren't capable of doing what I'd like them to do, so I'll manually look up some information.
The cohort that I am analyzing is the current Individual Engagement Grants Committee. By my manual count we have 15 members who speak a combined 16 languages and are geographically located on 5 continents. I was hoping to get a lot more detail about the aggregate productivity and diversity of the group. I am using this group as an example of a Meta-level committee for a presentation to a group of non-Wikipedia technologists. If easy automation was available I would create similar reports for Stewards, AffCom, GAC, the FDC, and the Wikimania Committee.
These kinds of cross-wiki statistics may be useful in the next strategic planning process so I hope more of these tools will be available in the near future.
Pine
Hi Erik Z.,
I did look at http://www.infodisiac.com/Wikipedia/ScanMail/_PowerPosters.html
It seems to me that some of the color codes for lists are incorrect.
Analytics, EE, and Research are categorized as F lists while Commons is categorized as a T list.
Can you explain why you have these categories set this way, or change the categories?
Thanks,
Pine
The only reference dashboard directory we have right now, AFAIK, is:
https://meta.wikimedia.org/wiki/Research:Data/Dashboards
There's a bunch of Limn dashboards missing from this list, and there
are probably dead/unmaintained ones still on it. I'd appreciate help
keeping it in shape.
Thanks,
Erik
--
Erik Möller
VP of Engineering and Product Development, Wikimedia Foundation
(resending to analytics@)
tl;dr I am hoping the setup for wikipage view counts (based on Varnish UDP
logging) could be reused for thumbnails, and I am asking for advice on that.
Hi all,
while the tracking of user behavior on Wikimedia sites is generally in a
sad state, there is a decent-enough external tool [1] for telling which
Wikipedia articles are popular and how that popularity changes day-to-day.
The same is not true for images, for which there are no public view
statistics at all. As a poor but still-better-than-nothing tool, people
have been looking at file description page view counts to get an estimate
of the level of interest in an image. This will be rendered mostly useless
by MediaViewer (per our logs, only about 2% of readers follow through to
the file page), which makes GLAM people sad.
I am looking into alternative ways for supporting image curation with usage
statistics. I can see four use cases here:
1. track how often an image is seen
2. track how many people are interested in an image
3. track how many people download/reuse an image
4. track how many people are interested in image metadata (e.g. GLAMs like
to know not only how many people have seen the image, but also how many of
them have seen which institution it comes from)
I haven't yet collected much information on how much people actually want
each of these; I would like to understand first which of them are
realistically doable. I'll share my vague plans on how to implement them; I
would appreciate it if you could poke holes in them / suggest better solutions.
== Tracking image view counts (including thumbnails) ==
This seems largely analogous to how page view counts are tracked: in one
case we need to know Varnish html hit counts, in the other Varnish image
hit counts.
For the page view counts, according to [2], Varnishes send a UDP packet to
a statistics server whenever a page is requested; the results are
aggregated in hourly buckets, published at [2], then further aggregated
into daily buckets, put into an SQL database and visualised on a 3rd party
server.
This seems to be easily replicable for thumbnails / image originals:
deduplicate by file name (and possibly some sort of thumbnail size
buckets), save in dump files, and ask Henrik to include them in
stats.grok.se (it would probably be easy to hack them into the current
system if they get fake wikidb names). It would be even more reliable than
page view counts, because image redirects work differently from article
redirects and don't split the view counts.
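To make it concrete, here is a minimal sketch of the aggregation step, assuming the input is just one requested upload.wikimedia.org URL per line; the real log format and the size-bucket boundaries here are assumptions.

    # Sketch: turn a stream of requested upload.wikimedia.org URLs (one per
    # line, format assumed) into per-file counts, with crude thumbnail size
    # buckets. Bucket boundaries are made up for illustration.
    import re
    import sys
    from collections import Counter

    # e.g. /wikipedia/commons/thumb/a/ab/Example.jpg/320px-Example.jpg
    THUMB_RE = re.compile(r"/thumb/[0-9a-f]/[0-9a-f]{2}/([^/]+)/(\d+)px-")
    # e.g. /wikipedia/commons/a/ab/Example.jpg
    ORIG_RE = re.compile(r"/wikipedia/[^/]+/[0-9a-f]/[0-9a-f]{2}/([^/?]+)$")

    def size_bucket(width):
        if width < 250:
            return "small"
        if width < 600:
            return "medium"
        return "large"

    counts = Counter()
    for line in sys.stdin:
        url = line.strip()
        m = THUMB_RE.search(url)
        if m:
            counts[(m.group(1), size_bucket(int(m.group(2))))] += 1
            continue
        m = ORIG_RE.search(url)
        if m:
            counts[(m.group(1), "original")] += 1

    for (name, bucket), n in counts.most_common():
        print(f"{name}\t{bucket}\t{n}")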
The big drawback is caching: while HTML pages are not cached (in the sense
that the browser always sends a request for them), most images are [3]. I
don't think this could be realistically helped (in theory it could be
possible to calculate from the page view stats and the imagelinks table the
exact number of times an image has been displayed, but doing that on the
scale of Wikipedia pageviews is not plausible), and it might even be
considered a good thing: instead of view counts, we get a reasonable
approximation of unique visitors (image viewers).
== Tracking people "expressing interest" in a file (whatever that means) ==
Knowing how many people see the thumbnail of a file is important, but -
unlike article view counts - it does not really say how many people show
any interest in an image. It might just happen to be included in an article
they are interested in; maybe they never scrolled to the bottom of the
article and haven't seen the actual image at all. So it is useful to know
how many people clicked through to the image.
Without MediaViewer, file page view counts can be (and are) used for this
(even if they have their own problems for Commons images); with MediaViewer
there would have to be a way to tell apart a thumbnail that was requested
for use in MediaViewer vs. for some other reason. This could be done
similarly to how the ?download query parameters are handled: append a
source=mediaviewer URL parameter to the filename, have Varnish rewrite it
to avoid cache splitting, add the parameter to the UDP packet and filter on
it when processing the logs. This would be a fairly generic mechanism which
could be potentially used for other things (source=filepage,
source=hovercards etc); it would still split the cache on the browser side,
but given that the large thumbnails used by MediaViewer don't overlap much
with the thumbnail sizes used on wiki pages, this does not seem tragic.
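Filtering on such a parameter during log processing would then be simple; a minimal sketch, again assuming one requested URL per line and the hypothetical source=mediaviewer parameter:

    # Sketch: count thumbnail requests attributed to MediaViewer by filtering
    # on a hypothetical source=mediaviewer query parameter (one URL per line).
    import sys
    from collections import Counter
    from urllib.parse import parse_qs, urlsplit

    views = Counter()
    for line in sys.stdin:
        url = urlsplit(line.strip())
        if parse_qs(url.query).get("source") == ["mediaviewer"]:
            # crude: use the last path segment as the file identifier
            views[url.path.rsplit("/", 1)[-1]] += 1

    for name, n in views.most_common(20):
        print(f"{name}\t{n}")

The same kind of filter, keyed on the ?download parameter instead, could give a first approximation of the download counts discussed below.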
== Tracking the number of downloads/reuses ==
MediaViewer adds a ?download parameter to download links (to ensure that
the Content-Disposition header is set), which could be used to track
downloads. (It would miss downloads from other sources; I'm not sure there
is a generic solution. Maybe something based on referrer or some other
header, in case the browsers set those differently for downloaded and
displayed images?)
Reuse is too vague a category to say anything about. Tracking views of an
image which originate outside the Wikimedia universe seems possible, but it
is too much effort and too different from the previous methods to be worth
spending time on now.
== Tracking the interest in image metadata ==
MediaViewer collects global statistics on the ratio of people viewing an
image vs. following through to the file page vs. scrolling down to open the
panel with file metadata information. I don't think a per-file tracking of
this would be useful or worth the effort; if it is needed, it would have to
be done by some EventLogging-ish setup with MediaViewer creating tracking
gifs whenever the metadata information is opened.
What do you think, is any of this plausible?
[1] http://stats.grok.se/
[2] http://dumps.wikimedia.org/other/pagecounts-raw/
[3] looking at the effects of a page load in the network tab, it seems that
most thumbnails on a page are loaded from the browser cache, while there
are a few which are requested from the servers which respond with a 304.
Which images do that is deterministic but otherwise seems totally random;
e.g. on the current enwiki mainpage, RalphBakshiJan09.jpg is loaded from
cache while Prayuth_Jan-ocha_2010-06-17_ITN.jpg is always requested. I am
curious about the reason for this.
Hi all,
the Multimedia team is preparing to collect data to better understand
usability problems with UploadWizard. UW has a "checkout" structure (step
1: put files in basket, step 2: choose license, step 3: add description,
step 4: you are done), so a funnel analysis to identify which step causes
the most users to abort the upload process and why seems like a good
approach. I'm trying to understand how well the existing EventLogging
infrastructure supports this.
The problem is how to get information about the actions of users who fell
out of the funnel. I'll try to illustrate with an example: in one of the
steps, the user can choose between "I am uploading my own work" and "I am
uploading someone else's work" and the resulting interaction will be quite
different. We would like to know whether that choice has a big effect on
the likelihood of the user making it to the next step.
Using EventLogging, I can count the number of users who make it to that
step. I can count the number of users making it to the next step. I can
count the number of users choosing this or that author option. These
numbers do not tell us much on their own, though; the interesting
information would be how they are correlated.
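To make the limitation concrete, a minimal sketch with made-up events and hypothetical field names: the per-step counts are easy, but they say nothing about which step-2 users were lost or how that relates to the author option.

    # Sketch with made-up events and hypothetical field names: per-step counts
    # are easy, but they do not say how author choice and drop-out correlate.
    events = [
        {"user": "a", "step": 2, "authorOption": "ownwork"},
        {"user": "a", "step": 3, "authorOption": "ownwork"},
        {"user": "b", "step": 2, "authorOption": "thirdparty"},
        # user "b" never reaches step 3
    ]

    for step in sorted({e["step"] for e in events}):
        users = {e["user"] for e in events if e["step"] == step}
        print(step, len(users))
    # prints "2 2" and "3 1", but nothing here identifies *which* step-2
    # users were lost, or whether they had picked "own work".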
Another thing I could do is create a schema which includes both the
choice of author option and the step number; when the user chooses "own
work", we log an ownwork event, and when they click "next step", we log a
step(step=3, work=own) event. We can then calculate the number of users who
chose "own work" but did not make it to the next step as the difference of
the two counts. But this won't work: "own work" is a radio button, and the
user can select and deselect it any number of times before proceeding to
the next step (or leaving the page).
So what we are trying to log are not really events but application states
that describe users who are successful vs. unsuccessful in the given step.
I thought of two ways of dealing with this; any feedback on the
plausibility of these or possible alternatives would be highly appreciated.
One would be to have a "step X succeeded" and a "step X failed" event (the
schema for which could include all sorts of state, such as which authorship
option was selected). This would require the ability to log an event when
the user leaves the page. I see two ways to do that:
- send the event log as a synchronous request from an unload event handler.
This is not supported on ancient browsers; also, there is probably some
mechanism in most browsers to kill an unload event handler if it takes too long.
- store the event in cookies/localStorage, log it on the next page load.
This works in all browsers but it is less reliable (what if the user never
comes back?), logs the event for a different page load from the one where it
actually occurred (what if the user comes back after a month?), and
probably runs into all sorts of complications with multiple tabs.
The other way could be to log event chains: set a random identifier (which
only lives until the page is unloaded), and add it to every event. Event
groups can then be merged into meta-events by SQL magic, although that
looks like it will be extremely painful to do. On the other hand, this is
much more generic than the previous method, and could be used to answer
more complex questions.
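A minimal sketch of what that merging could look like once the rows are exported, assuming each event carries the per-pageview token, an action name and a timestamp; all field names and sample data here are hypothetical.

    # Sketch: fold per-pageview event chains into one funnel outcome per
    # chain. The token is the random per-pageview identifier; field names
    # and sample data are hypothetical.
    from collections import defaultdict

    events = [
        {"token": "t1", "ts": 1, "action": "step", "step": 2},
        {"token": "t1", "ts": 2, "action": "selectAuthor", "option": "ownwork"},
        {"token": "t1", "ts": 3, "action": "step", "step": 3},
        {"token": "t2", "ts": 1, "action": "step", "step": 2},
        {"token": "t2", "ts": 2, "action": "selectAuthor", "option": "thirdparty"},
        # chain t2 never reaches step 3: that upload was abandoned
    ]

    chains = defaultdict(list)
    for e in events:
        chains[e["token"]].append(e)

    for token, chain in chains.items():
        chain.sort(key=lambda e: e["ts"])
        furthest = max((e["step"] for e in chain if e["action"] == "step"),
                       default=None)
        author = next((e["option"] for e in reversed(chain)
                       if e["action"] == "selectAuthor"), None)
        print(token, furthest, author)

Doing the equivalent in pure SQL over the EventLogging tables would mean grouping and self-joining on the token, which is presumably where the pain comes in; exporting the rows and folding them in a small script like this may be the less painful route.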
What do you think? Which would be the method I am not shooting myself in
the foot with? Currently I am leaning towards using unload handlers.
This is a summary of my discussion with Maryana on mobile instrumentation needs:
Registration tagging
The ServerSideAccountCreation log only knows about mobile vs desktop registrations. Down the line we should modify it to tag new accounts by source {desktop, mobile web, mobile app, other}. For now, we can just use the existing log and extract the source by combining the userAgent and displayMobile fields. There are comments on the Schema:MobileWikiAppCreateAccount talk page that require feedback from the mobile team, if we intend to use this log for signups on apps.
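A minimal sketch of that interim extraction, combining the userAgent and displayMobile fields; the app User-Agent marker below is an assumption and would need to be checked against real logs.

    # Sketch: classify ServerSideAccountCreation rows into
    # {desktop, mobile web, mobile app, other} from userAgent + displayMobile.
    # The app User-Agent marker is an assumption, not verified against logs.
    def registration_source(user_agent, display_mobile):
        ua = (user_agent or "").lower()
        if "wikipediaapp" in ua:      # assumed marker for the official apps
            return "mobile app"
        if display_mobile:
            return "mobile web"
        if ua:
            return "desktop"
        return "other"

    # example rows as (userAgent, displayMobile) pairs
    rows = [
        ("Mozilla/5.0 (Linux; Android 4.4) WikipediaApp/2.0", False),
        ("Mozilla/5.0 (iPhone; CPU iPhone OS 7_1 like Mac OS X)", True),
        ("Mozilla/5.0 (Windows NT 6.1)", False),
    ]
    for ua, mobile in rows:
        print(registration_source(ua, mobile))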
Revision tagging
We propose the creation of a new “app” tag in MediaWiki on top of the current “mobile edit” tag, which we will keep for backward-compatibility. This will give us an easy way of counting all mobile edits combined or app-only edits. Detailed breakdown of edit volume/quality by app platform or version will be done via EventLogging instrumentation.
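Once both tags exist, the counting side is a single query against the change_tag table; a minimal sketch, with 'app' as a placeholder for whatever the new tag ends up being called:

    # Sketch: count edits per tag from the change_tag table on a Labs replica.
    # "app" is only the proposed tag name; adjust once the tag actually exists.
    import os
    import pymysql

    QUERY = """
        SELECT ct_tag, COUNT(*)
        FROM change_tag
        WHERE ct_tag IN ('mobile edit', 'app')
        GROUP BY ct_tag
    """

    conn = pymysql.connect(host="enwiki.labsdb", db="enwiki_p",
                           read_default_file=os.path.expanduser("~/.my.cnf"))
    with conn.cursor() as cur:
        cur.execute(QUERY)
        for tag, edits in cur.fetchall():
            print(tag, edits)
    conn.close()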
App instrumentation
Platform-specific logs such as Schema:MobileWikiAppEdit are an ok interim solution but we should revamp the original plans for a consolidated edit funnel instrumentation to be able to compare edit funnel events across different platforms and interfaces. Given that this data is not product/feature specific, it’s likely an Analytics responsibility to implement it.
Tablet edits
We still don’t have a consistent way of identifying and counting edits made on tablets (on our desktop or mobile site), but we can get some initial estimates by using UA data from the last 30 days.
Mobile dashboards
We will review the existing mobile dashboards to figure out which ones should be eventually migrated to Vital Signs.
Dario
The next Research & Data showcase will be live-streamed today, Wed 5/21, at 11:30 PT.
The streaming link will be posted on the lists a few minutes before the showcase starts and as usual you can join the conversation on IRC at #wikimedia-research.
We look forward to seeing you!
Dario
This month:
UX research at WMF
Introducing Abbey Ripstra, the new UX research lead at the Wikimedia Foundation.
A bird's eye view of editor activation
by Dario Taraborelli -- In this talk I will give a high-level overview of data on new editor activation, presenting longitudinal data from the largest Wikipedias, a comparison between desktop and mobile registrations and the relative activation rates of different cohorts of newbies.
Collaboration patterns in Articles for Creation
by Aaron Halfaker -- Wikipedia needs to attract and retain newcomers while also increasing the quality of its content. Yet new Wikipedia users are disproportionately affected by the quality assurance mechanisms designed to thwart spammers and promoters. English Wikipedia's en:WP:Articles for Creation provides a protected space for newcomers to draft articles, which are reviewed against minimum quality guidelines before they are published. In this presentation, I'll describe a study of how this drafting process has affected the productivity of newcomers in Wikipedia. Using a mixed qualitative and quantitative approach, I'll show that the process's pre-publication review, which is intended to improve the success of newcomers, in fact decreases newcomer productivity in English Wikipedia, and I'll offer recommendations for system designers.