Hi Analytics,
I'm looking for a way to do a cohort analysis for a presentation that I'm drafting.
I want a report that shows:
* A list of languages showing the users who speak each language as identified on their user pages
* A list of projects where users have made at least 5 edits in the past 12 months
* A list of group members who have administrator rights, and the wikis on which they hold them
* A list of public mailing lists where members have contributed in the past 12 months
* Number of public emails on those mailing lists in the past 12 months
* Total edits made by the cohort
* Total bytes changed by the cohort
* Total logged-in time for the cohort, if log-in time aggregation is being done
What automated tools could I use to create this report?
Are there any significant editor productivity metrics that are missing from this list?
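In case it helps frame the question: for the total edits and total bytes changed items, I could imagine a semi-manual fallback that queries the Labs database replicas wiki by wiki, along the lines of the sketch below. This is only a sketch; the cohort members, wiki list and connection details are placeholders, and it assumes the current revision table layout (rev_user_text, rev_parent_id, rev_len). I would much rather have an automated tool.

    # Sketch only: total edits and bytes changed for a cohort over the past
    # 12 months, queried from the Labs database replicas one wiki at a time.
    # Cohort members, wiki list and connection details are placeholders.
    import os
    import pymysql

    COHORT = ["Example User 1", "Example User 2"]
    WIKIS = ["metawiki", "enwiki", "dewiki"]
    SINCE = "20130521000000"  # MediaWiki timestamp for "12 months ago"

    placeholders = ",".join(["%s"] * len(COHORT))
    QUERY = (
        "SELECT COUNT(*), "
        "       COALESCE(SUM(ABS(r.rev_len - COALESCE(p.rev_len, 0))), 0) "
        "FROM revision r "
        "LEFT JOIN revision p ON p.rev_id = r.rev_parent_id "
        f"WHERE r.rev_user_text IN ({placeholders}) "
        "  AND r.rev_timestamp >= %s"
    )

    total_edits = total_bytes = 0
    for wiki in WIKIS:
        conn = pymysql.connect(host=f"{wiki}.labsdb", db=f"{wiki}_p",
                               read_default_file=os.path.expanduser("~/.my.cnf"))
        with conn.cursor() as cur:
            cur.execute(QUERY, COHORT + [SINCE])
            edits, bytes_changed = cur.fetchone()
            total_edits += edits
            total_bytes += int(bytes_changed)
        conn.close()

    print(total_edits, total_bytes)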
Thanks,
Pine
Hi everybody,
we’re preparing for the May 2014 research newsletter and looking for contributors. Please take a look at: https://etherpad.wikimedia.org/p/WRN201405 and add your name next to any paper you are interested in covering. As usual, short notes and one-paragraph reviews are most welcome.
Highlights from this month:
• Detecting epidemics using Wikipedia article views: A demonstration of feasibility with language as location proxy
• Wikipedia in the eyes of its beholders: A systematic review of scholarly research on Wikipedia readers and readership
• "The sum of all human knowledge": a systematic review of scholarly research on the content of Wikipedia
• Uneven Openness: Barriers to MENA Representation on Wikipedia
• Sex ratios in Wikidata
• Automatically Detecting Corresponding Edit-Turn-Pairs in Wikipedia
• A Novel Methodology Based on Formal Methods for Analysis and Verification of Wikis
• Okinawa in Japanese and English Wikipedia
• Bipartite Editing Prediction in Wikipedia
• Increasing the Discoverability of Digital Collections Using Wikipedia: The Pitt Experience
• Playscript Classification and Automatic Wikipedia Play Articles Generation
If you have any questions about the format or process, feel free to get in touch off-list.
Dario Taraborelli and Tilman Bayer
[1] http://meta.wikimedia.org/wiki/Research:Newsletter
Thanks everyone. It looks like the existing analysis tools for the most part aren't capable of doing what I'd like them to do, so I'll manually look up some information.
The cohort that I am analyzing is the current Individual Engagement Grants Committee. By my manual count we have 15 members who speak a combined 16 languages and are geographically located on 5 continents. I was hoping to get a lot more detail about the aggregate productivity and diversity of the group. I am using this group as an example of a Meta-level committee for a presentation to a group of non-Wikipedia technologists. If easy automation was available I would create similar reports for Stewards, AffCom, GAC, the FDC, and the Wikimania Committee.
These kinds of cross-wiki statistics may be useful in the next strategic planning process so I hope more of these tools will be available in the near future.
Pine
Hi Erik Z.,
I did look at http://www.infodisiac.com/Wikipedia/ScanMail/_PowerPosters.html
It seems to me that some of the color codes for lists are incorrect.
Analytics, EE, and Research are categorized as F lists while Commons is categorized as a T list.
Can you explain why you have these categories set this way, or change the categories?
Thanks,
Pine
The only reference dashboard directory we have right now, AFAIK, is:
https://meta.wikimedia.org/wiki/Research:Data/Dashboards
There's a bunch of Limn dashboards missing from this list, and there
are probably dead/unmaintained ones still on it. I'd appreciate help
keeping it in shape.
Thanks,
Erik
--
Erik Möller
VP of Engineering and Product Development, Wikimedia Foundation
(resending to analytics@)
tl;dr I am hoping the setup for wikipage view counts (based on Varnish UDP
logging) could be reused for thumbnails, and I am asking for advice on that.
Hi all,
while the tracking of user behavior on Wikimedia sites is generally in a
sad state, there is a decent-enough external tool [1] for telling which
Wikipedia articles are popular and how that popularity changes day-to-day.
The same is not true for images, for which there are no public view
statistics at all. As a poor but still-better-than-nothing tool, people
have been looking at file description page view counts to get an estimate
of the level of interest in an image. This will be rendered mostly useless
by MediaViewer (per our logs, only about 2% of readers follow through to
the file page), which makes GLAM people sad.
I am looking into alternative ways for supporting image curation with usage
statistics. I can see four use cases here:
1. track how often an image is seen
2. track how many people are interested in an image
3. track how many people download/reuse an image
4. track how many people are interested in image metadata (e.g. GLAMs like
to know not only how many people have seen the image, but also how many of
them have seen which institution it comes from)
I haven't yet collected much information on how much people actually want
each of these; I would like to understand first which of them are
realistically doable. I'll share my vague plans on how to implement them; I
would appreciate it if you could poke holes in them / suggest better solutions.
== Tracking image view counts (including thumbnails) ==
This seems largely analogous to how page view counts are tracked: in one
case we need to know Varnish html hit counts, in the other Varnish image
hit counts.
For the page view counts, according to [2], Varnishes send a UDP packet to
a statistics server whenever a page is requested; the results are
aggregated in hourly buckets, published at [2], then further aggregated
into daily buckets, put into an SQL database and visualised on a 3rd party
server.
This seems to be easily replicable for thumbnails / image originals:
deduplicate by file name (and possibly some sort of thumbnail size
buckets), save in dump files, and ask Henrik to include them in
stats.grok.se (it would probably be easy to hack them into the current
system if they get fake wikidb names). It would be even more reliable than
page view counts, because image redirects work differently from article
redirects and don't split the view counts.
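To make it concrete, here is a minimal sketch of the aggregation step, assuming the input is just one requested upload.wikimedia.org URL per line; the real log format and the size-bucket boundaries here are assumptions.

    # Sketch: turn a stream of requested upload.wikimedia.org URLs (one per
    # line, format assumed) into per-file counts, with crude thumbnail size
    # buckets. Bucket boundaries are made up for illustration.
    import re
    import sys
    from collections import Counter

    # e.g. /wikipedia/commons/thumb/a/ab/Example.jpg/320px-Example.jpg
    THUMB_RE = re.compile(r"/thumb/[0-9a-f]/[0-9a-f]{2}/([^/]+)/(\d+)px-")
    # e.g. /wikipedia/commons/a/ab/Example.jpg
    ORIG_RE = re.compile(r"/wikipedia/[^/]+/[0-9a-f]/[0-9a-f]{2}/([^/?]+)$")

    def size_bucket(width):
        if width < 250:
            return "small"
        if width < 600:
            return "medium"
        return "large"

    counts = Counter()
    for line in sys.stdin:
        url = line.strip()
        m = THUMB_RE.search(url)
        if m:
            counts[(m.group(1), size_bucket(int(m.group(2))))] += 1
            continue
        m = ORIG_RE.search(url)
        if m:
            counts[(m.group(1), "original")] += 1

    for (name, bucket), n in counts.most_common():
        print(f"{name}\t{bucket}\t{n}")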
The big drawback is caching: while HTML pages are not cached (in the sense
that the browser always sends a request for them), most images are [3]. I
don't think this could be realistically helped (in theory it could be
possible to calculate from the page view stats and the imagelinks table the
exact number of times an image has been displayed, but doing that on the
scale of Wikipedia pageviews is not plausible), and it might even be
considered a good thing: instead of view counts, we get a reasonable
approximation of unique visitors (image viewers).
== Tracking people "expressing interest" in a file (whatever that means) ==
Knowing how many people see the thumbnail of a file is important, but -
unlike article view counts - it does not really say how many people show
any interest in an image. It might just happen to be included in an article
they are interested in; maybe they never scrolled to the bottom of the
article and haven't seen the actual image at all. So it is useful to know
how many people clicked through to the image.
Without MediaViewer, file page view counts can be (and are) used for this
(even if they have their own problems for Commons images); with MediaViewer
there would have to be a way to tell apart a thumbnail that was requested
for use in MediaViewer vs. for some other reason. This could be done
similarly to how the ?download query parameters are handled: append a
source=mediaviewer URL parameter to the filename, have Varnish rewrite it
to avoid cache splitting, add the parameter to the UDP packet and filter on
it when processing the logs. This would be a fairly generic mechanism which
could be potentially used for other things (source=filepage,
source=hovercards etc); it would still split the cache on the browser side,
but given that the large thumbnails used by MediaViewer don't overlap much
with the thumbnail sizes used on wiki pages, this does not seem tragic.
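Filtering on such a parameter during log processing would then be simple; a minimal sketch, again assuming one requested URL per line and the hypothetical source=mediaviewer parameter:

    # Sketch: count thumbnail requests attributed to MediaViewer by filtering
    # on a hypothetical source=mediaviewer query parameter (one URL per line).
    import sys
    from collections import Counter
    from urllib.parse import parse_qs, urlsplit

    views = Counter()
    for line in sys.stdin:
        url = urlsplit(line.strip())
        if parse_qs(url.query).get("source") == ["mediaviewer"]:
            # crude: use the last path segment as the file identifier
            views[url.path.rsplit("/", 1)[-1]] += 1

    for name, n in views.most_common(20):
        print(f"{name}\t{n}")

The same kind of filter, keyed on the ?download parameter instead, could give a first approximation of the download counts discussed below.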
== Tracking the number of downloads/reuses ==
MediaViewer adds a ?download parameter to download links (to ensure that
the Content-Disposition header is set), which could be used to track
downloads. (It would miss downloads from other sources; I'm not sure there
is a generic solution. Maybe something based on referrer or some other
header, in case the browsers set those differently for downloaded and
displayed images?)
Reuse is too vague a category to say anything about. Tracking views of an
image which originate outside the Wikimedia universe seems possible, but it
is too much effort and too different from the previous methods to be worth
spending time on now.
== Tracking the interest in image metadata ==
MediaViewer collects global statistics on the ratio of people viewing an
image vs. following through to the file page vs. scrolling down to open the
panel with file metadata information. I don't think a per-file tracking of
this would be useful or worth the effort; if it is needed, it would have to
be done by some EventLogging-ish setup with MediaViewer creating tracking
gifs whenever the metadata information is opened.
What do you think, is any of this plausible?
[1] http://stats.grok.se/
[2] http://dumps.wikimedia.org/other/pagecounts-raw/
[3] looking at the effects of a page load in the network tab, it seems that
most thumbnails on a page are loaded from the browser cache, while there
are a few which are requested from the servers which respond with a 304.
Which images do that is deterministic but otherwise seems totally random;
e.g. on the current enwiki mainpage, RalphBakshiJan09.jpg is loaded from
cache while Prayuth_Jan-ocha_2010-06-17_ITN.jpg is always requested. I am
curious about the reason for this.
Hi all,
the Multimedia team is preparing to collect data to better understand
usability problems with UploadWizard. UW has a "checkout" structure (step
1: put files in basket, step 2: choose license, step 3: add description,
step 4: you are done), so a funnel analysis to identify which step causes
the most users to abort the upload process and why seems like a good
approach. I'm trying to understand how well the existing EventLogging
infrastructure supports this.
The problem is how to get information about the actions of users who fell
out of the funnel. I'll try to illustrate with an example: in one of the
steps, the user can choose between "I am uploading my own work" and "I am
uploading someone else's work" and the resulting interaction will be quite
different. We would like to know whether that choice has a big effect on
the likelihood of the user making it to the next step.
Using EventLogging, I can count the number of users who make it to that
step. I can count the number of users making it to the next step. I can
count the number of users choosing this or that author option. These
numbers do not tell us much on their own, though; the interesting
information would be how they are correlated.
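To make the limitation concrete, a minimal sketch with made-up events and hypothetical field names: the per-step counts are easy, but they say nothing about which step-2 users were lost or how that relates to the author option.

    # Sketch with made-up events and hypothetical field names: per-step counts
    # are easy, but they do not say how author choice and drop-out correlate.
    events = [
        {"user": "a", "step": 2, "authorOption": "ownwork"},
        {"user": "a", "step": 3, "authorOption": "ownwork"},
        {"user": "b", "step": 2, "authorOption": "thirdparty"},
        # user "b" never reaches step 3
    ]

    for step in sorted({e["step"] for e in events}):
        users = {e["user"] for e in events if e["step"] == step}
        print(step, len(users))
    # prints "2 2" and "3 1", but nothing here identifies *which* step-2
    # users were lost, or whether they had picked "own work".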
Another thing I could do is create a schema which includes both the
choice of author option and the step number; when the user chooses "own
work", we log an ownwork event, and when they click "next step", we log a
step(step=3, work=own) event. We can then calculate the number of users who
chose "own work" but did not make it to the next step as the difference of
the two counts. But this won't work: "own work" is a radio button, and the
user can select and deselect it any number of times before proceeding to
the next step (or leaving the page).
So what we are trying to log are not really events but application states
that describe users who are successful vs. unsuccessful in the given step.
I thought of two ways of dealing with this; any feedback on the
plausibility of these or possible alternatives would be highly appreciated.
One would be to have a "step X succeeded" and a "step X failed" event (the
schema for which could include all sorts of state, such as which authorship
option was selected). This would require the ability to log an event when
the user leaves the page. I see two ways to do that:
- send the event log as a synchronous request from an unload event handler.
This is not supported on ancient browsers; also, there is probably some
mechanism in most browsers to kill an unload event handler if it takes too long.
- store the event in cookies/localStorage, log it on the next page load.
This works in all browsers but it is less reliable (what if the user never
comes back?), logs the event for a different page load from the one where it
actually occurred (what if the user comes back after a month?), and
probably runs into all sorts of complications with multiple tabs.
The other way could be to log event chains: set a random identifier (which
only lives until the page is unloaded), and add it to every event. Event
groups can then be merged into meta-events by SQL magic, although that
looks like it will be extremely painful to do. On the other hand, this is
much more generic than the previous method, and could be used to answer
more complex questions.
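A minimal sketch of what that merging could look like once the rows are exported, assuming each event carries the per-pageview token, an action name and a timestamp; all field names and sample data here are hypothetical.

    # Sketch: fold per-pageview event chains into one funnel outcome per
    # chain. The token is the random per-pageview identifier; field names
    # and sample data are hypothetical.
    from collections import defaultdict

    events = [
        {"token": "t1", "ts": 1, "action": "step", "step": 2},
        {"token": "t1", "ts": 2, "action": "selectAuthor", "option": "ownwork"},
        {"token": "t1", "ts": 3, "action": "step", "step": 3},
        {"token": "t2", "ts": 1, "action": "step", "step": 2},
        {"token": "t2", "ts": 2, "action": "selectAuthor", "option": "thirdparty"},
        # chain t2 never reaches step 3: that upload was abandoned
    ]

    chains = defaultdict(list)
    for e in events:
        chains[e["token"]].append(e)

    for token, chain in chains.items():
        chain.sort(key=lambda e: e["ts"])
        furthest = max((e["step"] for e in chain if e["action"] == "step"),
                       default=None)
        author = next((e["option"] for e in reversed(chain)
                       if e["action"] == "selectAuthor"), None)
        print(token, furthest, author)

Doing the equivalent in pure SQL over the EventLogging tables would mean grouping and self-joining on the token, which is presumably where the pain comes in; exporting the rows and folding them in a small script like this may be the less painful route.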
What do you think? Which would be the method I am not shooting myself in
the foot with? Currently I am leaning towards using unload handlers.
This is a summary of my discussion with Maryana on mobile instrumentation needs:
Registration tagging
The ServerSideAccountCreation log only knows about mobile vs desktop registrations. Down the line we should modify it to tag new accounts by source {desktop, mobile web, mobile app, other}. For now, we can just use the existing log and extract the source by combining the userAgent and displayMobile fields. There are comments on the Schema:MobileWikiAppCreateAccount talk page that require feedback from the mobile team, if we intend to use this log for signups on apps.
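A minimal sketch of that interim extraction, combining the userAgent and displayMobile fields; the app User-Agent marker below is an assumption and would need to be checked against real logs.

    # Sketch: classify ServerSideAccountCreation rows into
    # {desktop, mobile web, mobile app, other} from userAgent + displayMobile.
    # The app User-Agent marker is an assumption, not verified against logs.
    def registration_source(user_agent, display_mobile):
        ua = (user_agent or "").lower()
        if "wikipediaapp" in ua:      # assumed marker for the official apps
            return "mobile app"
        if display_mobile:
            return "mobile web"
        if ua:
            return "desktop"
        return "other"

    # example rows as (userAgent, displayMobile) pairs
    rows = [
        ("Mozilla/5.0 (Linux; Android 4.4) WikipediaApp/2.0", False),
        ("Mozilla/5.0 (iPhone; CPU iPhone OS 7_1 like Mac OS X)", True),
        ("Mozilla/5.0 (Windows NT 6.1)", False),
    ]
    for ua, mobile in rows:
        print(registration_source(ua, mobile))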
Revision tagging
We propose the creation of a new “app” tag in MediaWiki on top of the current “mobile edit” tag, which we will keep for backward-compatibility. This will give us an easy way of counting all mobile edits combined or app-only edits. Detailed breakdown of edit volume/quality by app platform or version will be done via EventLogging instrumentation.
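Once both tags exist, the counting side is a single query against the change_tag table; a minimal sketch, with 'app' as a placeholder for whatever the new tag ends up being called:

    # Sketch: count edits per tag from the change_tag table on a Labs replica.
    # "app" is only the proposed tag name; adjust once the tag actually exists.
    import os
    import pymysql

    QUERY = """
        SELECT ct_tag, COUNT(*)
        FROM change_tag
        WHERE ct_tag IN ('mobile edit', 'app')
        GROUP BY ct_tag
    """

    conn = pymysql.connect(host="enwiki.labsdb", db="enwiki_p",
                           read_default_file=os.path.expanduser("~/.my.cnf"))
    with conn.cursor() as cur:
        cur.execute(QUERY)
        for tag, edits in cur.fetchall():
            print(tag, edits)
    conn.close()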
App instrumentation
Platform-specific logs such as Schema:MobileWikiAppEdit are an ok interim solution but we should revamp the original plans for a consolidated edit funnel instrumentation to be able to compare edit funnel events across different platforms and interfaces. Given that this data is not product/feature specific, it’s likely an Analytics responsibility to implement it.
Tablet edits
We still don’t have a consistent way of identifying and counting edits made on tablets (on our desktop or mobile site), but we can get some initial estimates by using UA data from the last 30 days.
Mobile dashboards
We will review the existing mobile dashboards to figure out which ones should be eventually migrated to Vital Signs.
Dario
The next Research & Data showcase will be live-streamed today, Wed 5/21, at 11:30 PT.
The streaming link will be posted on the lists a few minutes before the showcase starts and as usual you can join the conversation on IRC at #wikimedia-research.
We look forward to seeing you!
Dario
This month:
UX research at WMF
Introducing Abbey Ripstra, the new UX research lead at the Wikimedia Foundation.
A bird's eye view of editor activation
by Dario Taraborelli -- In this talk I will give a high-level overview of data on new editor activation, presenting longitudinal data from the largest Wikipedias, a comparison between desktop and mobile registrations and the relative activation rates of different cohorts of newbies.
Collaboration patterns in Articles for Creation
by Aaron Halfaker -- Wikipedia needs to attract and retain newcomers while also increasing the quality of its content. Yet new Wikipedia users are disproportionately affected by the quality assurance mechanisms designed to thwart spammers and promoters. English Wikipedia's en:WP:Articles for Creation provides a protected space for newcomers to draft articles, which are reviewed against minimum quality guidelines before they are published. In this presentation, I'll describe a study of how this drafting process has affected the productivity of newcomers in Wikipedia. Using a mixed qualitative and quantitative approach, I'll show that the process's pre-publication review, which is intended to improve the success of newcomers, in fact decreases newcomer productivity in English Wikipedia, and I'll offer recommendations for system designers.