Cross-posting. See the follow-up discussion in the Analytics list web archive.
---------- Forwarded message ----------
From: *Dan Andreescu* <dandreescu(a)wikimedia.org>
Date: Friday, June 5, 2015
Subject: [Analytics] Pageview API Status update
To: Analytics List <analytics(a)lists.wikimedia.org>
I just posted a comment on the famous task:
Here it is for those who would rather discuss on this list:
We have finished analyzing the intermediate hourly aggregate with all the
columns that we think are interesting. The data is too large to query and
anonymize in real time. We'd rather get an API out faster than deal with
that problem, so we decided to produce smaller "cubes"  of data for
specific purposes. We have two cubes in mind and I'll explain those here.
For each cube, we're aiming to have:
* Direct access to a postgresql database in labs with the data
* API access through RESTBase
* Mondrian / Saiku access in labs for dimensional analysis
* Data will be pre-aggregated so that any single data point has k-anonymity
(we have not determined a good k yet; a sketch of the idea follows the cube
descriptions below)
* Higher level aggregations will be pre-computed so they use all data
And, the cubes are:
**stats.grok.se Cube: basic pageview data**
Hourly resolution. Will serve the same purpose as stats.grok.se has served
for so many years. The dimensions available will be:
* project - 'Project name from requests host name'
* dialect - 'Dialect from requests path (not set if present in project name)'
* page_title - 'Page Title from requests path and query'
* access_method - 'Method used to access the pages, can be desktop, mobile
web, or mobile app'
* is_zero - 'accessed through a zero provider'
* agent_type - 'Agent accessing the pages, can be spider or user'
* referer_class - 'Can be internal, external or unknown'
**Geo Cube: geo-coded pageview data**
Daily resolution. Will allow researchers to track the flu, breaking news,
etc. Dimensions will be:
* project - 'Project name from requests hostname'
* page_title - 'Page Title from requests path and query'
* country_code - 'Country ISO code of the accessing agents (computed using
MaxMind GeoIP database)'
* province - 'State / Province of the accessing agents (computed using
MaxMind GeoIP database)'
* city - 'Metro area of the accessing agents (computed using MaxMind GeoIP
database)'
So, if anyone wants another cube, **now** is the time to speak up. We'll
probably add cubes later, but it may be a while.
 OLAP cubes: https://en.wikipedia.org/wiki/OLAP_cube
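For anyone curious what the k-anonymity aim above could look like in practice, here is a minimal, hypothetical Python sketch (not the team's actual pipeline): pre-aggregate rows by the cube dimensions and suppress any cell whose count falls below a chosen k. The dimension names follow the stats.grok.se cube; K = 100 and the sample rows are placeholders, since no real k has been chosen.

from collections import defaultdict

# Illustrative dimensions from the stats.grok.se cube above; K is a placeholder.
DIMENSIONS = ("project", "page_title", "access_method", "agent_type")
K = 100

def aggregate_with_k_anonymity(rows, k=K):
    """rows: iterable of dicts carrying the dimension fields plus a 'views' count.

    Returns only the aggregated cells whose totals reach k, so no published
    data point reveals fewer than k views.
    """
    cells = defaultdict(int)
    for row in rows:
        cells[tuple(row[d] for d in DIMENSIONS)] += row["views"]
    return {cell: views for cell, views in cells.items() if views >= k}

# Toy usage:
sample = [
    {"project": "en.wikipedia", "page_title": "Influenza",
     "access_method": "desktop", "agent_type": "user", "views": 250},
    {"project": "en.wikipedia", "page_title": "Obscure_page",
     "access_method": "desktop", "agent_type": "user", "views": 3},  # suppressed
]
print(aggregate_with_k_anonymity(sample))

In practice, low-count cells would presumably be rolled up into the pre-computed higher-level aggregations rather than simply dropped.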
Hi People Interested in Gather,
Given the reorg and the traffic being driven to beta, we need to revisit
Gather's Q4 goals:
TLDR: Continue working on Gather to finish MVP features (end of the month,
max), not pushing to stable unless we see 2x the number of logged-in edits
on beta (or >10x current state).
Before the serious stuff:
Top 5 best collections since Monday:
Okay, now the goals. Please feel free to comment in email or in this google
doc. Given the reorg and an improvement made to beta, we need to revisit
Gather's Q4 goals.
Pushing to Stable
Given that we can now test Gather numbers in beta, the original goal of
launching on stable to test adoption is no longer valid. It is expensive
in terms of future maintenance and commitments to launch features to
stable, so we only want to do so after we have proven success, if possible.
Originally, the benchmarks for success on stable that were agreed on were
low (10k creators a month on stable, and 1k shares). Given the current
beta numbers, it looks like we will blow the first number out of the
water. Share has been deprioritized given current usage patterns.
However, given that we now have 4 engineers in charge of the entire web,
the standards for what we work on have to be more rigorous. We cannot
allocate multiple engineers to an experimental product unless it shows
promise of impacting greater numbers of users.
Round out Gather hypothesis (criteria below)
By end of June know whether or not we want to push Gather to stable
1. Next Eng+PM Steps (to round out hypothesis)
Improve onboarding (a few tasks)
Surface collections publicly (this is a big missing feature)
Qualitative and quantitative research
2. Criteria for passing to stable:
We don’t have a great way to measure success of Gather based on usage by a
proportion of users or logged in users. However, we can compare to
something similar like edits. In terms of pure value to WP, we can
consider a collection to be like a low-value edit.
There are 2x as many 'good' collections made as there are total logged-in
edits: in May there were 2,180 edits by logged-in users (2,694 logged out).
At current rates this suggests ~4,500 collections per month is our target.
'Good', here, means >1 collection--it's not a perfect definition, but
it's a strong proxy.
Our current rate of 'good' collections is roughly 250 a month, so we
will need to increase the number by almost 20x (see the quick calculation
at the end of this email).
It might be worth exploring the % of those 2,180 edits that are reverted
and adjusting down accordingly.
Views of collections, or views where a collection is the referrer, > .5% of total PVs
Currently, the views of collections are minimal.
If very few people create collections, but they drive a significant
boost in page views, say .5% of total PVs (not an end goal, but also not
bad for an MVP introducing a new use case), then the feature is a success.
3. What if we don’t pass to stable:
Let's burn that bridge when we get to it :). Seriously though, until we get
some qualitative data back from our readers, we will not be able to make
important calls on the feature as is. There are a few great alternatives I
can think of right off the bat:
Keep code in beta and work on Gather opportunistically or as qualitative data comes in
One example might be to make collections private by default and
launch it as a bookmarker for readers (the primary current use case)
Promote as beta feature on desktop
Use codebase as start of multiple watchlists (some good work started
here by JRobson)
The codebase is a fairly generic list table with some basic and
interesting features built in that could be used for a number of other ends
4. Validation and why we chose these criteria:
Qualitative questions to answer:
Why aren't more people using Gather (correctly)?
Why aren't people returning to use Gather more?
Success metric questions to answer (that we can’t already):
What % of logged-in users use Gather?
What % of users who visit > 1 page use Gather?
What is our denominator?
Measure the login and signup funnel directed from Gather; what is the % success rate?
This is not instrumented
Measure % of sessions with more than 1 pageview
Working on this
Measure % of sessions with logged in users
Cannot get this
What is our baseline? Logged-in edits are probably the best thing to
compare against.
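To make the target arithmetic in the criteria above concrete, here is a quick back-of-the-envelope check using only the numbers quoted in this email:

# Numbers quoted in the criteria above; the bar is 2x logged-in edits.
logged_in_edits_may = 2180                      # logged-in edits on beta in May
target_collections = 2 * logged_in_edits_may    # 4360, roughly the ~4,500/month cited
current_good_collections = 250                  # current 'good' collections per month

print(target_collections)                                # 4360
print(target_collections / current_good_collections)     # ~17.4x, i.e. "almost 20x"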
Cross-posting for visibility. This kind of stuff gets posted to the API
lists, so be sure to subscribe to those if you aren't already subscribed.
---------- Forwarded message ----------
From: Brad Jorsch (Anomie) <bjorsch(a)wikimedia.org>
Date: Tue, Jun 2, 2015 at 1:42 PM
Subject: [Wikitech-l] API BREAKING CHANGE: Default continuation mode for
action=query will change at the end of this month
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>,
As has been announced several times (most recently at
the default continuation mode for action=query requests to api.php will be
changing to be easier for new coders to use correctly.
*The date is now set:* we intend to merge the change to ride the deployment
train at the end of June. That should be 1.26wmf12, to be deployed to test
wikis on June 30, non-Wikipedias on July 1, and Wikipedias on July 2.
If your bot or script is receiving the warning about this upcoming change
(as seen here, for
example), it's time to fix your code!
- The simplest solution is to include the "rawcontinue" parameter
with your request to continue receiving the raw continuation data.
No other code changes should be necessary.
- Or you could update your code to use the simplified continuation
documented at https://www.mediawiki.org/wiki/API:Query#Continuing_queries,
which is much easier for clients to implement correctly (a minimal example
follows below).
Either of the above solutions may be tested immediately; you'll know it
works because you stop seeing the warning.
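As a concrete illustration of the second option, here is a minimal sketch of the simplified continuation loop (my example based on the API:Query page linked above, not Brad's code). It uses the Python requests library against enwiki's api.php; the categorymembers parameters are just placeholders.

import requests

API_URL = "https://en.wikipedia.org/w/api.php"  # any MediaWiki api.php works

def query_all(request):
    """Yield each 'query' result chunk, following simplified continuation."""
    request = dict(request, action="query", format="json")
    # An empty 'continue' opts into the simplified mode even before the
    # default changes at the end of June.
    last_continue = {"continue": ""}
    while True:
        params = dict(request, **last_continue)
        result = requests.get(API_URL, params=params).json()
        if "warnings" in result:
            print(result["warnings"])
        if "query" in result:
            yield result["query"]
        if "continue" not in result:
            break
        # Merge the server's continuation values into the next request.
        last_continue = result["continue"]

# Example usage (parameters are illustrative): list a category's members.
for chunk in query_all({"list": "categorymembers",
                        "cmtitle": "Category:Physics", "cmlimit": "500"}):
    for member in chunk["categorymembers"]:
        print(member["title"])

Because the loop sends the empty continue parameter up front, the same code behaves identically before and after the default flips.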
I've compiled a list of bots that have hit the deprecation warning more
than 10000 times over the course of the week May 23–29. If you are
responsible for any of these bots, please fix them. If you know who is,
please make sure they've seen this notification. Thanks.
Brad Jorsch (Anomie)
This version has a lot of visual updates, performance improvements, and bug
fixes - we even tweaked the icon. Overall this is a pretty nice release
that applies some much needed polish in a lot of places.
Please check it out when you get a chance and let us know what you think.
Mobile Apps / iOS
Just wanted to send my notes on the hack demos from this weekend, and
potentially start a discussion about shipping some of them!
*Apple Watch: *Corey & Jason did an incredible job on this, and I'm
essentially sold that a watch companion app could augment our *read later* and
*search* flows for what seems like a reasonable development cost.
Not sure who did it, but server-side image/upload validation could further
facilitate cross-platform mobile uploads?
*Image surface content gap*: IOW, which pages are most in need of
pictures. Seems like a "micro edit" workflow that could work well for apps
when combined with location/geo-fencing.
*Haikus from recent changes:* More of a technical inspiration: I'd love to
discuss building a service that sends push notifications in response to RC
events.
*Dmitry's "Wikipedia Lite" hack*: I think we should do some prototypes on
this. In particular, I'd like to play around with parsoid to create a
"reader view" for Wikipedia pages. Aside from streamlining content for
mobile and improving performance, this could also make it easier to bring
back a lot of reader-centric features. Dynamic font sizes & color/contrast
configurations are two things I see pop up from time to time in OTRS &
*Bernd's map view for Nearby:* This seems like a no-brainer. I sent a
separate email to mobile-l, because it seems like some progress has been
made in this area since I last looked into it.
What were your favorite hacks?
EN Wikipedia user page: https://en.wikipedia.org/wiki/User:Brian.gerstle
Thanks for sharing this, Adam. Aside from engagement/funnel data, the critical question for this feature is: does it bring back eyeballs to the site from social media? It looks like it doesn’t yet, at least not in a substantial way, even with the caveat that App traffic is a very small fraction of total mobile traffic.
Having looked into referrals for this feature before and compared them to Twitter’s own engagement analytics (and found some big discrepancies), you should consider removing spiders/crawlers from the data (see the link below) to avoid inflating pageviews with non-human activity.
I’m a big fan of this feature and look forward to seeing how you guys intend to scale it.
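On the spider/crawler suggestion, here is a hedged sketch (mine, not part of Adam's analysis) of how the wprov query quoted further down might exclude crawler traffic. It assumes the refined webrequest table exposes an agent_type field like the one planned for the pageview cubes earlier in this digest; if it does not, a user-agent based filter would be needed instead.

# Hypothetical variant of the Share a Fact wprov query (quoted below) that
# drops requests classified as spiders. 'agent_type' is an assumption here;
# verify the field exists in the refined webrequest table before relying on it.
WPROV_PAGEVIEWS_HUMANS_ONLY = """
select wprov, uri_host, count(*)
from (
  select x_analytics_map['wprov'] as wprov, uri_host
  from webrequest
  where year = 2015 and month = 5 and day = 12 and hour = 1
    and is_pageview = true
    and agent_type = 'user'              -- exclude spider/crawler traffic
    and uri_host like '%.wikipedia.org'
    and x_analytics_map['wprov'] is not null
) t
group by wprov, uri_host;
"""

print(WPROV_PAGEVIEWS_HUMANS_ONLY)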
https://github.com/ewulczyn/wmf/blob/b9f726ee3468852c3fed2780af1d8ac0004eda…
> On May 21, 2015, at 12:37 PM, Toby Negrin <tnegrin(a)wikimedia.org> wrote:
> Hi all - some interesting analysis on the share-a-fact feature from the mobile team.
> Begin forwarded message:
>> From: Adam Baso <abaso(a)wikimedia.org>
>> Date: May 21, 2015 at 12:05:29 PDT
>> To: mobile-l <mobile-l(a)lists.wikimedia.org>
>> Subject: [WikimediaMobile] Share a Fact Initial Analysis
>> Hello all,
>> We’ve been looking at some initial results from the Share a Fact feature introduced on the Wikipedia apps for Android and iOS in its basic "minimal viable product" implementation. Here’s some analysis, using data from one day (20150512) with respect to the latest stable versions of the apps (2.0-r-2015-04-23 on Android and 4.1.2 on iOS) for that day.
>> * On iOS, when a user initiates the first step of the default sharing workflow - tapping the up-arrow box share button (6,194 non-highlighting instances for the day under question) - about 11.7% of the time it yielded successful sharing.
>> * On Android, it’s not possible to easily tell when the sharing workflow was carried through to successful share, but we anticipate the Android success rate is currently much higher, as general engagement percentage up to the point of picking an app for sharing is higher on Android than on iOS.
>> * On Android, when presented with the share card preview, 28.0% of the time the ‘Share as image’ button was tapped and 55.5% of the time the 'Share as text' button was tapped, whereas on iOS it was 8.4% ‘Share as image’ and 16.8% ‘Share as text’.
>> * The forthcoming 4.1.4 version of the iOS app will relax its default sharing snippet generation rules and be more like the Android version in that respect. We anticipate this will result in higher engagement with both the ‘Share as image’ and ‘Share as text’ buttons on iOS, and we should be able to verify this once the 4.1.4 iOS version is released and generally adopted (usually takes 4-5 days after release; the 4.1.4 release isn’t released yet).
>> * On the Android app the ‘Share’ option is located on the overflow menu, not as part of the main set of UI buttons. This potentially increases the likelihood of Android users being primed to step through the workflow. On the iOS app, the share button (up-arrow box) is plainly visible from the main UI and not an overflow menu, and this probably creates a different priming dynamic for the iOS demographic.
>> * When users on iOS tapped on the ‘Share as image’ or ‘Share as text’ buttons, there is a pretty sharp drop off at the next stage - the system sharesheet. Once the sharesheet was presented to iOS users, 41.6% of the time it resulted in active abandonment. We believe this probably has something to do with the relatively small set of default apps listed on the sharesheet and the extra work involved with exposing additional social apps for sharing in that context. As with the Android app, the labels of ‘Share as image’ and ’Share as text’ may also pose something of a hurdle at least for first time users of the feature. To this end, there is an onboarding tutorial planned at least on Android.
>> * For a one hour period (2015051201) there were about 100 pageviews in some sense attributable to Share a Fact using a provenance parameter available on the latest stable versions of the apps at that time; this may slightly overstate the number of pageviews attributable to the two specific apps reviewed in this analysis, but probably not too much (n.b., previously a different source parameter was used than the new wprov provenance parameter). Pageviews are not the sole motivation for the feature, but following the trendline over the long run should be interesting. Impact on social media and the destinations of shares is a little harder to capture directly, but https://twitter.com/search?f=realtime&q=%40wikipedia%20-%40itzwikipedia%20f… gives one a sense about image shares, at least.
>> * A couple potential options for increasing sharing include:
>> ** Trying to add support for sharing to the Photos app on iOS. People may be interested in using images from the Photos apps for various workflows, as Dan Garry has noted.
>> ** Offering a more concise app picklist, in particular explicitly adding the native OS app components (namely, Twitter and Facebook, and as mentioned, Photos if possible), with an option to expose the sharesheet for additional options if necessary. This is probably also somewhat confined to iOS, although conceivably a similar approach could be possible on Android. On Android the full list of applications in its equivalent of the sharesheet is by default readily available to the user, though.
>> ** On Android, exposing the diagonal arrow share button on the main interface akin to how the iOS version of the app shows the up-arrow share button. This may introduce more opportunities for sharing (and thus numbers of abandons would go up in tandem with numbers of shares), but would also partially clutter the interface and probably increase abandon. A controlled experiment may be useful for observing the impact of such an approach.
>> * As a point of reference, for the app versions in scope for this analysis over a single day, there appeared to be approximately 3.78 million Wikipedia for Android pageviews and 1.19 million Wikipedia Mobile for iOS app pageviews. There were about 6.73 million app pageviews on the “modern” versions of these apps total for this particular day, meaning there were about 1.75 million pageviews on other modern versions of the app.
>> * Examination of the categories of successful shares on iOS showed the following distributions:
>> 48.5% messaging
>> 25.5% sharesheet copy
>> 22.9% social
>> 1.8% productivity
>> 0.9% reading
>> 53.6% messaging
>> 31.9% sharesheet copy
>> 7.1% social
>> 5.4% reading
>> 2.0% productivity
>> Here were some queries used in the analysis:
>> == SHARE A FACT ATTRIBUTABLE PAGEVIEWS FOR ONE HOUR ==
>> select wprov, uri_host, count(*) from (select x_analytics_map['wprov'] as wprov, uri_host
>> from webrequest where year = 2015 and month = 5 and day = 12 and hour = 1 and is_pageview = true and uri_host like '%.wikipedia.org' and x_analytics_map['wprov'] is not null) t
>> group by wprov, uri_host;
>> == PAGE VIEWS FOR THE DAY FOR THE “MODERN” VERSIONS OF THE APPS ==
>> SELECT user_agent, count(*)
>> FROM webrequest tablesample(BUCKET 1 OUT OF 100 ON rand())
>> WHERE YEAR = 2015
>> AND MONTH = 5
>> AND DAY = 12
>> AND is_pageview = TRUE
>> AND lower(uri_host) like '%.wikipedia.org'
>> AND user_agent like 'WikipediaApp%'
>> GROUP BY user_agent;
>> == HIGHLIGHTING SESSION CASE FOR SPECIFIC VERSIONS OF THE APPS ==
>> select CASE
>> WHEN t2.userAgent LIKE 'WikipediaApp/2.0-r-2015-04-23%' THEN 'Android'
>> WHEN t2.userAgent LIKE 'WikipediaApp/4.1.2%' THEN 'iOS'
>> END AS 'ua', t1.event_action, t1.event_sharemode, t1.event_target, count(*)
>> from MobileWikiAppShareAFact_11331974 t1
>> inner join MobileWikiAppShareAFact_11331974 t2
>> on t1.event_shareSessionToken = t2.event_shareSessionToken
>> where t1.timestamp > '20150512' and t1.timestamp < '20150513'
>> and t2.timestamp > '20150512' and t2.timestamp < '20150513'
>> and t1.event_action != 'highlight' and t2.event_action = 'highlight'
>> and (t2.userAgent like 'WikipediaApp/2.0-r-2015-04-23%' or t2.userAgent like 'WikipediaApp/4.1.2%')
>> group by ua, t1.event_action, t1.event_sharemode, t1.event_target;
>> == NON-HIGHLIGHTING SESSION CASE FOR SPECIFIC VERSIONS OF THE APPS ==
>> n.b., subtract the highlighting cases from the non-highlighting cases to arrive at the default sharing behavior. Technically, inner joins can be used to do more comprehensive session analysis, but the queries take a long time.
>> select CASE
>> WHEN userAgent LIKE 'WikipediaApp/2.0-r-2015-04-23%' THEN 'Android'
>> WHEN userAgent LIKE 'WikipediaApp/4.1.2%' THEN 'iOS'
>> END AS 'ua', event_action, event_sharemode, event_target,
>> count(*) from MobileWikiAppShareAFact_11331974 where timestamp > '20150512' and timestamp < '20150513' and (userAgent like 'WikipediaApp/2.0-r-2015-04-23%' or userAgent like 'WikipediaApp/4.1.2%') group by ua, event_action, event_sharemode, event_target;