I work for a consulting firm called Strategy&. We have been engaged by Facebook on behalf of Internet.org to conduct a study assessing the state of connectivity globally. One key area of focus is the availability of relevant online content. We are using the availability of encyclopedic knowledge in one's primary language as a proxy for relevant content, which we define as 100K+ Wikipedia articles in that language. We have a few questions about this analysis before publishing it:
* We are currently taking the article count by language from the Wikimedia Foundation's public list at http://meta.wikimedia.org/wiki/List_of_Wikipedias. Is this a reliable source for article counts, and does it include stubs?
* Is it possible to get historical data for article counts? It would be great to track how the metric we have defined evolves over time.
* What are the biggest drivers you've seen for step changes in the number of articles (e.g., number of active admins, machine translation, etc.)?
* We had to map Wikipedia language codes to the ISO 639-3 language codes in Ethnologue (the source we are using for primary language data). The two-letter language code for a Wikipedia in the "List of Wikipedias" sometimes, but not always, matches the ISO 639-1 code. Is there an easy way to do the mapping?
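One way to approach the mapping question in the last bullet is a lookup that handles known exceptions first and falls back to a standard ISO 639-1 to ISO 639-3 table. The sketch below is only an illustration: the function name `wiki_code_to_iso639_3` is hypothetical, both tables are tiny hand-picked samples rather than a complete mapping, and the rule that any three-letter Wikipedia code is already an ISO 639-3 code is an assumption that does not hold for every wiki.

```python
# Illustrative sample tables only; a real mapping would need the full
# ISO 639 code tables and a complete list of Wikipedia exceptions.

# Wikipedia subdomain codes that do not map directly onto ISO codes.
WIKI_EXCEPTIONS = {
    "simple": "eng",      # Simple English Wikipedia
    "als": "gsw",         # Alemannic wiki; ISO 639-3 "als" is Tosk Albanian
    "zh-yue": "yue",      # Cantonese
    "zh-min-nan": "nan",  # Min Nan
    "bat-smg": "sgs",     # Samogitian
}

# A few ISO 639-1 (two-letter) to ISO 639-3 (three-letter) pairs.
ISO_639_1_TO_3 = {"en": "eng", "fr": "fra", "de": "deu", "ar": "ara"}


def wiki_code_to_iso639_3(code):
    """Return an ISO 639-3 code for a Wikipedia language code, or None."""
    code = code.lower()
    if code in WIKI_EXCEPTIONS:
        return WIKI_EXCEPTIONS[code]
    if len(code) == 2:
        return ISO_639_1_TO_3.get(code)
    if len(code) == 3 and code.isalpha():
        # Assumption: three-letter wiki codes are already ISO 639-3.
        return code
    return None  # unknown code: needs manual review


for wiki in ["fr", "als", "zh-yue", "ceb", "zh-classical"]:
    print(wiki, "->", wiki_code_to_iso639_3(wiki))
```

Codes that come back as `None` are the ones to review by hand; checking the exception table before the generic rules is what keeps a code like `als` from being silently mapped to the wrong language.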
Formerly Booz & Company
Rawia Abdel Samad
Direct: +9611985655 | Mobile: +97455153807
I'm inquiring about the delay for publishing the January compressed Wikistats files that are maintained by Erik Zachte. I'm guessing those processes are given a low priority compared to the content backups that need to run. More generally, I'm interested in finding new ways that I can help out. I'm an ex-Microsoftie who is now on the fraud analytics team at TD Bank. I've been involved with the Wikimedia group in Atlanta. I organize the picnic each summer, and helped get the rest of the historic buildings photographed. I've dabbled in reverting vandalism, and I contribute to articles when I actually have something to contribute. I don't feel like I've settled into a contributor role that really fits me yet though.
I enjoy using a variety of the traffic data sets that Wikimedia publishes. It seems the traffic servers get bogged down sometimes though. Can I help? Should I try to get the Atlanta group to pool our donations this year for an extra computer?
My username is rbaasland and I would like to contribute to the analytics
project. I was wondering if I could have access to the project, or how I go
about contributing to this project?
Thank you very much,
I realized I don't get any responses from internal, but Joseph sent me
something helpful this morning, so I saw all the responses up to that
point. I think.
Anyway, thanks for the help!! The strange thing for me seems to be that
the numbers I get don't make that much sense to me.
For beta, (using query below) I get:
Unique IPs   num_pvs   referrer
3638         5967      external
1972         5760      internal
I would have expected a much larger external-to-internal referrer ratio. In
other words, I would have expected the vast majority of sessions, or even
IPs, to hit the site only once in a given hour. Instead, I am seeing that
54% of IPs (1972 of 3638) are clicking a link within that hour. I would
expect to see no more than 10%.
I am probably doing something wrong, right? I *know* that I am making
convenient assumptions here that do not apply to edge cases, so let's not
consider those unless you think they make a big difference. Perhaps by
using the referer field I am inherently leaving out all of the external
traffic for which we do not have data?
SELECT
  COUNT(DISTINCT ip) AS Unique_IPs,
  x_analytics_map['mf-m'] AS mobile_site,
  COUNT(*) AS num_pvs,
  CASE WHEN referer LIKE "%en.m.wikipedia%" THEN 'internal' ELSE 'external'
  END AS session_depth
-- The FROM line is missing from the archived message; webrequest is assumed.
FROM webrequest
WHERE TRUE = TRUE
  AND webrequest_source = 'mobile'
  AND year = 2015
  AND month = 5
  AND day = 25
  AND hour = 1
  AND agent_type = "user"
  AND is_pageview = TRUE
  AND x_analytics_map['mf-m'] IS NOT NULL
  AND uri_host LIKE "%en.m.wikipedia.org%"
GROUP BY
  x_analytics_map['mf-m'],
  CASE WHEN referer LIKE "%en.m.wikipedia%" THEN 'internal' ELSE 'external' END
ORDER BY Unique_IPs DESC;
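As a quick check on the figures reported above, the following sketch recomputes the 54% share and the pageviews per IP for each referrer class. The function names are hypothetical, and the share is only meaningful under the assumption made in the message that every IP seen with an internal referer was also seen with an external one.

```python
def internal_share(internal_ips, external_ips):
    """Share of externally referred IPs that also show an internal referer.

    Assumes every 'internal' IP is also counted among the 'external' IPs.
    """
    return internal_ips / external_ips


def pageviews_per_ip(pageviews, unique_ips):
    """Average number of pageviews per distinct IP in one referrer class."""
    return pageviews / unique_ips


# Figures from the result table in the message above.
print(f"internal share of IPs: {internal_share(1972, 3638):.1%}")
print(f"external pageviews/IP: {pageviews_per_ip(5967, 3638):.2f}")
print(f"internal pageviews/IP: {pageviews_per_ip(5760, 1972):.2f}")
```

Since several users can share one IP address, and one user can change IPs, these ratios are at best a rough stand-in for session depth.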
On Thu, May 28, 2015 at 2:30 PM, Jon Katz <jkatz(a)wikimedia.org> wrote:
> Trying to run a Hive query to get a rough count of 1-page-only
> 'sessions' on mobile web. Here is the error I get:
> FAILED: ParseException line 15:22 missing KW_END at 'device_family' near
> line 15:35 missing EOF at ''] <> "Spider"\n AND is_pageview = TRUE\n AND
> x_analytics_map['' near 'device_family
> Here is the query:
> COUNT(DISTINCT ip) AS hits,
> x_analytics_map['mf-m'] AS mobile_site, count(*) AS num_pvs,
> WHEN referer LIKE "%en.m.wikipedia%"
> THEN 'internal'
> ELSE 'Misc’
> END AS session_depth
> YEAR = 2015
> AND MONTH = 5
> AND DAY = 25
> AND user_agent_map['device_family'] <> "Spider"
> AND is_pageview = TRUE
> AND x_analytics_map['mf-m'] IS NOT NULL
> AND uri_host like "%en.m.wikipedia.org%"
> GROUP BY session_depth, mobile_site
> ORDER BY hits DESC
> LIMIT 50;
> Any advice?
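One likely cause of the ParseException quoted above is the string 'Misc’, which opens with an ASCII apostrophe but closes with a typographic right single quotation mark (U+2019). A parser would then treat the string as running on to the next ASCII apostrophe, which is at ['device_family'], matching the position in the error. The sketch below shows one way to scan a query for such characters before submitting it; `find_smart_quotes` is a hypothetical helper, not part of Hive.

```python
# Typographic quotes that editors and mail clients often substitute
# for ASCII ' and ".
SMART_QUOTES = {"\u2018", "\u2019", "\u201c", "\u201d"}


def find_smart_quotes(sql):
    """Return (line_number, column, character) for each typographic quote.

    Line numbers start at 1 and columns at 0.
    """
    hits = []
    for line_no, line in enumerate(sql.splitlines(), start=1):
        for col, ch in enumerate(line):
            if ch in SMART_QUOTES:
                hits.append((line_no, col, ch))
    return hits


# Fragment of the quoted query, including the mismatched quote.
query = (
    'WHEN referer LIKE "%en.m.wikipedia%"\n'
    "THEN 'internal'\n"
    "ELSE 'Misc\u2019\n"
    "END AS session_depth"
)

for line_no, col, ch in find_smart_quotes(query):
    print(f"line {line_no}, column {col}: U+{ord(ch):04X}")
```

Replacing the flagged character with a plain apostrophe should let the CASE expression close properly.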
Are there any easily accessible statistics about the survival rate of
newly created pages in Wikipedias in different languages?
I need this for understanding the success of ContentTranslation, which is
primarily an article creation tool.
I couldn't find something like this in stats.wikimedia.org. It does have
the number of created pages per day. For en.wikipedia, for example, it's
about 800. But how many are deleted the same day ("speedy")? Knowing that
alone would be very useful, and there are other possible questions, such
as: How many are deleted within a week or a month? What is the age
distribution of the articles that are deleted every day - how many of them
were created the same day, how many were created a year ago, and so on.
Using a simple (and possibly wrong - I don't do this often) query, I
found that around 500 or 600 deletions happen each day in the English
Wikipedia. Does this sound sensible? Is there a better query that I could
run, or a dashboard where I could see such a thing conveniently? And of
course, I'd love to see it for all languages and not just English.
Thanks for any help!
SELECT max(ar_id), ar_title, ar_timestamp
FROM `archive`
WHERE ar_namespace = 0
  AND ar_timestamp BETWEEN 20150521000000 AND 20150521999999
GROUP BY ar_title
ORDER BY NULL;
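For the age-distribution question, one possible approach is to pair each deleted page's creation time with its deletion time and bucket the difference. The sketch below assumes such pairs are already available as MediaWiki-style YYYYMMDDHHMMSS strings. The sample rows, bucket boundaries, and function name are made up for illustration. It may also be worth checking against the MediaWiki schema documentation whether `ar_timestamp` records the original revision time rather than the deletion time, in which case deletion times would have to come from the deletion log instead.

```python
from collections import Counter
from datetime import datetime

MW_FORMAT = "%Y%m%d%H%M%S"  # MediaWiki timestamp layout, e.g. 20150521000000


def age_bucket(created, deleted):
    """Classify a deleted page by its age at the time of deletion."""
    age = datetime.strptime(deleted, MW_FORMAT) - datetime.strptime(created, MW_FORMAT)
    if age.days < 1:
        return "under 1 day"
    if age.days < 7:
        return "under 1 week"
    if age.days < 30:
        return "under 1 month"
    if age.days < 365:
        return "under 1 year"
    return "1 year or more"


# Hypothetical (created, deleted) pairs; real ones would come from the database.
sample = [
    ("20150521080000", "20150521093000"),
    ("20150518120000", "20150521120000"),
    ("20140301000000", "20150521000000"),
]

distribution = Counter(age_bucket(created, deleted) for created, deleted in sample)
for bucket, count in distribution.items():
    print(bucket, count)
```

The share of pages in the first bucket would approximate the same-day ("speedy") deletions asked about above.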
Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
“We're living in pieces,
I want to live in peace.” – T. Moore
Hi all - some interesting analysis on the share-a-fact feature from the mobile team.
Begin forwarded message:
> From: Adam Baso <abaso(a)wikimedia.org>
> Date: May 21, 2015 at 12:05:29 PDT
> To: mobile-l <mobile-l(a)lists.wikimedia.org>
> Subject: [WikimediaMobile] Share a Fact Initial Analysis
> Hello all,
> We’ve been looking at some initial results from the Share a Fact feature introduced on the Wikipedia apps for Android and iOS in its basic "minimal viable product" implementation. Here’s some analysis, using data from one day (20150512) with respect to the latest stable versions of the apps (2.0-r-2015-04-23 on Android and 4.1.2 on iOS) for that day.
> * On iOS, when a user initiates the first step of the default sharing workflow - tapping the up-arrow box share button (6,194 non-highlighting instances for the day under question) - about 11.7% of the time it yielded successful sharing.
> * On Android, it’s not possible to easily tell when the sharing workflow was carried through to successful share, but we anticipate the Android success rate is currently much higher, as general engagement percentage up to the point of picking an app for sharing is higher on Android than on iOS.
> * On Android, when presented with the share card preview, 28.0% of the time the ‘Share as image’ button was tapped and 55.5% of the time the 'Share as text' button was tapped, whereas on iOS it was 8.4% ‘Share as image’ and 16.8% ‘Share as text’.
> * The forthcoming 4.1.4 version of the iOS app will relax its default sharing snippet generation rules and be more like the Android version in that respect. We anticipate this will result in higher engagement with both the ‘Share as image’ and ‘Share as text’ buttons on iOS, and we should be able to verify this once the 4.1.4 iOS version is released and generally adopted (adoption usually takes 4-5 days after release; 4.1.4 isn’t released yet).
> * On the Android app the ‘Share’ option is located on the overflow menu, not as part of the main set of UI buttons. This potentially increases the likelihood of Android users being primed to step through the workflow. On the iOS app, the share button (up-arrow box) is plainly visible from the main UI and not an overflow menu, and this probably creates a different priming dynamic for the iOS demographic.
> * When users on iOS tapped on the ‘Share as image’ or ‘Share as text’ buttons, there is a pretty sharp drop off at the next stage - the system sharesheet. Once the sharesheet was presented to iOS users, 41.6% of the time it resulted in active abandonment. We believe this probably has something to do with the relatively small set of default apps listed on the sharesheet and the extra work involved with exposing additional social apps for sharing in that context. As with the Android app, the labels of ‘Share as image’ and ’Share as text’ may also pose something of a hurdle at least for first time users of the feature. To this end, there is an onboarding tutorial planned at least on Android.
> * For a one hour period (2015051201) there were about 100 pageviews in some sense attributable to Share a Fact using a provenance parameter available on the latest stable versions of the apps at that time; this may slightly overstate the number of pageviews attributable to the two specific apps reviewed in this analysis, but probably not too much (n.b., previously a different source parameter was used than the new wprov provenance parameter). Pageviews are not the sole motivation for the feature, but following the trendline over the long run should be interesting. Impact on social media and the destinations of shares is a little harder to capture directly, but https://twitter.com/search?f=realtime&q=%40wikipedia%20-%40itzwikipedia%20f… gives one a sense about image shares, at least.
> * A couple potential options for increasing sharing include:
> ** Trying to add support for sharing to the Photos app on iOS. People may be interested in using images from the Photos apps for various workflows, as Dan Garry has noted.
> ** Offering a more concise app picklist, in particular explicitly adding the native OS app components (namely, Twitter and Facebook, and as mentioned, Photos if possible), with an option to expose the sharesheet for additional options if necessary. This is probably also somewhat confined to iOS, although conceivably a similar approach could be possible on Android. On Android the full list of applications in its equivalent of the sharesheet is by default readily available to the user, though.
> ** On Android, exposing the diagonal arrow share button on the main interface akin to how the iOS version of the app shows the up-arrow share button. This may introduce more opportunities for sharing (and thus numbers of abandons would go up in tandem with numbers of shares), but would also partially clutter the interface and probably increase abandon. A controlled experiment may be useful for observing the impact of such an approach.
> * As a point of reference, for the app versions in scope for this analysis over a single day, there appeared to be approximately 3.78 million Wikipedia for Android pageviews and 1.19 million Wikipedia Mobile for iOS app pageviews. There were about 6.73 million app pageviews on the “modern” versions of these apps total for this particular day, meaning there were about 1.75 million pageviews on other modern versions of the app.
> * Examination of the categories of successful shares on iOS showed the following distributions:
> 48.5% messaging
> 25.5% sharesheet copy
> 22.9% social
> 1.8% productivity
> 0.9% reading
> 53.6% messaging
> 31.9% sharesheet copy
> 7.1% social
> 5.4% reading
> 2.0% productivity
> Here were some queries used in the analysis:
> == SHARE A FACT ATTRIBUTABLE PAGEVIEWS FOR ONE HOUR ==
> select wprov, uri_host, count(*) from (select x_analytics_map['wprov'] as wprov, uri_host
> from webrequest where year = 2015 and month = 5 and day = 12 and hour = 1 and is_pageview = true and uri_host like '%.wikipedia.org' and x_analytics_map['wprov'] is not null) t
> group by wprov, uri_host;
> == PAGE VIEWS FOR THE DAY FOR THE “MODERN” VERSIONS OF THE APPS ==
> SELECT user_agent, count(*)
> FROM webrequest TABLESAMPLE(BUCKET 1 OUT OF 100 ON rand())
> WHERE YEAR = 2015
> AND MONTH = 5
> AND DAY = 12
> AND is_pageview = TRUE
> AND lower(uri_host) like '%.wikipedia.org'
> AND user_agent like 'WikipediaApp%'
> GROUP BY user_agent;
> == HIGHLIGHTING SESSION CASE FOR SPECIFIC VERSIONS OF THE APPS ==
> select CASE
> WHEN t2.userAgent LIKE 'WikipediaApp/2.0-r-2015-04-23%' THEN 'Android'
> WHEN t2.userAgent LIKE 'WikipediaApp/4.1.2%' THEN 'iOS'
> END AS 'ua', t1.event_action, t1.event_sharemode, t1.event_target,
> count(*)
> from MobileWikiAppShareAFact_11331974 t1
> inner join MobileWikiAppShareAFact_11331974 t2
> on t1.event_shareSessionToken = t2.event_shareSessionToken
> where t1.timestamp > '20150512' and t1.timestamp < '20150513'
> and t2.timestamp > '20150512' and t2.timestamp < '20150513'
> and t1.event_action != 'highlight' and t2.event_action = 'highlight'
> and (t2.userAgent like 'WikipediaApp/2.0-r-2015-04-23%' or t2.userAgent like 'WikipediaApp/4.1.2%')
> group by ua, t1.event_action, t1.event_sharemode, t1.event_target;
> == NON-HIGHLIGHTING SESSION CASE FOR SPECIFIC VERSIONS OF THE APPS ==
> n.b., subtract the highlighting cases from the non-highlighting cases to arrive at the default sharing behavior. Technically, inner joins can be used to do more comprehensive session analysis, but the queries take a long time.
> select CASE
> WHEN userAgent LIKE 'WikipediaApp/2.0-r-2015-04-23%' THEN 'Android'
> WHEN userAgent LIKE 'WikipediaApp/4.1.2%' THEN 'iOS'
> END AS 'ua', event_action, event_sharemode, event_target,
> count(*)
> from MobileWikiAppShareAFact_11331974
> where timestamp > '20150512' and timestamp < '20150513'
> and (userAgent like 'WikipediaApp/2.0-r-2015-04-23%' or userAgent like 'WikipediaApp/4.1.2%')
> group by ua, event_action, event_sharemode, event_target;
I was advised to write to the list and look for help here :)
The situation itself is explained here:
We have the site of our chapter, Wikimedia RU, located on WMF servers and
that's why we have limited access to the management of the site.
We need some statistics about visitors, views, etc. (and we do agree to
but the statistic itself is essential: we promote our website via Google
Ad, we are going to gather funds via our website - that's why we need to
understand the effectiveness of our efforts.
So, what would you suggest for us in this case?
We thought about the Piwik extension for the website, but the ticket has
not moved for quite a long time.
2015-05-25 0:30 GMT+03:00 Aaron Halfaker <aaron.halfaker(a)gmail.com>:
> Analytics public mailing list
> Send a message here explaining what you are looking for. I'll work with
> Analytics Dev. to discuss a solution that we could put in place.
http://searchdata.wmflabs.org/ - boop! This was my Friday. Previously
we were playing around with them and testing what we needed with a
static snapshot; these dashboards will now update once a day with new
It has turned up some bugs ("is the mobile schema just not running?")
and there are more metrics to add. But for the time being, is progress