We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes (see the example query after this list):
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia
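For example, to answer the first item above, once the TSV has been loaded into a table, the most-clicked links from a given article can be pulled with a query along these lines (a sketch only: the table name clickstream and the columns prev_title, curr_title, and n are assumptions; check the dataset's documentation for the actual schema):

SELECT curr_title, SUM(n) AS clicks
FROM clickstream             -- assumed name for the loaded TSV
WHERE prev_title = 'London'  -- the referring article
GROUP BY curr_title
ORDER BY clicks DESC
LIMIT 20;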
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
Ellery and Dario
Hello,
I work for a consulting firm called Strategy&. We have been engaged by Facebook on behalf of Internet.org to conduct a study assessing the state of connectivity globally. One key area of focus is the availability of relevant online content. We are using the availability of encyclopedic knowledge in one's primary language as a proxy for relevant content, which we define as 100K+ Wikipedia articles in one's primary language. We have a few questions related to this analysis prior to publishing it:
* We are currently using the article count by language from the Wikimedia Foundation's public page http://meta.wikimedia.org/wiki/List_of_Wikipedias. Is this a reliable source for article counts, and does it include stubs?
* Is it possible to get historical data for article counts? It would be great to monitor the evolution of the metric we have defined over time.
* What are the biggest drivers you've seen for step changes in the number of articles (e.g., number of active admins, machine translation, etc.)?
* We had to map Wikipedia language codes to ISO 639-3 language codes in Ethnologue (the source we are using for primary-language data). The two-letter code for a Wikipedia language in the "List of Wikipedias" sometimes matches the ISO 639-1 code, but not always. Is there an easy way to do the mapping?
Many Thanks,
Rawia
Formerly Booz & Company
Rawia Abdel Samad
Direct: +9611985655 | Mobile: +97455153807
Email: Rawia.AbdelSamad(a)strategyand.pwc.com
www.strategyand.com
Hi Analytics,
I have a query help question. I'm trying to run the following to see how many
clicks are generated by ~3000 pages on English Wikipedia. I get 0 results
when I run this query, but that seems unlikely, as it would mean nobody clicked
on the mobile web's Barack Obama page (or any of the other pages I am
looking at) during the hour in question. I am almost sure I have a stupid
'where' clause in here that is blocking me. Any thoughts?
Thanks,
J
The table of pages I am looking at is in Hive under
jkatz.browse_prototype_page_list4.
Here is a sample:
category title url
Category:20th-century American politicians Al Gore Al_Gore
Category:20th-century American politicians Barbara Boxer Barbara_Boxer
Category:20th-century American politicians Ben Cardin Ben_Cardin
Here is my query:
SELECT * from jkatz.browse_prototype_page_list4 L
INNER JOIN (
SELECT count(*) AS num_pvs, referer, uri_path
FROM wmf.webrequest
WHERE TRUE = TRUE
AND webrequest_source = 'mobile'
AND year = 2015
AND month = 5
AND day = 25
AND hour = 1
AND agent_type <> 'spider'
AND is_pageview = TRUE
AND x_analytics_map['mf-m'] IS NULL
AND access_method = 'mobile web'
AND uri_host like '%en.wikipedia.org%'
GROUP BY
referer, uri_path
) R
on R.referer = concat('http://en.m.wikipedia.org/wiki/', L.url)
ORDER BY category
LIMIT 50;
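To help narrow it down, here is the inner subquery on its own with the host filter loosened (a sketch only, not a verified fix): note that 'en.m.wikipedia.org' does not contain the substring 'en.wikipedia.org', and that referers may arrive as https rather than http, so both of those conditions seem worth double-checking.

SELECT count(*) AS num_pvs, referer, uri_path
FROM wmf.webrequest
WHERE webrequest_source = 'mobile'
  AND year = 2015 AND month = 5 AND day = 25 AND hour = 1
  AND agent_type <> 'spider'
  AND is_pageview = TRUE
  AND x_analytics_map['mf-m'] IS NULL
  AND access_method = 'mobile web'
  -- loosened from '%en.wikipedia.org%', which does not match en.m.wikipedia.org
  AND uri_host LIKE '%wikipedia.org%'
GROUP BY referer, uri_path
LIMIT 20;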
Hi,
For the article recommendation test, we queried the user table to get editors'
email addresses. We then excluded the emails that were not verified. We've
received a comment here
<https://meta.wikimedia.org/wiki/Research_talk:Increasing_article_coverage#U…>
that suggests the user has changed his/her email address and we have
somehow retained the old one. I'd like to get to the bottom of
this problem. Can someone help with this, on the Talk page or here? Are we
looking at the wrong table? And, in general, how can old information be in
the user table?
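For context, the selection was roughly of this shape (a minimal sketch; column names follow MediaWiki's user table, where user_email_authenticated is NULL for unverified addresses, and the exact statement may have differed):

SELECT user_id, user_name, user_email
FROM user
WHERE user_email IS NOT NULL
  AND user_email != ''
  -- only addresses the user has verified
  AND user_email_authenticated IS NOT NULL;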
Sorry for sending an email over the weekend. A response on Monday would be
great. :-)
Thank you!
Leila
Switching to the public list with Bob's permission.
On Wed, Jun 24, 2015 at 1:58 PM, Robert West <robert.bob.west(a)gmail.com>
wrote:
> Hi everyone,
>
> I'd like to find all enwiki articles that were ever marked with the
> {{hoax}} template. Pages with that template mostly end up being deleted, so
> they're not available in the public revision dumps
> <https://dumps.wikimedia.org/enwiki/20150602/>.
>
> Hence my question:
> Is there a way of getting access to the full enwiki revision dump
> including all deleted pages?
> I don't know yet which deleted articles I'm interested in, but will only
> know that after having done a pass over the full revision history.
>
> I know that viewing deleted content is problematic
> <https://en.wikipedia.org/wiki/Wikipedia:Viewing_deleted_content> (hence
> I'm sending this request to this internal research list), but I signed an
> NDA and have access to data on HDFS via stat1002, so there might be a way
> for me to access that data?
>
> I'm also aware of a list of archived hoaxes
> <https://en.wikipedia.org/wiki/Wikipedia:List_of_hoaxes_on_Wikipedia>,
> but many shorter-lived hoaxes that got deleted fast are not included there.
>
> Thanks -- any pointers welcome!
> Bob
>
>
> --
> Up for a little language game? -- http://www.unfun.me
>
Hey all,
We're building a lot of tools out on Labs. From a RESTful API to a
Wikidata Query Service, we're making neat things, and Labs is proving
the perfect place to prototype them - in all but one respect.
A crucial part of these tools being not just useful but measurably
useful is having their logs available to parse. When you combine that
with the constraints on getting things onto the main cluster, what
you have is a situation where much of our beta or alpha software has
no integration with our existing data storage systems but absolutely
/needs/ it, both to verify that it's worth keeping and to provide data
about usage.
So I'm asking, I guess, two things. The first is: can we have a firm
commitment that we'll get this kind of stuff into Hadoop? Right now we
have a RESTful API that is not (to my knowledge) throwing data into
the request logs. We have a WDQS that isn't either. Undoubtedly we
have other tools I haven't encountered. It's paramount that the first
question we ask about new services or systems is "so when does new
traffic data start hitting the analytics cluster?"
Second: what are the best practices for this? What resources are available?
If I'm starting a service on Labs that provides data to third parties,
what would Analytics recommend as my easiest path to getting request
logs into Hadoop?
--
Oliver Keyes
Research Analyst
Wikimedia Foundation
Hello,
I'm inquiring about the delay in publishing the January compressed Wikistats files that are maintained by Erik Zachte. I'm guessing those processes are given a low priority compared to the content backups that need to run.
More generally, I'm interested in finding new ways that I can help out. I'm an ex-Microsoftie who is now on the fraud analytics team at TD Bank. I've been involved with the Wikimedia group in Atlanta: I organize the picnic each summer, and I helped get the rest of the historic buildings photographed. I've dabbled in reverting vandalism, and I contribute to articles when I actually have something to contribute. I don't feel like I've settled into a contributor role that really fits me yet, though.
I enjoy using a variety of the traffic data sets that Wikimedia publishes. It seems the traffic servers get bogged down sometimes though. Can I help? Should I try to get the Atlanta group to pool our donations this year for an extra computer?
Thanks,
Michael
I just posted a comment on the famous task:
https://phabricator.wikimedia.org/T44259#1341010 :)
Here it is for those who would rather discuss on this list:
We have finished analyzing the intermediate hourly aggregate with all the
columns that we think are interesting. The data is too large to query and
anonymize in real time. We'd rather get an API out faster than deal with
that problem, so we decided to produce smaller "cubes" [1] of data for
specific purposes. We have two cubes in mind and I'll explain those here.
For each cube, we're aiming to have:
* Direct access to a postgresql database in labs with the data
* API access through RESTBase
* Mondrian / Saiku access in labs for dimensional analysis
* Data will be pre-aggregated so that any single data point has k-anonymity
(we have not determined a good k yet)
* Higher level aggregations will be pre-computed so they use all data
And, the cubes are:
**stats.grok.se Cube: basic pageview data**
Hourly resolution. Will serve the same purpose as stats.grok.se has served
for so many years. The dimensions available will be:
* project - 'Project name from the request's host name'
* dialect - 'Dialect from the request's path (not set if present in the project name)'
* page_title - 'Page title from the request's path and query'
* access_method - 'Method used to access the pages; can be desktop, mobile web, or mobile app'
* is_zero - 'Whether the pages were accessed through a zero provider'
* agent_type - 'Agent accessing the pages; can be spider or user'
* referer_class - 'Can be internal, external, or unknown'
**Geo Cube: geo-coded pageview data**
Daily resolution. Will allow researchers to track the flu, breaking news,
etc. (an example query is sketched after the dimension list). Dimensions will be:
* project - 'Project name from the request's hostname'
* page_title - 'Page title from the request's path and query'
* country_code - 'ISO country code of the accessing agents (computed using the MaxMind GeoIP database)'
* province - 'State / province of the accessing agents (computed using the MaxMind GeoIP database)'
* city - 'Metro area of the accessing agents (computed using the MaxMind GeoIP database)'
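As an illustration of the kind of question the Geo Cube should answer (a sketch only: the table name, the day column, and the views measure are placeholders; the dimension names mirror the list above):

-- Daily views of 'Influenza' on English Wikipedia, broken down by country
SELECT country_code, SUM(views) AS views
FROM geo_pageviews_daily    -- placeholder table name
WHERE project = 'en.wikipedia'
  AND page_title = 'Influenza'
  AND day BETWEEN '2015-01-01' AND '2015-01-31'
GROUP BY country_code
ORDER BY views DESC;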
So, if anyone wants another cube, **now** is the time to speak up. We'll
probably add cubes later, but it may be a while.
[1] OLAP cubes: https://en.wikipedia.org/wiki/OLAP_cube
Below are a couple of examples of recent slow queries running on
analytics-store. They both use subqueries and UNION. Neither pushes the
WHERE conditions down into the subqueries, which means indexes are not
used. If you include the WHERE clauses in both the inner and outer queries
you will get index usage, especially for these sorts of eventlogging
queries on event_action or wiki.
The MySQL / MariaDB query optimizer does do constant propagation, but
UNION tends to trip it up in all sorts of exciting ways. Be verbose,
abandon the DRY principle, and make your life easier :-)
Current query:
SELECT Month.Date, COALESCE(Web.Web, 0) AS Web
FROM (
    SELECT DATE_FORMAT( ADDDATE(CURDATE() - INTERVAL 30 - 1 DAY, @num:=@num+1), '%Y-%m-%d' ) AS Date
    FROM seq_1_to_100, (SELECT @num:=-1) num
    LIMIT 30
) AS Month
LEFT JOIN (
    SELECT DATE(timestamp) AS Date, SUM(1) AS Web
    FROM (
        SELECT timestamp, wiki, event_username, event_action,
               event_namespace, event_userEditCount
        FROM MobileWebEditing_5644223
        UNION
        SELECT timestamp, wiki, event_username, event_action,
               event_namespace, event_userEditCount
        FROM MobileWebEditing_6077315
        UNION
        SELECT timestamp, wiki, event_username, event_action,
               event_namespace, event_userEditCount
        FROM MobileWebEditing_6637866
        UNION
        SELECT timestamp, wiki, event_username, event_action,
               event_namespace, event_userEditCount
        FROM MobileWebEditing_7675117
        UNION
        SELECT timestamp, wiki, event_username, event_action,
               event_namespace, event_userEditCount
        FROM MobileWebEditing_8599025
    ) AS MobileWebEditing
    WHERE event_namespace != 0 AND event_action = 'success' AND wiki != 'testwiki'
    GROUP BY Date
) AS Web ON Month.Date = Web.Date;
Version with WHERE clauses pushed down to subqueries:
SELECT Month.Date, COALESCE(Web.Web, 0) AS Web
FROM (
    SELECT DATE_FORMAT( ADDDATE(CURDATE() - INTERVAL 30 - 1 DAY, @num:=@num+1), '%Y-%m-%d' ) AS Date
    FROM seq_1_to_100, (SELECT @num:=-1) num
    LIMIT 30
) AS Month
LEFT JOIN (
    SELECT DATE(timestamp) AS Date, SUM(1) AS Web
    FROM (
        SELECT timestamp, wiki, event_username, event_action,
               event_namespace, event_userEditCount
        FROM MobileWebEditing_5644223
        WHERE event_namespace != 0 AND event_action = 'success' AND wiki != 'testwiki'
        UNION
        SELECT timestamp, wiki, event_username, event_action,
               event_namespace, event_userEditCount
        FROM MobileWebEditing_6077315
        WHERE event_namespace != 0 AND event_action = 'success' AND wiki != 'testwiki'
        UNION
        SELECT timestamp, wiki, event_username, event_action,
               event_namespace, event_userEditCount
        FROM MobileWebEditing_6637866
        WHERE event_namespace != 0 AND event_action = 'success' AND wiki != 'testwiki'
        UNION
        SELECT timestamp, wiki, event_username, event_action,
               event_namespace, event_userEditCount
        FROM MobileWebEditing_7675117
        WHERE event_namespace != 0 AND event_action = 'success' AND wiki != 'testwiki'
        UNION
        SELECT timestamp, wiki, event_username, event_action,
               event_namespace, event_userEditCount
        FROM MobileWebEditing_8599025
        WHERE event_namespace != 0 AND event_action = 'success' AND wiki != 'testwiki'
    ) AS MobileWebEditing
    WHERE event_namespace != 0 AND event_action = 'success' AND wiki != 'testwiki'
    GROUP BY Date
) AS Web ON Month.Date = Web.Date;
Current query (hours):
SELECT day,
    `Attempted saves` / `Loads` AS "All wikis",
    `Attempted saves - dewiki` / `Loads - dewiki` AS "dewiki",
    `Attempted saves - enwiki` / `Loads - enwiki` AS "enwiki",
    `Attempted saves - eswiki` / `Loads - eswiki` AS "eswiki",
    `Attempted saves - frwiki` / `Loads - frwiki` AS "frwiki",
    `Attempted saves - ptwiki` / `Loads - plwiki` AS "plwiki",
    `Attempted saves - plwiki` / `Loads - ptwiki` AS "ptwiki"
FROM (
    SELECT DATE( timestamp ) AS day,
        SUM( IF( event_action = 'ready', 1, 0 ) ) AS "Loads",
        SUM( IF( event_action = 'ready' AND wiki = 'dewiki', 1, 0 ) ) AS "Loads - dewiki",
        SUM( IF( event_action = 'ready' AND wiki = 'enwiki', 1, 0 ) ) AS "Loads - enwiki",
        SUM( IF( event_action = 'ready' AND wiki = 'eswiki', 1, 0 ) ) AS "Loads - eswiki",
        SUM( IF( event_action = 'ready' AND wiki = 'frwiki', 1, 0 ) ) AS "Loads - frwiki",
        SUM( IF( event_action = 'ready' AND wiki = 'plwiki', 1, 0 ) ) AS "Loads - plwiki",
        SUM( IF( event_action = 'ready' AND wiki = 'ptwiki', 1, 0 ) ) AS "Loads - ptwiki",
        SUM( IF( event_action = 'saveAttempt', 1, 0 ) ) AS "Attempted saves",
        SUM( IF( event_action = 'saveAttempt' AND wiki = 'dewiki', 1, 0 ) ) AS "Attempted saves - dewiki",
        SUM( IF( event_action = 'saveAttempt' AND wiki = 'enwiki', 1, 0 ) ) AS "Attempted saves - enwiki",
        SUM( IF( event_action = 'saveAttempt' AND wiki = 'eswiki', 1, 0 ) ) AS "Attempted saves - eswiki",
        SUM( IF( event_action = 'saveAttempt' AND wiki = 'frwiki', 1, 0 ) ) AS "Attempted saves - frwiki",
        SUM( IF( event_action = 'saveAttempt' AND wiki = 'plwiki', 1, 0 ) ) AS "Attempted saves - plwiki",
        SUM( IF( event_action = 'saveAttempt' AND wiki = 'ptwiki', 1, 0 ) ) AS "Attempted saves - ptwiki"
    FROM (
        SELECT event_action, wiki, timestamp FROM Edit_10604157
        UNION ALL
        SELECT event_action, wiki, timestamp FROM Edit_10676603
    ) AS valid_edit_schema_revisions
    GROUP BY day
    ORDER BY day
) AS internalQuery;
Version with WHERE clauses pushed down to subqueries:
SELECT day,
    `Attempted saves` / `Loads` AS "All wikis",
    `Attempted saves - dewiki` / `Loads - dewiki` AS "dewiki",
    `Attempted saves - enwiki` / `Loads - enwiki` AS "enwiki",
    `Attempted saves - eswiki` / `Loads - eswiki` AS "eswiki",
    `Attempted saves - frwiki` / `Loads - frwiki` AS "frwiki",
    `Attempted saves - ptwiki` / `Loads - plwiki` AS "plwiki",
    `Attempted saves - plwiki` / `Loads - ptwiki` AS "ptwiki"
FROM (
    SELECT DATE( timestamp ) AS day,
        SUM( IF( event_action = 'ready', 1, 0 ) ) AS "Loads",
        SUM( IF( event_action = 'ready' AND wiki = 'dewiki', 1, 0 ) ) AS "Loads - dewiki",
        SUM( IF( event_action = 'ready' AND wiki = 'enwiki', 1, 0 ) ) AS "Loads - enwiki",
        SUM( IF( event_action = 'ready' AND wiki = 'eswiki', 1, 0 ) ) AS "Loads - eswiki",
        SUM( IF( event_action = 'ready' AND wiki = 'frwiki', 1, 0 ) ) AS "Loads - frwiki",
        SUM( IF( event_action = 'ready' AND wiki = 'plwiki', 1, 0 ) ) AS "Loads - plwiki",
        SUM( IF( event_action = 'ready' AND wiki = 'ptwiki', 1, 0 ) ) AS "Loads - ptwiki",
        SUM( IF( event_action = 'saveAttempt', 1, 0 ) ) AS "Attempted saves",
        SUM( IF( event_action = 'saveAttempt' AND wiki = 'dewiki', 1, 0 ) ) AS "Attempted saves - dewiki",
        SUM( IF( event_action = 'saveAttempt' AND wiki = 'enwiki', 1, 0 ) ) AS "Attempted saves - enwiki",
        SUM( IF( event_action = 'saveAttempt' AND wiki = 'eswiki', 1, 0 ) ) AS "Attempted saves - eswiki",
        SUM( IF( event_action = 'saveAttempt' AND wiki = 'frwiki', 1, 0 ) ) AS "Attempted saves - frwiki",
        SUM( IF( event_action = 'saveAttempt' AND wiki = 'plwiki', 1, 0 ) ) AS "Attempted saves - plwiki",
        SUM( IF( event_action = 'saveAttempt' AND wiki = 'ptwiki', 1, 0 ) ) AS "Attempted saves - ptwiki"
    FROM (
        SELECT event_action, wiki, timestamp
        FROM Edit_10604157
        WHERE event_action IN ('saveAttempt', 'ready')
        UNION ALL
        SELECT event_action, wiki, timestamp
        FROM Edit_10676603
        WHERE event_action IN ('saveAttempt', 'ready')
    ) AS valid_edit_schema_revisions
    GROUP BY day
    ORDER BY day
) AS internalQuery;
--
DBA @ WMF
File modifications on stat1002 haven't been reflected in
http://datasets.wikimedia.org/aggregate-datasets/search/ for ~4 hours;
the awful timing of this means it has effectively broken the search
dashboards (we were in the middle of a format switchover when it
happened).
This needs fixing, hyper-urgently; we cannot display any data without
the ability to /retrieve/ that data. For example, compare
stat1002::/a/aggregate-datasets/search/app_event_counts.tsv to
http://datasets.wikimedia.org/aggregate-datasets/search/app_event_counts.tsv
--
Oliver Keyes
Research Analyst
Wikimedia Foundation