Hi all,
If you are a Hive user on stat1002/1004, you might have seen a deprecation
warning when you launch the hive client, saying that it is being replaced
with Beeline. The Beeline shell has always been available to use, but it
required supplying a database connection string every time, which was
pretty annoying. We now have a wrapper script
<https://github.com/wikimedia/operations-puppet/blob/production/modules/role…>
set up to make this easier. The old Hive CLI will continue to exist, but we
encourage moving over to Beeline. You can use it by logging into the
stat1002/1004 boxes as usual, and launching `beeline`.
There is some documentation on this here:
https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Beeline.
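If you want to run a one-off HiveQL query non-interactively (e.g. from a
script), something like the sketch below should work, assuming the wrapper
passes standard beeline flags such as -e and --outputformat straight
through (check the docs above if it does not):

    #!/usr/bin/env python3
    # Minimal sketch: run a HiveQL query through the beeline wrapper on
    # the stat boxes. Assumes the wrapper accepts the standard beeline
    # flags -e (execute a query) and --outputformat.
    import subprocess

    result = subprocess.run(
        ["beeline", "--outputformat=tsv2", "-e", "SHOW DATABASES;"],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout)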
If you run into any issues using this interface, please ping us on the
Analytics list or #wikimedia-analytics or file a bug on Phabricator
<http://phabricator.wikimedia.org/tag/analytics>.
(If you are wondering stat1004 whaaat - there should be an announcement
coming up about it soon!)
Best,
--Madhu :)
Hello!
The Analytics team would like to announce that we have migrated the
reportcard to a new domain:
https://analytics.wikimedia.org/dashboards/reportcard/#pageviews-july-2015-…
The migrated reportcard includes both legacy and current pageview data,
daily unique devices, and new editors data. Pageview and devices data is
updated daily, but editor data is still updated ad hoc.
The team is currently working on revamping the way we compute edit data,
and we hope to be able to provide monthly updates for the main edit metrics
this quarter. Some of those will be visible in the reportcard, but the new
wikistats will have more detailed reports.
You can follow the new wikistats project here:
https://phabricator.wikimedia.org/T130256
Thanks,
Nuria
We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes (a short analysis sketch follows the list below):
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia
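As a rough illustration of the second use case above, here is a small
pandas sketch; the column names (prev_title, curr_title, n) and the local
filename are assumptions on my part, so check the README on figshare for
the exact schema:

    # Sketch: most common links people followed to reach a given article.
    # Column names and the TSV filename below are assumptions; adjust to
    # the actual schema/file from the figshare release.
    import pandas as pd

    df = pd.read_csv("2015_01_clickstream.tsv", sep="\t",
                     usecols=["prev_title", "curr_title", "n"])

    article = "London"
    top_referers = (df[df["curr_title"] == article]
                    .sort_values("n", ascending=False)
                    .head(10))
    print(top_referers[["prev_title", "n"]])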
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
Ellery and Dario
Hi,
while trying to improve the mess of our docs for developers on mediawiki.org,
I've been wondering whether anyone is aware of a visualization tool that
draws a graph showing which wiki pages are linked from which other wiki
pages (up to a certain depth), ignores pages which include {{Outdated}}
or {{Historical}} templates, ignores pages in certain namespaces like
"Talk:" or "User:", and ignores pages which are just translations (like
"PageName/qqx").
Or at least some of all this. :)
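To make it a bit more concrete, here is a rough sketch of the kind of crawl
I have in mind, against the MediaWiki action API (depth-limited, main
namespace only, skipping translation subpages); filtering on
{{Outdated}}/{{Historical}} would need an extra prop=templates query per
page, and API continuation is not handled:

    # Rough sketch (not an existing tool): crawl page links up to a fixed
    # depth via the MediaWiki action API and emit graphviz-style edges.
    # Main namespace only; "/" subpages (translations) are skipped.
    import requests

    API = "https://www.mediawiki.org/w/api.php"
    MAX_DEPTH = 2

    def get_links(title):
        params = {
            "action": "query", "format": "json", "prop": "links",
            "titles": title, "plnamespace": 0, "pllimit": "max",
        }
        pages = requests.get(API, params=params).json()["query"]["pages"]
        return [l["title"] for p in pages.values() for l in p.get("links", [])]

    seen = set()
    frontier = ["How to become a MediaWiki hacker"]  # example start page
    for depth in range(MAX_DEPTH):
        next_frontier = []
        for title in frontier:
            if title in seen or "/" in title:  # skip seen pages and subpages
                continue
            seen.add(title)
            for target in get_links(title):
                print(f'"{title}" -> "{target}"')  # graphviz edge
                next_frontier.append(target)
        frontier = next_frontier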
Thanks in advance for any ideas!
andre
--
Andre Klapper | Wikimedia Bugwrangler
http://blogs.gnome.org/aklapper/
Hi,
I've been using the very helpful pagecount dumps described at:
https://dumps.wikimedia.org/other/pagecounts-ez/
And it describes:
Line format:
wiki code (subproject.project)
article title
monthly total (with interpolation when data is missing)
hourly counts
In the wiki code field, the subproject is the language code (fr, el,
ja, etc) or meta, commons etc.
The project is one of b (wikibooks), k (wiktionary), n (wikinews), o
(wikivoyage), q (wikiquote), s (wikisource), v (wikiversity), z (wikipedia).
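For context, this is roughly how I parse each line, based only on the field
order described above (I leave the hourly-counts field alone, since its
encoding isn't covered here):

    # Sketch: parse one line of a pagecounts-ez monthly file, based only
    # on the documented field order. The hourly-counts field is kept as a
    # raw string since its encoding is not described above.
    def parse_line(line):
        wiki_code, title, monthly_total, hourly = line.split(" ", 3)
        subproject, project = wiki_code.split(".", 1)
        return {
            "subproject": subproject,   # language code, or meta/commons/...
            "project": project,         # b, k, n, o, q, s, v, z, ...
            "title": title,
            "monthly_total": float(monthly_total),  # may be interpolated
            "hourly_counts": hourly,    # left unparsed
        }

    print(parse_line("en.z Main_Page 12345 <hourly-counts-field>"))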
However, I've been coming across a large number of entries with the wiki
code "en.m". The "m" code is undocumented. It appears to be the mobile
version of Wikipedia; can anyone confirm that? Should the page be updated
with this information?
Thanks,
Michael
Hi everybody,
the Analytics team needs to make some changes to the current configuration
and deployment of the Analytics databases. Before starting, here is a little
refresher so we are all on the same page:
- db1046 - eventlogging master database
- db1047 - also known as analytics-slave.eqiad.wmnet - replicates s1/s2 via
MySQL, and the log database (on db1046) using a custom replication script.
- dbstore1002 - also known as analytics-store.eqiad.wmnet and
x1-analytics-slave.eqiad.wmnet - replicates most of the S shards and X1 via
MySQL, and the log database using a custom replication script.
- db1108 (brand new host) - replicates the log database using a custom
replication script.
We have been suffering from space and performance issues on dbstore1002
over the past months (https://phabricator.wikimedia.org/T168303), so we
came up with the following plan:
- db1108, a brand new host with SSD disks, replaces db1047 and becomes the
target of the analytics-slave.eqiad.wmnet CNAME. This new host will be a
replica of the log database only; no other database will be replicated.
- dbstore1002 will lose the log database, which will be dropped from the
host.
- db1047 will eventually be decommissioned (after backing up its data and
alerting people beforehand - T156844).
This will allow us to:
1) Reduce the load on dbstore1002 and free a lot of space on the host.
2) Offer a more performant way to query eventlogging analytics data.
3) Reduce the current performance issues that we have been experiencing
while trying to sanitize/purge old event-logging data
(https://phabricator.wikimedia.org/T156933)
The plan is the following:
- November 13th: the analytics-slave CNAME moves from db1047 to db1108
- November 20th: the log database will be dropped from
dbstore1002/analytics-store together with the event-logging replication
script
- December 4th: shutdown of db1047 (after backing up the non-log database tables)
More info in https://phabricator.wikimedia.org/T156844
To summarize what will change from the users' perspective (a short
connection sketch follows the list):
- dbstore1002 (analytics-store) will keep replicating all the S/X shards
(wikis) and all the databases like staging that everybody is used to
working with. It will only lose the log database.
- db1108 will offer the log database replication and a staging database.
- db1047's (analytics-slave) staging database will be moved or copied to
dbstore1002 under a different name (like staging_db1047).
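As a concrete example of what querying will look like after the change
(a sketch only: the hostnames are the CNAMEs above, but the client library
and the credentials-file path are just assumptions, so use whatever
client/config you normally do):

    # Sketch: after the change, event-logging ('log') data is queried via
    # analytics-slave (db1108); wikis and staging stay on analytics-store.
    import pymysql

    conn = pymysql.connect(
        host="analytics-slave.eqiad.wmnet",  # new home of the log database
        db="log",
        read_default_file="~/.my.cnf",       # assumed credentials location
    )
    with conn.cursor() as cur:
        cur.execute("SHOW TABLES")
        print(cur.fetchall())

    # Wiki replicas (S/X shards) and the staging database remain on
    # analytics-store.eqiad.wmnet (dbstore1002), as before.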
Please let us know your opinion in T156844; we'd love to hear some
feedback before proceeding, especially about extra requirements that we
haven't thought of.
Thanks!
Luca (on behalf of the Analytics team)
Hi everybody,
the Analytics team needs to do the following maintenance operations:
1) migrate the Event-Logging master db ('log', currently on db1046) to the
new host db1107 (T156844). This should happen on *Wed Nov 15th (EU morning)*,
and it should be transparent to all the Event Logging users. The only
drawback that might be observed is a delay in getting the latest records on
the analytics db replicas (db1108, db1047, dbstore1002).
2) Reboot thorium and all the stat boxes for Linux kernel updates.
- Thorium hosts all the analytics websites like pivot.wikimedia.org,
yarn.wikimedia.org, analytics.wikimedia.org, etc., and will be rebooted on
*Wed Nov 15th (EU morning)*; the websites' downtime should be minimal (in
the range of minutes).
- stat boxes (stat1004, stat1005, stat1006) are usually running a lot of
screen/tmux sessions with various data crunching activities, so I'll try to
follow up with all the users currently running something on them to check
whether I can proceed. I'd tentatively schedule the reboots for *Thu Nov
16th (EU morning)*, but please follow up with me asap if this needs to be
postponed.
Thanks in advance and sorry for the trouble!
Luca (on behalf of the Analytics team)
Google Code-in is an annual contest for 13-17 year old students. It
will take place from Nov 28 to Jan 17 and is not only about coding tasks.
While we wait to hear whether Wikimedia will get accepted:
* You have small, self-contained bugs you'd like to see fixed?
* Your documentation needs specific improvements?
* Your user interface has small design issues?
* Your Outreachy/Summer of Code project welcomes small tweaks?
* You'd enjoy helping someone port your template to Lua?
* Your gadget code uses some deprecated API calls?
* You have tasks in mind that welcome some research?
Also note that "Beginner tasks" (e.g. "Set up Vagrant") and "generic"
tasks (e.g. "Choose & fix 2 PHP7 issues from the list in
https://phabricator.wikimedia.org/T120336") are very welcome, because we
will need hundreds of tasks. :)
And we also have more than 400 unassigned open 'easy' tasks listed:
https://phabricator.wikimedia.org/maniphest/query/HCyOonSbFn.z/#R
Would you be willing to mentor some of those in your area?
Please take a moment to find / update [Phabricator etc.] tasks in your
project(s) which would take an experienced contributor 2-3 hours. Check
https://www.mediawiki.org/wiki/Google_Code-in/Mentors
and please ask if you have any questions!
For some achievements from last round, see
https://blog.wikimedia.org/2017/02/03/google-code-in/
Thanks!
andre
--
Andre Klapper | Wikimedia Bugwrangler
http://blogs.gnome.org/aklapper/
Hi,
We are three graduate students at UC Berkeley, and we are currently working on a machine learning project for a class that we’re taking.
We’re using the page views data that we believe you maintain: https://dumps.wikimedia.org/other/pagecounts-raw/
We have two quick questions that we were hoping you could answer:
1) We found views with a size of -1 or 0. Does this mean the page doesn’t exist?
2) We found that some articles have a `size` that varies widely across the hourly snapshots of a day. Is that legitimate, or is there something odd with the data?
Thanks,
Ugur
Hi Everyone,
The next Research Showcase will be live-streamed this Wednesday, November
15, 2017, at 11:30 AM PST (18:30 UTC).
YouTube stream: https://www.youtube.com/watch?v=nMENRAkeHnQ
As usual, you can join the conversation on IRC at #wikimedia-research. And,
you can watch our past research showcases here
<https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase#November_2017>.
This month's presentation:
Conversation Corpora, Emotional Robots, and Battles with Bias
By *Lucas Dixon (Google/Jigsaw)*
I'll talk about interesting experimental setups for
doing large-scale analysis of conversations in Wikipedia, and what it even
means to grapple with the concept of conversation when one is talking about
revisions on talk pages. I'll also describe challenges with having good
conversations at scale, some of the dreams one might have for AI in the
space, and I'll dig into measuring unintended bias in machine learning and
what one can do to make ML more inclusive. This talk will cover work from
the WikiDetox <https://meta.wikimedia.org/wiki/Research:Detox> project as
well as ongoing research on the nature and impact of harassment in
Wikipedia discussion spaces
<https://meta.wikimedia.org/wiki/Research:Study_of_harassment_and_its_impact> –
part of a collaboration between Jigsaw, Cornell University, and the
Wikimedia Foundation. The ML model training code, datasets, and the
supporting tooling developed as part of this project are openly available.
Many kind regards,
Sarah R. Rodlund
Senior Project Coordinator-Product & Technology, Wikimedia Foundation
srodlund(a)wikimedia.org