Hi all,
For all Hive users on stat1002/1004: you might have seen a deprecation
warning when launching the hive client, saying that it is being replaced
by Beeline. The Beeline shell has always been available to use, but it
required supplying a database connection string every time, which was
pretty annoying. We now have a wrapper
<https://github.com/wikimedia/operations-puppet/blob/production/modules/role…>
script
set up to make this easier. The old Hive CLI will continue to exist, but we
encourage moving over to Beeline. You can use it by logging into the
stat1002/1004 boxes as usual and launching `beeline`.
There is some documentation on this here:
https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Beeline.
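To give a sense of what the wrapper saves you from typing, here is a rough sketch; the JDBC URL below is a placeholder, not the real cluster connection string (that lives in the puppet module linked above):

```shell
# Without the wrapper, every Beeline session needs the full JDBC URL
# (hostname and port here are illustrative only):
beeline -u "jdbc:hive2://hive-server.example.org:10000/default" -e "show databases;"

# With the wrapper on stat1002/1004, it is simply:
beeline
```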
If you run into any issues using this interface, please ping us on the
Analytics list or #wikimedia-analytics or file a bug on Phabricator
<http://phabricator.wikimedia.org/tag/analytics>.
(If you are wondering what stat1004 is - there should be an announcement
about it coming up soon!)
Best,
--Madhu :)
Hello!
The Analytics team would like to announce that we have migrated the
reportcard to a new domain:
https://analytics.wikimedia.org/dashboards/reportcard/#pageviews-july-2015-…
The migrated reportcard includes both legacy and current pageview data,
daily unique devices, and new editors data. Pageview and devices data are
updated daily, but editor data is still updated ad hoc.
The team is currently working on revamping the way we compute edit data,
and we hope to be able to provide monthly updates for the main edit metrics
this quarter. Some of those will be visible in the reportcard, but the new
wikistats will have more detailed reports.
You can follow the new wikistats project here:
https://phabricator.wikimedia.org/T130256
Thanks,
Nuria
We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia:
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia
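The first two uses above amount to grouping and sorting the (referer, article, count) rows. A quick sketch over a tiny made-up sample in the dataset's shape (the column order, names, and values here are illustrative, not taken from the actual dump):

```shell
# Tiny hypothetical sample: referer <TAB> article <TAB> count
printf 'other-google\tLondon\t1000\nParis\tLondon\t300\nLondon\tParis\t250\n' \
    > /tmp/clickstream_sample.tsv

# Most common referers for the article "London", highest count first:
awk -F'\t' '$2 == "London" { c[$1] += $3 } END { for (r in c) print c[r] "\t" r }' \
    /tmp/clickstream_sample.tsv | sort -rn
```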
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
Ellery and Dario
FYI
The deployment-kafka-jumbo-1 VM in Beta Cluster will be offline for a
while next week. Please be aware and take appropriate action.
Greg
----- Forwarded message from Andrew Bogott <abogott(a)wikimedia.org> -----
> Date: Fri, 29 Sep 2017 15:57:54 -0500
> From: Andrew Bogott <abogott(a)wikimedia.org>
> To: Cloud-announce(a)lists.wikimedia.org
> Subject: [Cloud] [Cloud-announce] Downtime for select VMs next week, 2017-10-04
> Reply-To: cloud(a)lists.wikimedia.org
>
> In order to rebuild a server of questionable stability, I'm going to move
> the following instances on Wednesday:
>
> +--------------------------+---------------------+--------+
> | Name                     | Tenant ID           | Status |
> +--------------------------+---------------------+--------+
> | cindy                    | pluggableauth       | ACTIVE |
> | deployment-kafka-jumbo-1 | deployment-prep     | ACTIVE |
> | oidc-google              | pluggableauth       | ACTIVE |
> | proton-staging           | reading-web-staging | ACTIVE |
> | search-jessie            | search              | ACTIVE |
> | smtp-test1               | project-smtp        | ACTIVE |
> | suggestbot-prod          | suggestbot          | ACTIVE |
> | twlight-prod             | twl                 | ACTIVE |
> | twlight-staging          | twl                 | ACTIVE |
> | wikibrain-embeddings-02  | wikibrain           | ACTIVE |
> | wikikids                 | wmam                | ACTIVE |
> | zim-proto                | mobile              | ACTIVE |
> +--------------------------+---------------------+--------+
>
> Migration will cause the affected instances to be offline for some time
> (potentially more than an hour depending on the size of the instance) and
> rebooted. If you need me to work on your server at a particular time of day,
> or need a stay of execution, please let me know. Otherwise I'll start going
> down the list at the beginning of my workday on Wednesday, around 14:00 UTC.
>
> Sorry for the inconvenience!
>
> -Andrew
>
>
> _______________________________________________
> Cloud-announce mailing list
> Cloud-announce(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/cloud-announce
> _______________________________________________
> Cloud mailing list
> Cloud(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/cloud
----- End forwarded message -----
--
| Greg Grossmeier GPG: B2FA 27B1 F7EB D327 6B8E |
| Release Team Manager A18D 1138 8E47 FAC8 1C7D |
Hi Everyone,
The next Research Showcase will be live-streamed this Wednesday, September
20, 2017, at 11:30 AM PST (18:30 UTC).
YouTube stream: https://www.youtube.com/watch?v=VR5JwqyVGSk
As usual, you can join the conversation on IRC at #wikimedia-research. And,
you can watch our past research showcases here
<https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase#September_2017>.
This month's presentation:
A Glimpse into Babel: An Analysis of Multilinguality in Wikidata
By *Lucie-Aimée Kaffee*
Multilinguality is an important topic for knowledge bases,
especially Wikidata, which was built to serve the multilingual requirements
of an international community. Its labels are the way for humans to
interact with the data. In this talk, we explore the state of languages in
Wikidata as of now, especially in regard to its ontology, and the
relationship to Wikipedia. Furthermore, we set the multilinguality of
Wikidata in the context of the real world by comparing it to the
distribution of native speakers. We find an uneven distribution of
languages, which is less pronounced in the ontology, and promising
results for future improvements. We close with an outlook on how users
interact with languages on Wikidata.
Science is Shaped by Wikipedia: Evidence from a Randomized Control Trial
By *Neil C. Thompson and Douglas Hanley*
As the largest encyclopedia in the world, it
is not surprising that Wikipedia reflects the state of scientific
knowledge. However, Wikipedia is also one of the most accessed websites in
the world, including by scientists, which suggests that it also has the
potential to shape science. This paper shows that it does. Incorporating
ideas into a Wikipedia article leads to those ideas being used more in the
scientific literature. This paper documents this in two ways:
correlationally across thousands of articles in Wikipedia and causally
through a randomized experiment where we added new scientific content to
Wikipedia. We find that fully a third of the correlational relationship is
causal, implying that Wikipedia has a strong shaping effect on science. Our
findings speak not only to the influence of Wikipedia, but more broadly to
the influence of repositories of scientific knowledge. The results suggest
that increased provision of information in accessible repositories is a
very cost-effective way to advance science. We also find that such gains
are equity-improving, disproportionately benefitting those without
traditional access to scientific information.
Many kind regards,
Sarah R. Rodlund
Senior Project Coordinator-Product & Technology, Wikimedia Foundation
srodlund(a)wikimedia.org
Hello everyone,
I'm having trouble with the entries in wmf.wdqs_extract on 23.08.2017,
specifically the hours 9-17. The call
hive -e "insert overwrite local directory 'temp'
    row format delimited fields terminated by '\t'
    select uri_query, uri_path, user_agent, ts, agent_type, hour, http_status
    from wmf.wdqs_extract
    where uri_query <> \"\" and year='2017' and month='8' and day='23' and hour='8'"
works fine, as do all hours in 0-8 and 18-23, but if I change hour to
anything between 9 and 17 it fails with the following message:
Error: java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:449)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:421)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1796)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
    ... 9 more
Caused by: java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
    at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:38)
    ... 14 more
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
    ... 17 more
Caused by: java.lang.RuntimeException: Map operator initialization failed
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.configure(ExecMapper.java:147)
    ... 22 more
Caused by: java.lang.RuntimeException: Hive internal error: conversion of string to array<string> not supported yet.
    at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters$ListConverter.<init>(ObjectInspectorConverters.java:313)
    at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters.getConverter(ObjectInspectorConverters.java:158)
    at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters$StructConverter.<init>(ObjectInspectorConverters.java:374)
    at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters.getConverter(ObjectInspectorConverters.java:155)
    at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters$StructConverter.<init>(ObjectInspectorConverters.java:374)
    at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters.getConverter(ObjectInspectorConverters.java:155)
    at org.apache.hadoop.hive.ql.exec.MapOperator.initObjectInspector(MapOperator.java:199)
    at org.apache.hadoop.hive.ql.exec.MapOperator.setChildren(MapOperator.java:355)
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.configure(ExecMapper.java:116)
    ... 22 more
Can anybody help me out with this?
Greetings,
Adrian
Hi all!
tl;dr: Stop using stat100[23] by September 1st.
We’re finally replacing stat1002 and stat1003. These boxes are out of
warranty, and are running Ubuntu Trusty, while most of the production fleet
is already on Debian Jessie or even Debian Stretch.
stat1005 is the new stat1002 replacement. If you have access to stat1002,
you also have access to stat1005. I’ve copied over home directories from
stat1002.
stat1006 is the new stat1003 replacement. If you have access to stat1003,
you also have access to stat1006. I’ve copied over home directories from
stat1003.
I have not migrated any personal cron jobs running on stat1002 or
stat1003. I need your help for this!
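If it helps, migrating your own cron jobs boils down to something like the following sketch (file name and copy method are up to you):

```shell
# On stat1002 or stat1003: save your current cron jobs to a file.
crontab -l > ~/my-crontab.txt

# Copy the file over (e.g. with scp), then on stat1005 or stat1006:
crontab ~/my-crontab.txt

# Finally, remove the jobs on the old host so they don't run twice:
crontab -r
```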
Both of these boxes are running Debian Stretch. As such, packages that
your work depends on may have been upgraded. Please log into the new boxes and
try stuff out! If you find anything that doesn’t work, please let me know
by commenting on https://phabricator.wikimedia.org/T152712.
Please be fully migrated to the new nodes by September 1st. This will give
us enough time to fully decommission stat1002 and stat1003 by the end of
this quarter.
I’ve only done a single rsync of home directories. If there is new data on
stat1002 or stat1003 that you want rsynced over, let me know on the ticket.
A few notes:
- stat1002 used to have /a. This has been removed in favor of /srv; /a no
longer exists.
- Home directories are now much larger. You no longer need to create
personal directories in /srv.
- /tmp is still small, so please be careful. If you are running long jobs
that generate temporary data, please have those jobs write into your home
directory, rather than /tmp.
- We might implement user home directory quotas in the future.
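One way to follow the /tmp advice above, sketched with sort (any tool that honors TMPDIR behaves similarly; the paths are illustrative):

```shell
# Create a scratch directory in your (now much larger) home directory:
mkdir -p "$HOME/tmp"
export TMPDIR="$HOME/tmp"

# Many tools honor TMPDIR; GNU sort can also be pointed at it explicitly:
printf '3\n1\n2\n' > "$TMPDIR/unsorted.txt"
sort --temporary-directory="$TMPDIR" "$TMPDIR/unsorted.txt"
```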
Thanks all! I’ll send another email in about a month's time to remind you
of the impending deadline of Sept 1.
-Andrew Otto