For all Hive users on stat1002/1004: you may have seen a deprecation
warning when launching the Hive client, saying that it is being replaced
by Beeline. The Beeline shell has always been available to use, but it
required supplying a database connection string every time, which was
pretty annoying. We now have a wrapper set up to make this easier. The old
Hive CLI will continue to exist, but we encourage moving over to Beeline.
You can use it by logging into the stat1002/1004 boxes as usual and
launching `beeline`.
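For illustration, here is a minimal Python sketch of the kind of thing such a wrapper might do: build the JDBC connection string once so users don't have to type it. The host and port below are placeholders, not the real cluster values.

```python
# Hypothetical sketch of a Beeline wrapper: prebuild the JDBC URL so the
# user can just run `beeline`. Host/port here are illustrative placeholders.

def beeline_command(host="analytics-hive.example.org", port=10000,
                    extra_args=()):
    """Return the argv list for launching Beeline with a prebuilt JDBC URL."""
    jdbc_url = f"jdbc:hive2://{host}:{port}/default"
    return ["beeline", "-u", jdbc_url, *extra_args]

# Example: pass through any extra flags the user supplied.
cmd = beeline_command(extra_args=["-e", "SHOW DATABASES;"])
```

A real wrapper would then exec this command (e.g. via os.execvp), inheriting the user's Kerberos credentials.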
There is some documentation on this here:
If you run into any issues using this interface, please ping us on the
Analytics list or in #wikimedia-analytics, or file a bug on Phabricator.
(If you are wondering "stat1004, what?", there should be an announcement
about it coming soon!)
We are in the process of upgrading the analytics infrastructure to Debian
Stretch. Along the way, we will be moving
(superset|turnilo|yarn|hue).wikimedia.org to new VMs. Superset and Turnilo
will be upgraded as well.
I plan to move hue and yarn today or tomorrow, and superset and turnilo
either this week or next. The move should be transparent to you all (you
might have to log in again). If you do encounter any issues, please report
them here: https://phabricator.wikimedia.org/T202011
- Andrew Otto
Systems Engineer, WMF
If you are not an Archiva user (https://archiva.wikimedia.org/), you can
stop reading this email. Tomorrow morning EU time I am going to move
archiva.wikimedia.org to a new host, as explained in detail in T192639.
- Archiva gets upgraded to the latest upstream version, 2.2.3 (four years
of upstream development ahead of 2.0.0, the current version).
- The archiva-deploy user will no longer be active; people belonging to
the (new) archiva-deployers LDAP group will be able to log in with their
LDAP credentials and get the same permissions in Archiva (for example, to
upload jars).
- The admin user (used by the SRE team) will no longer be needed, since
anybody belonging to the 'ops' LDAP group will be able to log in and have
the same permissions.
I have already added some people to archiva-deployers (everyone I knew had
worked on it in the past), but if you want to make sure you are on it,
please ping me on IRC or comment on T192639.
As part of T198623, the Analytics and Traffic teams worked on a better set
of firewall rules for IPv4/IPv6 traffic generated within the Analytics
VLAN and going towards Production. For example, we are now enforcing the
use of https://wikitech.wikimedia.org/wiki/HTTP_proxy for all HTTP/HTTPS
connections originating from the Analytics VLAN, so if you have any
important cron job that runs periodically on any Analytics host (most
likely the stat boxes), please check that it complies with this policy as
soon as possible. Please note that the policy itself is not new, but it
will start to be enforced within the next couple of days. We have run
several tcpdump sessions to check the current traffic (we are reasonably
sure that nothing will break), but better safe than sorry :)
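As an illustration, a periodic script can route its outbound requests through the proxy by reading the standard proxy environment variables. The proxy URL below is a placeholder; the real value is documented on the wikitech HTTP_proxy page linked above.

```python
import os
import urllib.request

# Placeholder proxy URL; check the wikitech HTTP_proxy page for the real one.
DEFAULT_PROXY = "http://webproxy.example.org:8080"

def proxied_opener(proxy_url=None):
    """Build a urllib opener that sends HTTP/HTTPS traffic via the proxy.

    Falls back to the https_proxy environment variable, then the placeholder.
    """
    proxy_url = proxy_url or os.environ.get("https_proxy", DEFAULT_PROXY)
    handler = urllib.request.ProxyHandler({"http": proxy_url,
                                           "https": proxy_url})
    return urllib.request.build_opener(handler)

# In a cron job you would then do, for example:
# opener = proxied_opener()
# opener.open("https://example.org")  # request goes via the proxy
```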
For any comments, suggestions, or questions, please follow up in the task
or with me in the #wikimedia-analytics IRC channel.
Thanks in advance!
Luca (on behalf of the Analytics team)
More changes are coming for dumps, this time for Hungarian Wikipedia
(approximately 436,000 articles) and Arabic Wikipedia.
( https://meta.wikimedia.org/wiki/User:Pine )
---------- Forwarded message ---------
From: Ariel Glenn WMF <ariel(a)wikimedia.org>
Date: Mon, Aug 20, 2018 at 10:27 AM
Subject: [Wikitech-l] huwiki, arwiki to be treated as 'big wikis' and run
To: Wikipedia Xmldatadumps-l <Xmldatadumps-l(a)lists.wikimedia.org>,
Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
Starting September 1, huwiki and arwiki, which both take several days to
complete the revision history content dumps, will be moved to the 'big
wikis' list, meaning that they will run jobs in parallel, as frwiki,
ptwiki and others do now, for a speedup.
Please update your scripts accordingly. Thanks!
Task for this: https://phabricator.wikimedia.org/T202268
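In practice, "run jobs in parallel" means the revision history output is split into numbered part files rather than one file. A hedged Python sketch of how a download script might match all parts; the filename pattern here is illustrative, so check the actual dump listing for the exact naming:

```python
import fnmatch

# Illustrative filename pattern for a multi-part history dump; the real
# names on the dumps site may differ in detail.
def history_part_pattern(wiki, date):
    """Glob pattern matching every parallel part of the history dump."""
    return f"{wiki}-{date}-pages-meta-history*.xml*.bz2"

# Hypothetical directory listing after the switch to parallel jobs:
files = [
    "huwiki-20180901-pages-meta-history1.xml.bz2",
    "huwiki-20180901-pages-meta-history2.xml.bz2",
]
parts = fnmatch.filter(files, history_part_pattern("huwiki", "20180901"))
```

A script that previously fetched a single `pages-meta-history.xml.bz2` file would iterate over all matched parts instead.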
I’d like to announce that we’ve done a bit of work to make Jupyter
Notebooks in SWAP <https://wikitech.wikimedia.org/wiki/SWAP> support Spark
kernels. This means that you can now run Spark shells in either local mode
(on the notebook server) or YARN mode (distributed on the Hadoop cluster)
inside of a Jupyter notebook. You can then take advantage of fancy Jupyter
plotting libraries to make graphs directly from data in Spark.
See https://wikitech.wikimedia.org/wiki/SWAP#Spark for documentation.
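The difference between the two modes comes down to the Spark master setting. The SWAP kernels configure this for you, so the following Python sketch is purely illustrative of what local vs YARN mode means:

```python
# Illustrative only: the SWAP Spark kernels handle this configuration
# for you. This just shows what distinguishes local mode from YARN mode.

def spark_settings(mode="local"):
    """Return SparkSession config for local vs YARN (cluster) mode."""
    if mode == "yarn":
        # Executors are distributed across the Hadoop cluster.
        return {"spark.master": "yarn",
                "spark.submit.deployMode": "client"}
    # All work runs inside the notebook server process.
    return {"spark.master": "local[*]"}

# In a notebook cell you would then do something like:
# from pyspark.sql import SparkSession
# builder = SparkSession.builder.appName("swap-demo")
# for k, v in spark_settings("yarn").items():
#     builder = builder.config(k, v)
# spark = builder.getOrCreate()
```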
This is a new feature, and I’m sure there will be kinks to work out. If
you encounter issues or have questions, please respond on this phabricator
ticket <https://phabricator.wikimedia.org/T190443>, or create a new one and
add the Analytics tag.
-Andrew Otto & Analytics Engineering
The next Wikimedia Research Showcase will be live-streamed Wednesday,
August 13, 2018, at 11:30 AM PDT (18:30 UTC).
YouTube stream: https://www.youtube.com/watch?v=OGPMS4YGDMk
As usual, you can join the conversation on IRC at #wikimedia-research. And,
you can watch our past research showcases here.
Hope to see you there!
This month's presentation is:
*Quicksilver: Training an ML system to generate draft Wikipedia articles
and Wikidata entries simultaneously*
John Bohannon and Vedant Dharnidharka, Primer
The automatic generation and updating of Wikipedia articles is usually
approached as a multi-document summarization task: Given a set of source
documents containing information about an entity, summarize the entity.
Purely sequence-to-sequence neural models can pull that off, but getting
enough data to train them is a challenge. Wikipedia articles and their
reference documents can be used for training, as was recently done
<https://arxiv.org/abs/1801.10198> by a team at Google AI. But how do you
find new source documents for new entities? And besides having humans read
all of the source documents, how do you fact-check the output? What is
needed is a self-updating knowledge base that learns jointly with a
summarization model, keeping track of data provenance. Lucky for us, the
world’s most comprehensive public encyclopedia is tightly coupled with
Wikidata, the world’s most comprehensive public knowledge base. We have
built a system called Quicksilver that uses them both.
On Monday August 6 we are making EventStreams multi-DC, and this should be
transparent to users.
Due to a recent outage of our main eqiad Kafka cluster, we want to make
the EventStreams
service support multiple datacenters for better high availability. To do
so, we need to hide the Kafka cluster message offsets from the
SSE/EventSource clients. On Monday August 6th, we will deploy a change to
EventStreams that will make it use message timestamps instead of message
offsets in the SSE/EventSource id field that is returned for every received
message. This will allow EventStreams to be backed by any Kafka cluster,
with auto-resuming during reconnect based on timestamp instead of Kafka
cluster based logical offsets.
This deployment should be transparent to clients. SSE/EventSource clients
will reconnect automatically and begin to use timestamps instead of offsets
in the Last-Event-ID.
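The exact wire format of the id field is internal to the service, but the idea of resuming by timestamp instead of offset can be sketched in Python (field names and values below are illustrative assumptions, not the service's actual schema):

```python
import json

# Illustrative only: the real Last-Event-ID format used by EventStreams
# is defined by the service, not by this sketch. The idea is that each
# topic/partition carries a resume *timestamp* rather than a Kafka offset,
# so any backing Kafka cluster can serve the reconnect.

def make_event_id(assignments):
    """Encode per-topic/partition resume positions as a JSON id field."""
    return json.dumps(assignments)

def resume_positions(last_event_id):
    """Decode a Last-Event-ID value back into resume positions."""
    return json.loads(last_event_id)

event_id = make_event_id([{"topic": "eqiad.mediawiki.recentchange",
                           "partition": 0,
                           "timestamp": 1533513600000}])
positions = resume_positions(event_id)
```

On reconnect, a client sends its last received id back in the Last-Event-ID header, and the service resumes each partition from the recorded timestamp.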
You can read more about this work here:
- Andrew Otto, Systems Engineer, WMF