Hi all,
We will be draining the Hadoop cluster of jobs on Tuesday, May 25, starting at
15:00 UTC (8am PDT) to update the operating system of the namenodes.
Maintenance should last less than 2 hours, and we'll announce when the
queue is back to accepting jobs.
View planned maintenance here
<https://wikitech.wikimedia.org/wiki/Analytics/Systems/Maintenance_Schedule>.
Reply to this email or comment on the task at
https://phabricator.wikimedia.org/T278423 if you have any questions or
concerns.
This maintenance could be done without downtime, but I'm opting to drain the
cluster to make recovery easier in case anything goes wrong.
Regards,
Razzi & the Data Engineering team
Hello!
We are trying to standardize the way we sanitize and retain event data in
Hive. For now, nothing will change for instrumentation data. What is
changing is that we are going to apply the same sanitization process to all
tables in the Hive event database, and then drop data older than 90 days
from all of those tables.
For analytics/instrumentation event tables, nothing is changing. If you
need to keep data longer than 90 days, you will need to add an entry to the
event_sanitized_analytics allowlist
<https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/refinery/+/refs/he…>
(this was previously named eventlogging/whitelist.yaml), as described here
<https://wikitech.wikimedia.org/wiki/Analytics/Systems/Event_Sanitization#Al…>.
For main / production event tables (e.g. mediawiki_revision_create), we are
now copying this data into the event_sanitized database. Tables to be
copied are listed in the event_sanitized_main allowlist
<https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/refinery/+/refs/he…>.
We will soon begin applying the same purging policy to all tables in the
event database. When that happens, main / production event tables in the
event database will no longer have data older than 90 days. If you need to
query data older than 90 days for these tables, you will find it in the
event_sanitized database.
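For example, here is a minimal sketch of what querying both databases could
look like from PySpark. The table name comes from the example above; the
partition columns (year, month) are assumptions, so check the actual table
schema before relying on them:

    from pyspark.sql import SparkSession

    # Minimal sketch: compare recent vs. sanitized historical event data.
    # Partition columns (year, month) below are illustrative assumptions.
    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Data from the last 90 days stays in the `event` database.
    recent = spark.sql("""
        SELECT *
        FROM event.mediawiki_revision_create
        WHERE year = 2021 AND month = 5
    """)

    # Older, sanitized data lives in the `event_sanitized` database.
    historical = spark.sql("""
        SELECT *
        FROM event_sanitized.mediawiki_revision_create
        WHERE year = 2020
    """)

    print(recent.count(), historical.count())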
With this change, all event table sanitization and retention in Hive is
handled in the same way.
Docs have been updated; you can read more here:
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Event_Sanitization
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Event_Data_retention
-Andrew Otto
SRE, Data Engineering
Hello!
tl;dr: We'd like to turn off Jupyter+Virtualenv (SWAP) in favor of
Jupyter+Conda (Newpyter) the week of May 3rd. Please help us test and
switch before then.
Over the last year, we've slowly been working on replacing the current
virtualenv-based JupyterHub system (formerly known as SWAP) with a new one
based on Conda <https://docs.conda.io/en/latest/> (AKA Newpyter).
Everything should be in place to switch over and decommission the
virtualenv-based system you are all used to. Before we do, we have to make
sure you are all using, and are OK with, the new setup!
We'd like to decommission Jupyter+Virtualenv (running on port 8000) the
week of May 3rd. In the meantime, please switch to Jupyter+Conda on port
8880. The documentation has been updated.
<https://wikitech.wikimedia.org/wiki/Analytics/Systems/Jupyter>
Summary of the changes:
- You will ssh tunnel to port 8880
<https://wikitech.wikimedia.org/wiki/Analytics/Systems/Jupyter#Access>
instead of port 8000.
- Your Notebook files will remain unchanged.
- Your local data files will remain unchanged.
- Your Python environment will change, so you may need to re-install
packages. See docs here
<https://wikitech.wikimedia.org/wiki/Analytics/Systems/Jupyter#Conda_environ…>
and here
<https://wikitech.wikimedia.org/wiki/Analytics/Systems/Anaconda#Installing_p…>
.
- The PySpark, Scala Spark, Spark SQL, and Spark R kernels will be
removed. If you currently use the PySpark kernels, please port your
notebooks to a regular Python kernel and use wmfdata-python to launch your
SparkSession (see the sketch after this list). Docs here
<https://wikitech.wikimedia.org/wiki/Analytics/Systems/Jupyter#PySpark>.
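For reference, here is a minimal sketch of launching a SparkSession from a
regular Python kernel. It uses plain PySpark rather than the wmfdata-python
helper described in the docs above, and the app name and config values are
illustrative assumptions, not recommended settings:

    from pyspark.sql import SparkSession

    # Minimal sketch: create a SparkSession from a regular Python kernel,
    # replacing the removed PySpark notebook kernels. The settings below
    # are illustrative assumptions; wmfdata-python handles this for you.
    spark = (
        SparkSession.builder
        .appName("my-notebook-analysis")        # hypothetical app name
        .master("yarn")                         # run on the Hadoop cluster
        .config("spark.executor.memory", "4g")  # example resource setting
        .enableHiveSupport()
        .getOrCreate()
    )

    # Once the session exists, Hive tables can be queried directly.
    spark.sql("SHOW DATABASES").show()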
Please reach out with any questions, and report issues on this ticket
<https://phabricator.wikimedia.org/T224658>. If we encounter any blockers
along the way, we will postpone the May 3rd deadline.
Thank you!
- Andrew Otto + Data Engineering