Hi all!
tl;dr newly created files in HDFS will not be 'world' readable by default.
I.e. you must be either the owner or in the file's group to read the file.
Today we changed the default HDFS umask to 027, so that all new files
will be created with mode 640 (rw-r-----) and all new directories with
750 (rwxr-x---).
We don't anticipate any problems, but if you encounter any please don't
hesitate to let us know. You can always hdfs dfs -chmod
<https://hadoop.apache.org/docs/r2.7.6/hadoop-project-dist/hadoop-common/Fil…>
your files after you create them.
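For example, to make one of your files world-readable again (the path
below is just a placeholder):

  hdfs dfs -chmod o+r /path/to/your/file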
See https://phabricator.wikimedia.org/T270629 for more info.
Your friendly Hadoop operators,
- Andrew & Razzi & Luca
Hi everybody,
On Monday 21st we'd like to reboot all stat100x hosts for Linux kernel
upgrades at around 9 AM CET. This means that all the notebooks and various
activities running on those nodes will be stopped for a brief amount of
time. To repay your patience, two improvements will come with the reboots:
- A shared Kerberos credential cache with notebooks. In practice this means
that you will only be required to kinit once (either after ssh-ing to a
stat100x host or in a Jupyter notebook), and the credentials will be shared
(no more double kinit, etc.). It is already "live" on stat1004 if you want
to test it (see the first example after this list). Since the new shared
credential cache will have a new location on disk, all Kerberos sessions
will be destroyed and you'll have to kinit again when the reboots are
completed. More details in
https://phabricator.wikimedia.org/T255262.
- A new endpoint for Hive called 'analytics-hive.eqiad.wmnet', which should
replace Hive JDBC/metastore configs that hardcode an-coord1001.eqiad.wmnet
(and allow us to fail over transparently if needed, without requesting job
restarts, etc.). The side effect is that all Hive-related tools will change
configs (transparently for external users). If you have any script that
points directly to Hive via JDBC (for example a Python script using
PyHive), please update it to the new endpoint (see the second example after
this list).
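To illustrate the shared credential cache (a sketch; stat1004 is simply
where it is already live):

  ssh stat1004.eqiad.wmnet
  kinit   # authenticate once, entering your Kerberos password
  klist   # shows the cached ticket; a Jupyter notebook on the same host
          # now picks up the same credentials, with no second kinit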
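And a minimal PyHive sketch of the endpoint switch (the port, auth
settings and query are illustrative; adapt them to your script):

  from pyhive import hive

  # Before: conn = hive.connect(host='an-coord1001.eqiad.wmnet', ...)
  # After: point at the new endpoint instead
  conn = hive.connect(
      host='analytics-hive.eqiad.wmnet',
      port=10000,                       # default HiveServer2 port
      auth='KERBEROS',
      kerberos_service_name='hive',
  )
  cursor = conn.cursor()
  cursor.execute('SELECT 1')
  print(cursor.fetchall())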
If this schedule impacts your work, please ping me via email/IRC/etc. and
I'll try to reschedule accordingly :)
Thanks!
Luca (on behalf of Analytics / Data Engineering)
Hi all,
On Tuesday, December 22, from 15-16 UTC (10-11am EST, 7-8am PST),
superset.wikimedia.org will be offline to upgrade the hardware and add
caching as part of https://phabricator.wikimedia.org/T268219.
When the upgrade is complete, by default, charts will be cached for 12
hours. As you can see in the following screenshot, you can view whether a
chart is cached from the overflow menu, and you'll have the option to force
refresh it.
[image: Screen Shot 2020-12-16 at 11.40.51 AM.png]
The time that any given chart will be cached is configurable via the "edit
chart" menu item. For example, set the cache timeout to 3600 seconds for
data to cache for an hour.
[image: image.png]
Reply to this email or reach out to razzi or the #wikimedia-analytics IRC
channel if you have any questions or concerns about this migration. As
always, the maintenance schedule can be viewed here
<https://wikitech.wikimedia.org/wiki/Analytics/Systems/Maintenance_Schedule>.
Regards,
Razzi
Hi everybody,
The Analytics team is trying to simplify the access request process for
the stat100x clients, to avoid as much as possible confusion for the user
requesting access and for the SRE reviewing the request. The
following is happening:
* The analytics-users and researchers POSIX groups are being deprecated in
https://phabricator.wikimedia.org/T269150 and
https://phabricator.wikimedia.org/T268801. They are used by only a few
users and are no longer needed. To be clear, we are not trying to deprecate
the Research team, we love them :)
* analytics-privatedata-users becomes the standard POSIX group for access
to the stat100x hosts and the Hadoop cluster. A user will be able to
request either membership in the group alone (granting access to the
stat100x hosts plus some PII data, like the data on the Mariadb
Wiki-replicas), or additionally the Kerberos account, to get access to
Hadoop's PII data (and compute power) too.
The main idea is to make users requesting access aware that they will be
exposed to PII data in some form, so that careful steps will need to be
taken (see
https://wikitech.wikimedia.org/wiki/Analytics/Data_access#User_responsibili…
).
As always, feedback and suggestions are welcome!
Luca (Analytics team)
Hi everybody,
I need to shut down stat1004 tomorrow (Nov 25th) at around 16:00 CET to
allow SRE to move it to another rack (a physical move in the datacenter).
The downtime should be minimal (half an hour at most).
Please let me know if this impacts your work!
Also added to
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Maintenance_Schedule
Luca (on behalf of the Analytics team)
Hi everybody,
On Monday 5th there will be some downtime of stat1005 and stat1008
(hopefully one hour max in total) to expand their RAM to 1.5TB (!!!!). The
maintenance is scheduled to start at around 16:00 CEST.
As always, please let us know if this impacts your work!
Luca (on behalf of the Analytics team)
Hi!
Next Monday, 2020-11-16, I will be doing some maintenance on stat1008 in
the EU/CET morning. During this window, everything running there will be
disrupted and the machine will be rebooted multiple times. Afterwards, the
machine will be running a newer kernel (5.8) and updated GPU drivers/rocm
library (3.8). This is the same update as the one I did the week before
last, on stat1005.
If you have any questions or concerns, let us know.
Best,
Tobias
--
Tobias Klausmann, SRE, Wikimedia Foundation
Hi Data Folks,
*TL;DR:* We plan to update the wmf.webrequest table on Monday, November
23rd with this change
<https://gerrit.wikimedia.org/r/c/analytics/refinery/+/638086> - Please get
in touch on this task <https://phabricator.wikimedia.org/T267008> if you
run hive queries taking advantage of the TABLESAMPLE feature on this table.
*Why?*
Testing the changes, we have seen:
- More than a 15% gain in global CPU time per computed partition, saving
more than 300 CPU-hours per month.
- Wall-clock time of the webrequest load job almost halved (when the
cluster is not busy).
- Decreased disk and network usage thanks to smaller data to shuffle-sort:
the amount of data to be written/sent/read is cut in half.
*What changes?*
The change visible to users of the table is an increase in the number of
buckets by which the table is bucketed, from 64 to 256. This means that for
any leaf partition (webrequest_source, year, month, day, hour - the actual
folders where data files are stored), there will be 256 files instead of
64. The bucketing strategy won't change: the shuffling of rows between the
files will still be done using the (hostname, sequence) field pair, in that
order.
Changes invisible to users are improvements in the hive query
loading/augmenting the data into the partitions.
*How does the change impact users?*
We plan to drop the table (the structure, not the data!) and recreate it
with the new bucketing number, re-adding existing partitions.
This drop-recreate should go unnoticed as it is fast to execute. As new
data flows in and old data is deleted, it will take 3 months for the whole
table to be converted. During those three months, partitions containing 64
files will still be usable, but queries taking advantage of buckets through
the TABLESAMPLE feature will be broken for those partitions.
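For instance, a bucket-sampling query along these lines (partition values
and bucket choice are illustrative) will need its denominator updated from
64 to 256 once the partition it reads has been rewritten:

  -- Old-layout partitions: read 1 bucket file out of 64
  SELECT COUNT(*)
  FROM wmf.webrequest
  TABLESAMPLE (BUCKET 1 OUT OF 64 ON hostname, sequence)
  WHERE webrequest_source = 'text'
    AND year = 2020 AND month = 11 AND day = 1 AND hour = 0;
  -- New-layout partitions: use OUT OF 256 instead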
Don't hesitate to reach out if you have questions :)
--
Joseph Allemandou (joal) (he / him) on behalf of the Analytics-Engineering
team
Staff Data Engineer
Wikimedia Foundation
Hi everybody,
We are going to expand the available RAM on an-coord1001, the host in the
Analytics infrastructure that runs Hive/Presto/Oozie/Airflow. The procedure
should last 30-40 minutes in the optimistic case, and it will involve
shutting down the host (and hence all daemons running on it) to allow SRE
to install the new RAM modules.
As always please reach out to us if this impacts your work.
Thanks in advance,
Luca (on behalf of the Analytics team)
Hi!
> This Friday, 2020-10-30, I will be doing some maintenance on stat1005 in the
> EU/CET morning. During this, there will be disruption of everything there and
> there will be multiple reboots. Afterwards, the machine will be running a newer
> kernel (5.8) and updated GPU drivers/rocm library (3.8). Should the update
> fail, or the subsequent tests show that workloads break, we will roll back to
> 4.19 and rocm33.
stat1005 is now running kernel 5.8.0 and rocm38. Note that you will have
to update tf-rocm to the latest version (2.3.1) for it to work on this
machine.
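Assuming you installed it with pip (the PyPI package name is
tensorflow-rocm), the upgrade would look like:

  pip install --upgrade tensorflow-rocm==2.3.1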
If you have any questions or concerns, let us know.
Best,
Tobias
--
Tobias Klausmann, SRE, Wikimedia Foundation