Hi everybody,
I need to shut down stat1004 tomorrow (Nov 25th) at around 16:00 CET to
allow SRE to move it to another rack (a physical move in the datacenter). The
downtime should be minimal (half an hour at most).
Please let me know if this impacts your work!
Also added to
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Maintenance_Schedule
Luca (on behalf of the Analytics team)
Hi everybody,
on Monday 5th there will be some downtime of stat1005 and stat1008
(hopefully one hour max in total) to expand their RAM to 1.5TB (!!!!). The
maintenance is scheduled to start around 16:00 CEST.
As always, please let us know if this impacts your work!
Luca (on behalf of the Analytics team)
Hi!
Next Monday, 2020-11-16, I will be doing some maintenance on stat1008 in the
EU/CET morning. During this window, all services on the machine will be
disrupted and there will be multiple reboots. Afterwards, the machine will be
running a newer kernel (5.8) and updated GPU drivers/ROCm library (3.8). This
is the same update as the one I did the week before last on stat1005.
If you have any questions or concerns, let us know.
Best,
Tobias
--
Tobias Klausmann, SRE, Wikimedia Foundation
Hi Data Folks,
*TL;DR:* We plan to update the wmf.webrequest table on Monday, November 23rd
with this change
<https://gerrit.wikimedia.org/r/c/analytics/refinery/+/638086> - Please get
in touch on this task <https://phabricator.wikimedia.org/T267008> if you
run Hive queries that take advantage of the TABLESAMPLE feature on this table.
*Why?*
Testing the changes, we have seen:
- A gain of more than 15% in global CPU time per computed partition, saving
more than 300 hours of CPU per month.
- Wall-clock time of the webrequest load job almost halved (when the
cluster is not busy).
- Reduced disk and network usage thanks to there being less data to
shuffle-sort: we halved the amount of data to be written/sent/read.
*What changes?*
The change visible to users of the table is the increase of the number of
buckets by which the table is bucketed, from 64 to 256. This means that for
any leaf partition (webrequest_source, year, month, day, hour - actual
folders where data files are stored), there will be 256 files instead of
64. The bucketing strategy won't change, meaning that rows will still be
distributed across the files using the (hostname, sequence) field pair, in
that order.
Changes invisible to users are improvements in the hive query
loading/augmenting the data into the partitions.
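To make the user-visible part of the change concrete, the bucketing lives in the table DDL. The sketch below is illustrative only (the real table has many more columns and storage properties, elided here); the key difference is the bucket count going from 64 to 256:

```sql
-- Hedged sketch of the new wmf.webrequest bucketing clause.
-- Column list and storage details are placeholders, not the real DDL.
CREATE TABLE wmf.webrequest (
  hostname STRING,
  sequence BIGINT
  -- ... many more columns elided ...
)
PARTITIONED BY (webrequest_source STRING, year INT, month INT, day INT, hour INT)
CLUSTERED BY (hostname, sequence) INTO 256 BUCKETS  -- previously 64
STORED AS PARQUET;
```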
*How does the change impact users?*
We plan to drop the table (the structure, not the data!) and recreate it
with the new bucketing number, re-adding existing partitions.
This drop-recreate should go unnoticed, as it is fast to execute. As new
data flows in and old data is deleted, it will take three months for the
whole table to be converted. During those three months, partitions containing
64 files will still be usable, but queries taking advantage of buckets
through the TABLESAMPLE feature will be broken for those partitions.
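For illustration, a bucket-sampling query of the kind affected might look like the sketch below (column names other than hostname/sequence, and the partition values, are examples, not a prescription):

```sql
-- Read roughly 1/256th of one hourly partition by sampling a single
-- bucket on the clustering columns. Against a partition still laid out
-- in 64 files, a BUCKET x OUT OF 256 clause will not match the file
-- layout, so such queries break until that partition ages out.
SELECT hostname, sequence
FROM wmf.webrequest
TABLESAMPLE (BUCKET 1 OUT OF 256 ON hostname, sequence)
WHERE webrequest_source = 'text'
  AND year = 2020 AND month = 11 AND day = 23 AND hour = 0;
```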
Don't hesitate to reach out if you have questions :)
--
Joseph Allemandou (joal) (he / him) on behalf of the Analytics-Engineering
team
Staff Data Engineer
Wikimedia Foundation
Hi everybody,
we are going to expand the available RAM on an-coord1001, the host in the
Analytics infrastructure that runs Hive/Presto/Oozie/Airflow. The procedure
should last 30 to 40 minutes in the optimistic case, and it will involve
shutting down the host (hence all daemons running on it) to allow SRE to
install the new RAM modules.
As always please reach out to us if this impacts your work.
Thanks in advance,
Luca (on behalf of the Analytics team)