Hi all,
We will be draining the Hadoop cluster of jobs on Tuesday, May 25, starting at
15:00 UTC (8am PDT) to update the operating system of the namenodes.
Maintenance should last less than 2 hours, and we'll announce when the
queue is back to accepting jobs.
View planned maintenance here
<https://wikitech.wikimedia.org/wiki/Analytics/Systems/Maintenance_Schedule>.
Reply to this email or comment on the task at
https://phabricator.wikimedia.org/T278423 if you have any questions or
concerns.
This maintenance could be done without downtime, but I'm opting to drain the
cluster to make recovery easier in case anything goes wrong.
Regards,
Razzi & the Data Engineering team
Hello!
We are trying to standardize the way we sanitize and retain event data in
Hive. For now, nothing will change for instrumentation data. What is
changing is that we are going to apply the same sanitization process to all
tables in the Hive event database, and then drop data older than 90 days
from all of those tables.
For analytics/instrumentation event tables, nothing is changing. If you
need to keep data longer than 90 days, you will need to add an entry to the
event_sanitized_analytics allowlist
<https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/refinery/+/refs/he…>
(this was previously named eventlogging/whitelist.yaml), as described here
<https://wikitech.wikimedia.org/wiki/Analytics/Systems/Event_Sanitization#Al…>.
For main / production event tables (e.g. mediawiki_revision_create), we are
now copying this data into the event_sanitized database. Tables to be
copied are listed in the event_sanitized_main allowlist
<https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/refinery/+/refs/he…>.
We will soon begin applying the same purging policy to all tables in the
event database. When that happens, main / production event tables in the
event database will no longer have data older than 90 days. If you need to
query data older than 90 days for these tables, you will find it in the
event_sanitized database.
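For example, here is a minimal sketch of what querying both databases could
look like from PySpark. The table name comes from the example above; the
partition columns (year, month) are assumptions, so check the actual table
schema before relying on them:

    from pyspark.sql import SparkSession

    # Minimal sketch: compare recent vs. sanitized historical event data.
    # Partition columns (year, month) below are illustrative assumptions.
    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Data from the last 90 days stays in the `event` database.
    recent = spark.sql("""
        SELECT *
        FROM event.mediawiki_revision_create
        WHERE year = 2021 AND month = 5
    """)

    # Older, sanitized data lives in the `event_sanitized` database.
    historical = spark.sql("""
        SELECT *
        FROM event_sanitized.mediawiki_revision_create
        WHERE year = 2020
    """)

    print(recent.count(), historical.count())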
With this change, all event table sanitization and retention in Hive is
handled in the same way.
Docs have been updated; you can read more here:
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Event_Sanitization
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Event_Data_retention
-Andrew Otto
SRE, Data Engineering
Hello!
tl;dr: We'd like to turn off Jupyter+Virtualenv (SWAP) in favor of
Jupyter+Conda (Newpyter) the week of May 3rd. Please help us test and
switch before then.
Over the last year, we've slowly been working on replacing the current
virtualenv-based JupyterHub system (formerly known as SWAP) with a new one
based on Conda <https://docs.conda.io/en/latest/> (AKA Newpyter).
Everything should be in place to switch over and decommission the
virtualenv-based system you are all used to. Before we do, we have to make
sure you are all using, and are OK with, the new setup!
We'd like to decommission Jupyter+Virtualenv (running on port 8000) the
week of May 3rd. In the meantime, please switch to Jupyter+Conda on port
8880. The documentation has been updated.
<https://wikitech.wikimedia.org/wiki/Analytics/Systems/Jupyter>
Summary of the changes:
- You will ssh tunnel to port 8880
<https://wikitech.wikimedia.org/wiki/Analytics/Systems/Jupyter#Access>
instead of port 8000.
- Your Notebook files will remain unchanged.
- Your local data files will remain unchanged.
- Your Python environment will change, so you may need to re-install
packages. See docs here
<https://wikitech.wikimedia.org/wiki/Analytics/Systems/Jupyter#Conda_environ…>
and here
<https://wikitech.wikimedia.org/wiki/Analytics/Systems/Anaconda#Installing_p…>
.
- The PySpark, Scala Spark, Spark SQL, and Spark R kernels will be
removed. If you currently use the PySpark kernels, please port your
notebooks to a regular Python kernel and use wmfdata-python to launch your
SparkSession (see the sketch after this list). Docs here
<https://wikitech.wikimedia.org/wiki/Analytics/Systems/Jupyter#PySpark>.
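For reference, here is a minimal sketch of launching a SparkSession from a
regular Python kernel. It uses plain PySpark rather than the wmfdata-python
helper described in the docs above, and the app name and config values are
illustrative assumptions, not recommended settings:

    from pyspark.sql import SparkSession

    # Minimal sketch: create a SparkSession from a regular Python kernel,
    # replacing the removed PySpark notebook kernels. The settings below
    # are illustrative assumptions; wmfdata-python handles this for you.
    spark = (
        SparkSession.builder
        .appName("my-notebook-analysis")        # hypothetical app name
        .master("yarn")                         # run on the Hadoop cluster
        .config("spark.executor.memory", "4g")  # example resource setting
        .enableHiveSupport()
        .getOrCreate()
    )

    # Once the session exists, Hive tables can be queried directly.
    spark.sql("SHOW DATABASES").show()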
Please reach out with any questions, and report issues on this ticket
<https://phabricator.wikimedia.org/T224658>. If we encounter any blockers
along the way, we will postpone the May 3rd deadline.
Thank you!
- Andrew Otto + Data Engineering