Deprecation of Spark v2 scheduled for July 5th
The Data Engineering team is planning to deprecate Spark 2 on July 5th
2023. Its replacement, Spark 3, is already available, and all of our
production data pipelines have been migrated successfully to this new
version
<https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Spark/Migration_to_Spark_3>.
We have also assisted in the migration of several other teams’ Spark 2
pipelines to Spark 3, but there may still be other Spark 2 jobs that are
configured in code outside of our control.
We encourage you, therefore, to review any Spark jobs that
you run, to verify that they have been upgraded to work with Spark 3. In
most cases, this will mean checking that the command-line interfaces for
Spark
<https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Cluster/Spark#Command-line_interfaces> use
one of the supported forms, such as spark3-submit or pyspark3. In some
cases this may also mean upgrading your conda environments
<https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Conda#Migrating_from_anaconda-wmf_to_conda-analytics> on
the stats servers from anaconda-wmf to conda-analytics, if you have not
already done so.
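As a rough aid for this review, the following is a minimal sketch of a
script that searches your own job scripts for Spark 2 style commands. The
legacy command names in LEGACY_COMMANDS, the file suffixes, and the helper
names are assumptions made for illustration and are not taken from this
announcement; please adjust them to match how your jobs are actually
launched.

```python
import re
from pathlib import Path

# Assumed legacy Spark 2 command names (hypothetical list, not from the
# announcement). Edit this tuple to match your own environment.
LEGACY_COMMANDS = ("spark2-submit", "spark2-shell", "spark2-sql", "pyspark2")

# Match a legacy command only when it is not part of a longer word.
_PATTERN = re.compile(
    r"(?<![\w-])(" + "|".join(map(re.escape, LEGACY_COMMANDS)) + r")(?![\w-])"
)


def find_legacy_invocations(text):
    """Return (line_number, command) pairs for each legacy command in text."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            hits.append((number, match.group(1)))
    return hits


def scan_tree(root, suffixes=(".sh", ".py")):
    """Scan files under root and map each file path to its legacy hits."""
    results = {}
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            hits = find_legacy_invocations(path.read_text(errors="ignore"))
            if hits:
                results[str(path)] = hits
    return results


if __name__ == "__main__":
    # Scan the current directory and print each suspected Spark 2 usage.
    for file_name, hits in scan_tree(".").items():
        for line_number, command in hits:
            print(f"{file_name}:{line_number}: {command}")
```

Any file that the script reports is only a candidate for review; a clean
result does not guarantee that a job is ready for Spark 3.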
The specific change that is scheduled to happen on July 5th is a switch
of the Spark shuffler version used by YARN
<https://phabricator.wikimedia.org/T332765> from 2 to 3. This should
bring significant performance benefits for existing spark3 jobs, but it
is more than likely that any spark2 jobs attempting to use this new
shuffler will fail.
Please do reach out
<https://wikitech.wikimedia.org/wiki/Data_Engineering/Contact> to the
Data Engineering team if you have any queries or concerns about this
change, or would like help in identifying whether or not you are likely
to be affected.
--
*Ben Tullis* (he/him)
Senior Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>