Deprecation of Spark v2 scheduled for July 5th
The Data Engineering team is planning to deprecate Spark 2 on July 5th, 2023. Its replacement, Spark 3, is already available, and all of our production data pipelines have been migrated successfully to this new version (https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Spark/Migration_to_Spark_3). We have also assisted in the migration of several other teams' Spark 2 pipelines to Spark 3, but there may still be other Spark 2 jobs that are configured in code outside of our control.
We therefore encourage you to review any Spark jobs that you run and verify that they have been upgraded to work with Spark 3. In most cases, this will mean checking that the command-line interfaces for Spark (https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Cluster/Spark#Command-line_interfaces) use one of the supported forms, such as spark3-submit or pyspark3. In some cases this may also mean upgrading your conda environments on the stats servers from anaconda-wmf to conda-analytics (https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Conda#Migrating_from_anaconda-wmf_to_conda-analytics), if you have not already done so.
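As a rough aid for the review described above, the following Python sketch scans scripts for Spark 2 style launcher names. The launcher names in SPARK2_COMMANDS and the file suffixes in scan_tree are assumptions made for illustration, not an authoritative list, so adjust them to match what your own jobs call. An empty result does not prove that a job is Spark 3 ready.

```python
import re
from pathlib import Path

# Assumed Spark 2 launcher names (illustrative only); edit to suit your jobs.
SPARK2_COMMANDS = ("spark2-submit", "spark2-shell", "spark2-sql", "pyspark2")

# Match a launcher name only when it is not embedded in a longer token.
_PATTERN = re.compile(
    r"(?<![\w-])(" + "|".join(re.escape(c) for c in SPARK2_COMMANDS) + r")(?![\w-])"
)


def find_spark2_calls(text):
    """Return (line_number, command) pairs for each assumed Spark 2 launcher."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            hits.append((lineno, match.group(1)))
    return hits


def scan_tree(root, suffixes=(".sh", ".py", ".cron")):
    """Scan files under root; map each path with hits to its list of hits."""
    results = {}
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            hits = find_spark2_calls(path.read_text(errors="ignore"))
            if hits:
                results[str(path)] = hits
    return results
```

For example, calling scan_tree on the directory that holds your job scripts returns a dictionary keyed by file path, with the line numbers where an assumed Spark 2 launcher appears.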
The specific change scheduled for July 5th is a switch of the Spark shuffler version used by YARN from 2 to 3 (https://phabricator.wikimedia.org/T332765). This should bring significant performance benefits for existing Spark 3 jobs, but it is more than likely that any Spark 2 jobs attempting to use this new shuffler will fail.
Please do reach out to the Data Engineering team (https://wikitech.wikimedia.org/wiki/Data_Engineering/Contact) if you have any queries or concerns about this change, or would like help in identifying whether or not you are likely to be affected.