Deprecation of Spark 2 scheduled for July 5th


The Data Engineering team is planning to deprecate Spark 2 on July 5th, 2023. Its replacement, Spark 3, is already available, and all of our production data pipelines have been migrated to it successfully. We have also helped several other teams migrate their Spark 2 pipelines to Spark 3, but there may still be Spark 2 jobs configured in code outside of our control.


We therefore encourage you to review any Spark jobs that you run and verify that they have been upgraded to work with Spark 3. In most cases, this will mean checking that they invoke Spark through one of the supported command-line interfaces, such as spark3-submit or pyspark3. In some cases, it may also mean upgrading your conda environments on the stats servers from anaconda-wmf to conda-analytics, if you have not already done so.
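
If it helps with that review, here is a minimal sketch of a script that searches a directory of job definitions for references that may still point at Spark 2. It is only an illustration, not an official tool: the Spark 2 command names in the pattern (spark2-submit, spark2-shell, spark2-sql, pyspark2) are assumptions about how older jobs might be invoked, and the function names are hypothetical. Please adjust the pattern to match whatever your own jobs actually call.

```python
import re
from pathlib import Path

# ASSUMPTION: the spark2-* and pyspark2 names are guesses at legacy command
# names. Only anaconda-wmf is named in the announcement as something to move
# away from. Edit this pattern to suit your own jobs.
LEGACY_PATTERN = re.compile(
    r"\b(spark2-submit|spark2-shell|spark2-sql|pyspark2|anaconda-wmf)\b"
)


def find_legacy_references(text):
    """Return (line_number, matched_name) pairs for each legacy reference."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in LEGACY_PATTERN.finditer(line):
            hits.append((number, match.group(1)))
    return hits


def scan_tree(root):
    """Scan every readable file under root; map file path to its hits."""
    results = {}
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        try:
            # Undecodable bytes are ignored so binary files do not raise.
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # Skip files we are not permitted to read.
        hits = find_legacy_references(text)
        if hits:
            results[str(path)] = hits
    return results
```

Calling scan_tree on a checkout of your job repository returns a dictionary keyed by file path, with the line numbers and matched names for each file. An empty result only means that none of the assumed names were found, so it is not proof that a job already runs on Spark 3.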


The specific change scheduled for July 5th is a switch of the Spark shuffler version used by YARN from 2 to 3. This should bring significant performance benefits for existing Spark 3 jobs, but any Spark 2 jobs that attempt to use the new shuffler will very likely fail.


Please do reach out to the Data Engineering team if you have any queries or concerns about this change, or would like help identifying whether you are likely to be affected.

--
Ben Tullis (he/him)
Senior Site Reliability Engineer
Wikimedia Foundation