Deprecation of Spark v2 scheduled for July 5th - Analytics

23 Jun 2023

The Data Engineering team is planning to deprecate Spark 2 on July 5th, 
2023. Its replacement, Spark 3, is already available, and all of our 
production data pipelines have been migrated successfully to this new 
version 
https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Spark/Migration_to_Spark_3. 
We have also assisted in the migration of several other teams’ Spark 2 
pipelines to Spark 3, but there may still be other Spark 2 jobs that are 
configured in code outside of our control.
We therefore encourage you to review any Spark jobs that 
you run, to verify that they have been upgraded to work with Spark 3. In 
most cases, this will mean checking that the command-line interfaces for 
Spark 
https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Cluster/Spark#Command-line_interfaces use 
one of the supported forms, such as spark3-submit or pyspark3. In some 
cases this may also mean upgrading your conda environments 
https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Conda#Migrating_from_anaconda-wmf_to_conda-analytics on 
the stats servers from anaconda-wmf to conda-analytics, if you have not 
already done so.
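As a rough aid for that review, the sketch below scans job scripts for 
command names that look like Spark 2 entry points. The legacy names it 
matches (anything of the form spark2-<word>, plus pyspark2) are an 
assumption about how older jobs may be invoked, not a list taken from this 
announcement, and the function names and file suffixes are hypothetical. 
Adjust them to match your own code.

```python
import re
from pathlib import Path

# Assumption: legacy Spark 2 entry points look like "spark2-<word>" or "pyspark2".
# Edit this pattern to match whatever your own jobs actually call.
LEGACY_PATTERN = re.compile(r"(?<![\w-])(spark2-[a-z]+|pyspark2)(?![\w-])")


def find_legacy_spark_commands(text):
    """Return (line_number, command) pairs for each suspected Spark 2 call."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in LEGACY_PATTERN.finditer(line):
            hits.append((line_number, match.group(1)))
    return hits


def scan_tree(root, suffixes=(".sh", ".py", ".cfg")):
    """Scan files under root (hypothetical suffix list) and map path -> hits."""
    results = {}
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            hits = find_legacy_spark_commands(path.read_text(errors="ignore"))
            if hits:
                results[str(path)] = hits
    return results
```

Calling scan_tree on a checkout of your job repository returns a dictionary 
keyed by file path. An empty result only means that none of the assumed 
names were found, not that every job is Spark 3 ready.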
The specific change that is scheduled to happen on July 5th is a switch 
of the Spark shuffler version used by YARN 
https://phabricator.wikimedia.org/T332765 from 2 to 3. This should 
bring significant performance benefits for existing Spark 3 jobs, but it 
is more than likely that any Spark 2 jobs attempting to use this new 
shuffler will fail.
Please do reach out 
https://wikitech.wikimedia.org/wiki/Data_Engineering/Contact to the 
Data Engineering team if you have any queries or concerns about this 
change, or would like help in identifying whether or not you are likely 
to be affected.
-- 
*Ben Tullis* (he/him)
Senior Site Reliability Engineer
Wikimedia Foundation https://wikimediafoundation.org/