Spark 2 now available in Hadoop - Analytics

13 Nov 2017

Hi all!

We’ve recently made Spark 2.1 available in the Analytics Hadoop cluster.
It is installed on stat1004 and stat1005 alongside Spark 1.6.  To use Spark
2, run the spark2* executables (and pyspark2) rather than the usual
spark-shell, spark-submit, etc.
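
If you have wrapper scripts that call the Spark 1 executables by name, here
is a rough sketch of how you might switch them over.  It assumes the Spark 2
names follow the spark2* pattern described above (so spark-submit becomes
spark2-submit, and pyspark becomes pyspark2).  The helper names spark2_name
and resolve_spark2 are made up for illustration and are not part of any
installed tooling.

```python
import shutil


def spark2_name(spark1_name):
    """Map a Spark 1 executable name to its assumed Spark 2 counterpart.

    Assumption: names follow the "spark2*" / "pyspark2" pattern from the
    announcement, e.g. spark-shell -> spark2-shell.
    """
    if spark1_name == "pyspark":
        return "pyspark2"
    prefix = "spark-"
    if spark1_name.startswith(prefix):
        return "spark2-" + spark1_name[len(prefix):]
    raise ValueError("not a recognised Spark 1 executable: %r" % spark1_name)


def resolve_spark2(spark1_name):
    """Return the full path of the Spark 2 executable, or None if it is
    not on PATH (for example, on a host without Spark 2 installed)."""
    return shutil.which(spark2_name(spark1_name))
```

On a host where Spark 2 is not installed, resolve_spark2 returns None, so a
script can fall back to the Spark 1 executable or exit with a clear message.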

I’ve added a little bit of documentation
<https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Spark> about
this on wikitech.

We’d like to deploy Spark 2.2, but we first need to upgrade Hadoop to use
Java 8 rather than Java 7.  Hopefully this will happen in early 2018.

analytics/refinery/source
<https://github.com/wikimedia/analytics-refinery-source> still uses Spark
1, but we’d also like to update jobs and dependencies there to use Spark 2
soon.

Anyway, let me know if there are any questions.  Enjoy!

- Andrew Otto
  Systems Engineer, WMF