Hey there!
I was running SQL queries via PySpark (using the wmfdata package: https://github.com/neilpquinn/wmfdata/blob/master/wmfdata/hive.py) on SWAP when one of my queries failed with "java.lang.OutOfMemoryError: Java heap space".
After that, when I tried to call the spark.sql function again (via wmfdata.hive.run), it failed with "java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext."
When I tried to create a new Spark session using SparkSession.builder.getOrCreate (whether via wmfdata.spark.get_session or directly), it returned a SparkSession object properly, but calling the object's sql function still gave the "stopped SparkContext" error.
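For concreteness, here's a minimal sketch of the pattern I'm describing (the query is just a placeholder, not what I was actually running):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # The builder happily returns a SparkSession object...
    spark.sql("SELECT 1").show()
    # ...but this still fails with "Cannot call methods on a stopped SparkContext"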
Any idea what's going on? I assume restarting the notebook kernel would take care of the problem, but it seems like there has to be a better way to recover.
Thank you!
Whoa—I just got the same stopped SparkContext error on the query even after restarting the notebook, without an intermediate Java heap space error. That seems very strange to me.
Hey Neil,
There were two Yarn jobs running related to your notebooks; I just killed them. Let's see if that solves the problem (you might need to restart your notebook again). If not, let's open a task and investigate :)
Luca
Hi Luca!
Those were separate Yarn jobs I started later. When I got this error, I found that the Yarn job corresponding to the SparkContext was marked as "successful", but I still couldn't get SparkSession.builder.getOrCreate to open a new one.
Any idea what might have caused that or how I could recover without restarting the notebook, which could mean losing a lot of in-progress work? I had already restarted that kernel so I don't know if I'll encounter this problem again. If I do, I'll file a task.
Hm, interesting! I don't think many of us have used SparkSession.builder.getOrCreate repeatedly in the same process. What happens if you manually stop the Spark session first with session.stop() (https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=sparksession#pyspark.sql.SparkSession.stop), or maybe try to explicitly create a new session via newSession() (https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=sparksession#pyspark.sql.SparkSession.newSession)?
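In other words, something like this (completely untested sketch; session here stands for whatever SparkSession object you already have):

    from pyspark.sql import SparkSession

    # Option 1: stop the existing session, then ask the builder for a fresh one
    session.stop()
    session = SparkSession.builder.getOrCreate()

    # Option 2: fork a new session (separate SQL config and temp views,
    # but it shares the same underlying SparkContext)
    other = session.newSession()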
Hello,
This discussion is probably not of wide interest to this public list; I suggest we move it to analytics-internal.
Thanks,
Nuria
Good suggestions, Andrew! I'll try those if I encounter this again.
Nuria, we had a discussion about the appropriate places to ask questions about internal systems in October 2018, and the verdict (supported by you) was that we should use this list or the public IRC channel.
If you want to revisit that decision, I'd suggest you consult that thread first (the subject was "Where to ask questions about internal analytics tools") because I included a detailed list of pros and cons of different channels to start the discussion. In that list, I even mentioned that such discussions on this channel could annoy subscribers who don't have access to these systems 🙂
If you still want us to use a different list, we can certainly do that. If so, please send my team a message and update the documentation I added (https://wikitech.wikimedia.org/wiki/Analytics#Contact) so it stays clear.
> and the verdict (supported by you) was that we should use this list or the public IRC channel.

Indeed, eh? I suggest we revisit that and send questions to analytics-internal, but if others disagree, I am fine with either.
my 2 cents: I prefer the public list as the conversation can be relevant to my team (Research) as well. At the moment, if I see something is not of immediate interest to me, I mute the thread. That's quite easy/cheap on my end. If the frequency of this kind of question on the list increases significantly, I'd suggest adding a tag to the subject line that allows people to filter appropriately.
Leila
I ran into this problem again, and I found that neither session.stop nor newSession got rid of the error. So it's still not clear how to recover from a crashed(?) Spark session.
On the other hand, I did figure out why my sessions were crashing in the first place, so hopefully recovering from that will be a rare need. The reason is that wmfdata doesn't modify the default Spark settings when it starts a new session (https://github.com/neilpquinn/wmfdata/blob/master/wmfdata/spark.py#L60), which was (for example) causing it to start executors with only ~400 MiB of memory each.
I'm definitely going to change that, but it's not completely clear what the recommended settings for our cluster are. I cataloged the different recommendations at https://phabricator.wikimedia.org/T245097, and it would be very helpful if one of y'all could give some clear recommendations about what the settings should be for local SWAP jobs, YARN jobs, and "large" YARN jobs. For example, is it important to increase spark.sql.shuffle.partitions for YARN jobs? Is it reasonable to use 8 GiB of driver memory for a local job when the SWAP servers only have 64 GiB total?
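To make it concrete, this is roughly the kind of thing I'm planning to have wmfdata do; the specific numbers below are just my guesses, and they're exactly what I'd like recommendations on:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("yarn")
        .appName("wmfdata")
        # All of these values are guesses, not recommendations:
        .config("spark.driver.memory", "2g")
        .config("spark.executor.memory", "4g")
        .config("spark.executor.cores", "4")
        .config("spark.dynamicAllocation.maxExecutors", "64")
        .config("spark.sql.shuffle.partitions", "256")
        .getOrCreate()
    )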
Thank you!
Bump!
Analytics team, I'm eager to have input from y'all about the best Spark settings to use.
Another update: I'm continuing to encounter these Spark errors and have trouble recovering from them, even when I use proper settings. I've filed T245713 (https://phabricator.wikimedia.org/T245713) to discuss this further. The specific errors and behavior I'm seeing (for example, whether explicitly calling session.stop allows a new functioning session to be created) are not consistent, so I'm still trying to make sense of it.
I would greatly appreciate any input or help, even if it's identifying places where my description doesn't make sense.
Hi Neil,
I added the Analytics tag to https://phabricator.wikimedia.org/T245097, and also thanks for filing https://phabricator.wikimedia.org/T245713. We periodically review tasks in our incoming queue, so we should be able to help soon, but it may depend on priorities.
Luca
Hello:
Following up on this issue: we think many of Neil's issues come from the fact that a Kerberos ticket expires after 24 hours, and once it does, your Spark session no longer works. We will be extending ticket expiration somewhat, to 2-3 days, but the main point to take home is that Jupyter notebooks do not live forever in the state you leave them in; a kernel restart might be needed.
Please take a look at the ticket: https://phabricator.wikimedia.org/T246132
If anybody has been having similar problems, please chime in.
Thanks,
Nuria