“My output worked in local mode, but now it’s all gone … where is it?”
One of my favorite, and perhaps most common, debugging
techniques is the good old
println("Got to this point"). Try this in
Spark and you may find a lot of your output has unfortunately disappeared … or so it seems.
Let’s take a quick example in the Scala Shell using the
Local Spark Master.
scala> println("Hello World")
Hello World

scala> sc.parallelize(1 to 2).foreach(println)
2
1
Everything worked exactly the way we wanted! All of our output appeared exactly how we expected in the shell. But let’s see what happens when we run the Spark Shell with the Standalone (or DSE) Spark Master.
scala> println("Hello World")
Hello World

scala> sc.parallelize(1 to 2).foreach(println)

scala>
Where did it go? Did those DSE devs just break Spark? (Hint: No.)
This is expected behavior from Spark! So why did our output vanish? It
didn’t, it just ended up somewhere else!
println writes to
STDOUT, but only to the
STDOUT of the process where the code actually runs.
In the above example we actually have 2 different processes running user
code: the Spark Shell (acting as the Spark Driver) and the Spark Executor.
The Executor is the process actually running our remote code in Spark.
Since the Executor runs the
println inside the foreach, the output
goes to the Executor’s
STDOUT, not the Spark Shell’s. But this does
not mean our output is lost!
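As an aside, if we actually want the results back in the shell, one common pattern (just a sketch, and only sensible for small results, since it pulls all the data to the Driver) is to collect the RDD first so the println runs in the Driver process:

scala> sc.parallelize(1 to 2).collect().foreach(println)
1
2

Because foreach here is called on a local Array rather than an RDD, it runs in the Shell’s own process and the output appears right where we expect it.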
The Executor process sends its output to a special set of files in its
working directory. By default this directory is a place that looks like
/var/lib/spark/worker/app-#/executor#/std[out|err] on DSE and
work/app-#/executor#/stdout in my Stand Alone OSS Spark install.
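To figure out which app-# directory belongs to our shell, one trick (assuming a Standalone master, where the application ID matches the directory name) is to ask the SparkContext directly:

scala> sc.applicationId
res0: String = app-20181127160938-0000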
Let’s take a look:

16:37:18 ➜ ~/SparkInstalls/spark-2.2.1-bin-hadoop2.7 cat work/app-20181127160938-0000/0/stdout
2
1
Our output! Exactly where we told it to go:
STDOUT, just not the Shell’s STDOUT.
This is just one of those little things we always need to be aware of when we are running
code in a distributed framework. Sometimes our code doesn’t run in the
process we expect!