Some guy's blog
Classpaths are almost always the first error folks run into when writing custom applications for Spark. The difficult usually centers around the fact that there are many Spark processes and their special class-loaders. Most folks get around these issue by building Fat Jars with sbt assembly but not everyone needs to do this.
Lets start with a quick overview of whats going on in Spark class-loaders. On the Driver and on the Executor we have two class-loaders.
The System Class-loader By default this is where your SparkAssembly jar is going to get loaded as well as all other core spark components
The Dynamic App Class-loader Anything added by –jars and –packages is going to be loaded here
When you are running your application, make sure that your code AND all of it’s
dependencies are present on both the Driver and on the Executor. Now while your code
most likely depends on Spark, Spark is on the classpath of both the driver and and the executor
already so there is no need to worry about that. Your application code will also
automatically be placed on the executor cp (if you use spark-submit
) so don’t worry about that.
For everything else you have two basic options:
--jars
and --packages
Using these commands you can specify specific external dependencies for your application. These can
include things like the Spark Cassandra Connector or Your Secret Sauce ML Lib Which Determines the
Cuteness of Cat Pics. The Driver will let the executors know about these resources. The executors will
pull the jar from Driver if necessarily, or some other source if specified (like hdfs://
or spark-packages
.) These libraries will be placed on the App class-loader.
Fat jars are basically an attempt to make a single jar that contains not just your application
code but also all the code for all the dependencies that code uses. If you built a Jar that depended
on org.apache.commons, your Fat Jar would actually contain folders org/apache/commons
and all
of the classes inside of it. This Fat Jar wouldn’t require the Commons jar on the classpath because
all of the classes are already there!
That does sound pretty nifty, but how do we actually build a Fat Jar? Most build systems have an Assembly plugin. For Scala sbt-assembly is the most common build system but the same general principles apply to all Fat Jar building. When actually using these tools we can run into some issues. The most common problem is that when we are building a Fat Jar we can run into duplication problems.
Normally there aren’t severe issues if multiple jars have similar (but slightly different versioned)
dependencies but assembly runs into an issue. After all, what should your assembly do if it tries
to add a file to org/apache/commons
like StringUtils.class but it sees to different versions of
this code on your build path? This leads to a deduplication error. 99% of the time if this happens
on a class file something is broken and you most likely would run into issues if you ran the code.
This error is usually a sign that your build is impossible and things will not work as you like but
sometimes it’s innocuous. For example, it’s fine to skip any files that have conflicts that just
talk about META-INF or pom.xml. After all you are building a new jar anyway so these other
inventory files are useless anyway.
For all other files you should first try your best to make a clean assembly. The way to do this is to first make sure that everything on the CP already is marked as “provided” (not included in your Fat Jar). This means nothing that starts with “spark” should be in your assembly (unless your distro happens to skip that lib.) If you do leave one of these libs in your jar, you will only suffer sadness.
If at all possible don’t make a Fat Jar, –jars and –packages should be simpler for most use cases.
If you do build a Fat Jar