
Spark loves distributed filesystems, but sometimes you just want to write to wherever the driver is running. You may try to use a file:// URI or something of that nature and run into a lot of strange errors or find files located in random places. Never fear, there is a simple solution with toLocalIterator.


The key issue here is that when you use a file:// location, every machine assumes you are talking about its own local filesystem. This can lead to madness; what you actually want is for all of the data to be written to a single file on the driver's local filesystem.
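
For example, a naive attempt along these lines (a sketch; the output path is just an example) leaves part files scattered across whichever machines the executors run on, rather than producing a single file on the driver:

// Each executor interprets file:///tmp/output as a path on ITS OWN disk,
// so the part files end up scattered across the cluster instead of in one place.
sc.parallelize(1 to 100000)
  .map(num => s"$num::Line")
  .saveAsTextFile("file:///tmp/output")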

A first instinct may be to call collect on the RDD before attempting to write it to a file, but this has a distinct limitation. When you use collect, every partition is moved from the remote cluster to the driver machine at the same time. This means that with collect you can never write a file larger than the driver heap.
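
As a sketch, the collect version would look something like the following; it works fine for small data, but the entire RDD is materialized in the driver's memory at once:

import java.io._

val pw = new PrintWriter(new File("LocalText"))
// collect() pulls EVERY partition onto the driver at the same time,
// so this fails with an OutOfMemoryError once the data exceeds the driver heap.
sc.parallelize(1 to 100000)
  .map(num => s"$num::Line")
  .collect()
  .foreach(line => pw.println(line))
pw.close()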

toLocalIterator lets us get around this by pulling down only a single Spark partition's worth of data to the driver at a time. This means you can write as large a file as you have disk space for, as long as no single Spark partition is bigger than the driver heap.

Example of Writing an RDD using toLocalIterator

scala> import java.io._
import java.io._

scala> val pw = new PrintWriter(new File("LocalText"))
pw: java.io.PrintWriter = java.io.PrintWriter@5879197

scala> val rdd = sc.parallelize(1 to 100000).map( num => s"$num::Line")

scala> for (line <- rdd.toLocalIterator) { pw.println(line) }

scala> pw.close
scala> :quit

15:40:11   ~/SparkInstalls/spark-1.5.1-bin-hadoop1  tail LocalText
99991::Line
99992::Line
99993::Line
99994::Line
99995::Line
99996::Line
99997::Line
99998::Line
99999::Line
100000::Line

In the Application UI you'll notice that toLocalIterator runs a job on each Spark partition one at a time, rather than one single job as with collect.
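
Put together outside the REPL, a slightly more robust version of the same pattern might look like this (a sketch; the BufferedWriter and try/finally are just defensive additions on top of the approach above):

import java.io.{BufferedWriter, File, FileWriter}

val writer = new BufferedWriter(new FileWriter(new File("LocalText")))
try {
  // toLocalIterator fetches one partition at a time, so the driver only ever
  // holds a single partition's worth of data in memory.
  sc.parallelize(1 to 100000)
    .map(num => s"$num::Line")
    .toLocalIterator
    .foreach { line =>
      writer.write(line)
      writer.newLine()
    }
} finally {
  writer.close()
}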