Russell Spitzer's Blog

Some guy's blog

Can I use Spark SQL to get Low Latency Cassandra Requests?

I had a great time at DataStax Accelerate and got asked a lot of great questions about Cassandra and Spark. I'll post some of the most common ones and their answers here for posterity.

Can I use Spark SQL to get Low Latency Cassandra Requests?

Read Post

What language should I use with Spark?

I had a great time at DataStax Accelerate and got asked a lot of great questions about Cassandra and Spark. I'll post some of the most common ones and their answers here for posterity.

What language should I use when working with Spark?

One of the greatest features of Spark is the DataFrame API. It's amazing and available in a variety of languages: R, Python, Java, Scala, and there is even work being done in .NET! One of the drawbacks of RDDs was that depending on the language you chose, your performance could be dramatically different (Scala and Java = fast; Python = slow), but with DataFrames every language performs nearly identically (with some caveats).
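To make that parity concrete, here is a minimal PySpark sketch; the same filter and count written in Scala or Java would compile to an identical Catalyst plan, so the heavy lifting happens in the JVM either way:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

# The plan below is built in Python but optimized and executed by Catalyst
# in the JVM, just as it would be for Scala, Java, or R.
df = spark.range(1_000_000)
even_count = df.filter(F.col("id") % 2 == 0).count()
print(even_count)
```

The main caveat is Python UDFs, which drop execution back out of the JVM and reintroduce per-row overhead.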

Read Post

Where does my Spark Output go?

“My output worked in local mode, but now it’s all gone … where is it?”

Read Post

DSE Direct Join Improving Catalyst Joins with Cassandra

In DSE 6.0 we bring an exciting new optimization to Apache Spark's Catalyst engine. Previously, when joining against a Cassandra table, Catalyst was forced in all cases to perform a full table scan, which can be extremely inefficient compared to doing point lookups against the table. In the RDD API we added a function, joinWithCassandraTable, which performs this optimized join, but prior to DSE 6.0 there was no way to use it from Catalyst. Now in 6.0 a joinWithCassandraTable is performed automatically in Spark SQL and DataFrames.
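As a rough sketch of what this looks like from the DataFrame side (keyspace, table, and column names here are hypothetical, and the automatic rewrite only applies on DSE 6.0+):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A small DataFrame of keys we want to look up.
keys = spark.createDataFrame([(1,), (2,), (3,)], ["user_id"])

# Assumes the Spark Cassandra Connector is on the classpath;
# "demo"/"users" are hypothetical names.
users = (spark.read
         .format("org.apache.spark.sql.cassandra")
         .options(keyspace="demo", table="users")
         .load())

# On DSE 6.0 the optimizer can turn this join into point lookups
# against the users table instead of a full scan.
joined = keys.join(users, "user_id")
joined.show()
```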

Read Post

Spark Partitions and the 2GB Limit

Did you know Spark has a 2GB architectural limit on certain memory structures? I didn't. Then I was helpfully pointed to SPARK-6235, which points out there are several places in the Spark code which use byte arrays and byte buffers. These objects are sized with an Int, which means anything larger than MAX_INT bytes (about 2GB) will cause failures. In practice this usually means a user running into this issue will need to fix their data's partitioning.
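The usual fix is simply to spread the same data over more partitions. A minimal PySpark sketch (the input path is hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/data/events")  # hypothetical input path

# Inspect the current layout, then raise the partition count so no
# single partition's serialized bytes approach the 2GB ceiling.
print(df.rdd.getNumPartitions())
smaller = df.repartition(2000)  # more partitions -> less data per partition
```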

Read Post

Secrets of the Scala REPL

Scala 2.11 from here on out!

Read Post

How Spark Data Locality Works

Apache Spark is the data processing system that lets you have in-memory analytics and data locality, but how does that work? How does Spark know where the data is? The Spark UI is full of messages talking about NODE_LOCAL, ANY or PROCESS_LOCAL. What are these things and how do they get set? Let’s explore how Spark handles data locality when processing data.
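For context, the scheduler's willingness to hold out for a better locality level is tunable. A minimal sketch of those knobs (the values shown are the defaults):

```python
from pyspark.sql import SparkSession

# spark.locality.wait controls how long the scheduler holds a task hoping
# for a PROCESS_LOCAL or NODE_LOCAL slot before falling back to a less
# local level such as ANY.
spark = (SparkSession.builder
         .appName("locality-demo")
         .config("spark.locality.wait", "3s")       # global wait; 3s is the default
         .config("spark.locality.wait.node", "3s")  # per-level override for NODE_LOCAL
         .getOrCreate())
```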

Read Post

Dealing with Large Spark Partitions

One of the biggest issues with working with Spark and Cassandra is dealing with large partitions. There are several issues we need to overcome before we can really handle the challenge well. I'm going to use this blog post as a way of formalizing my thoughts on the issue. Let's get into the details.

Read Post

Spark Thrift Server Basics and a History

The Spark (SQL) Thrift Server is an excellent tool built on HiveServer2 that allows multiple remote clients to access Spark. It provides a generic JDBC endpoint that lets any client, including BI tools, connect and access the power of Spark. Let's talk about how it came to be and why you should use it.
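For a sense of how a remote client talks to it, here is a sketch using the third-party PyHive package (an assumption on my part; any HiveServer2-compatible client works) against a Thrift Server on its default port, 10000, with a hypothetical table name:

```python
from pyhive import hive  # pip install pyhive

# The Thrift Server speaks the HiveServer2 protocol, so a Hive client
# can connect to it directly.
conn = hive.connect(host="localhost", port=10000)
cursor = conn.cursor()
cursor.execute("SELECT * FROM demo.users LIMIT 10")  # hypothetical table
for row in cursor.fetchall():
    print(row)
```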

Read Post

Passing Spark Cassandra Connector Options in Pyspark

Pyspark provides a great wrapper for DataFrame access, but it does come with a few little quirks. One that you may run into is trying to pass options to a DataFrame reader when the option keys include punctuation in their names.
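A minimal sketch of the quirk and the usual workaround (keyspace and table names are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Keys like "spark.cassandra.connection.host" contain dots, so they can't
# be written as Python keyword arguments. Unpacking a dict with **
# sidesteps the restriction.
opts = {
    "keyspace": "demo",   # hypothetical names
    "table": "users",
    "spark.cassandra.connection.host": "127.0.0.1",
}

df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(**opts)  # options(spark.cassandra...=...) would be a SyntaxError
      .load())
```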

Read Post