Russell Spitzer's Blog

Some guy's blog

DSE Direct Join: Improving Catalyst Joins with Cassandra

In DSE 6.0 we bring an exciting new optimization to Apache Spark’s Catalyst engine. Previously, a join against a Cassandra table always forced Catalyst to perform a full table scan, which can be extremely inefficient compared to doing point lookups against the table. The RDD API has a function, joinWithCassandraTable, that performs this optimized join, but prior to DSE 6.0 there was no way to use it from Catalyst. Now in 6.0, a joinWithCassandraTable is performed automatically in Spark SQL and DataFrames.
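
To make the difference concrete, here is a toy model in plain Python. It is not the DSE or Spark Cassandra Connector API: the dict standing in for a Cassandra table and both function names are made up for illustration. It only counts how many rows each strategy has to read to join a handful of keys.

```python
from typing import Dict, List, Tuple

Row = Tuple[int, str]


def full_scan_join(keys: List[int], table: Dict[int, str]) -> Tuple[List[Row], int]:
    """Hypothetical full-scan join: read every row, keep the ones we want."""
    wanted = set(keys)
    rows_read = 0
    matches: List[Row] = []
    for key, value in table.items():
        rows_read += 1  # every row in the table is touched
        if key in wanted:
            matches.append((key, value))
    return matches, rows_read


def point_lookup_join(keys: List[int], table: Dict[int, str]) -> Tuple[List[Row], int]:
    """Hypothetical direct join: one keyed lookup per join key."""
    rows_read = 0
    matches: List[Row] = []
    for key in keys:
        rows_read += 1  # one request per join key, regardless of table size
        if key in table:
            matches.append((key, table[key]))
    return matches, rows_read


# A stand-in "table" of 1000 rows keyed by partition key.
table = {i: f"row-{i}" for i in range(1000)}
keys = [3, 500, 999]

scan_matches, scan_reads = full_scan_join(keys, table)
lookup_matches, lookup_reads = point_lookup_join(keys, table)
print(scan_reads, lookup_reads)
```

Both strategies return the same rows. The scan's cost grows with the size of the table, while the lookup's cost grows with the number of join keys, which is why the point-lookup approach tends to win when the key set is small relative to the table.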

Read Post

Spark Partitions and the 2GB Limit

Did you know Spark has a 2GB architectural limit on certain memory structures? I didn’t. Then I was helpfully pointed to SPARK-6235, which notes that several places in the Spark code use byte arrays and byte buffers. These objects are sized with an Int, so anything larger than Integer.MAX_VALUE bytes (about 2GB) will cause failures. In practice, a user who runs into this issue will usually need to fix their data’s partitioning.
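
As a back-of-the-envelope sketch of what "fixing the partitioning" means, the arithmetic below estimates how many partitions a dataset needs so that each one stays under a chosen size. The function name and the 128 MiB default target are my own assumptions, not anything from Spark, and the estimate assumes the data is spread evenly; a skewed key can still produce an oversized partition.

```python
# Largest size a signed 32-bit Int can describe, just under 2 GiB.
INT_MAX_BYTES = 2**31 - 1


def min_partitions(total_bytes: int, target_partition_bytes: int = 128 * 1024**2) -> int:
    """Hypothetical helper: smallest partition count keeping each partition
    at or below target_partition_bytes, assuming evenly distributed data."""
    if total_bytes < 0:
        raise ValueError("total_bytes must be non-negative")
    if not 0 < target_partition_bytes <= INT_MAX_BYTES:
        raise ValueError("target must be positive and no larger than the Int limit")
    # Integer ceiling division avoids floating-point rounding on large sizes.
    return max(1, (total_bytes + target_partition_bytes - 1) // target_partition_bytes)


one_tib = 1024**4
comfortable = min_partitions(one_tib)                  # 128 MiB partitions
bare_minimum = min_partitions(one_tib, INT_MAX_BYTES)  # right at the limit
print(comfortable, bare_minimum)
```

Sizing partitions right at the limit leaves no headroom, so a much smaller target is usually the safer choice.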

Read Post

Secrets of the Scala REPL

Scala 2.11 from here on out!

Read Post

How Spark Data Locality Works

Apache Spark is the data processing system that lets you have in-memory analytics and data locality, but how does that work? How does Spark know where the data is? The Spark UI is full of messages talking about NODE_LOCAL, ANY or PROCESS_LOCAL. What are these things and how do they get set? Let’s explore how Spark handles data locality when processing data.

Read Post

Dealing with Large Spark Partitions

One of the biggest issues when working with Spark and Cassandra is dealing with large partitions. There are several issues we need to overcome before we can really handle the challenge well. I’m going to use this blog post as a way of formalizing my thoughts on the issue. Let’s get into the details.

Read Post

Spark Thrift Server Basics and a History

Spark (SQL) Thrift Server is an excellent tool, built on HiveServer2, that allows multiple remote clients to access Spark. It provides a generic JDBC endpoint that lets any client, including BI tools, connect and access the power of Spark. Let’s talk about how it came to be and why you should use it.

Read Post

Passing Spark Cassandra Connector Options in PySpark

PySpark provides a great wrapper for DataFrame access but does come with a few little quirks. One that you may run into is trying to pass options to a DataFrame when the option key names include punctuation.
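
The quirk comes from Python itself: a keyword argument name must be a valid identifier, so a key containing dots cannot be written inline. The sketch below uses a hypothetical stand-in function named options that accepts keyword arguments the way PySpark's DataFrameReader.options does, so it runs without Spark. The option key is only illustrative; check the connector documentation for the exact names in your version.

```python
def options(**kwargs):
    """Hypothetical stand-in for a method that takes options as keyword arguments."""
    return dict(kwargs)


# Writing the dotted key inline is rejected by the Python parser.
try:
    compile('options(spark.cassandra.input.split.size_in_mb="64")', "<example>", "eval")
    literal_keyword_ok = True
except SyntaxError:
    literal_keyword_ok = False

# Putting the key in a dict and unpacking it with ** works, because the
# key is then an ordinary string rather than an identifier in source code.
connector_opts = {"spark.cassandra.input.split.size_in_mb": "64"}
applied = options(table="kv", keyspace="test", **connector_opts)
print(literal_keyword_ok, applied)
```

The same dict-unpacking pattern should carry over to any call that collects its options through keyword arguments.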

Read Post

Scratch Project - Gravity Chomper

Project Walk-through for the Scratch Class Taught at The Made. A simple character moves around eating floating objects. Gravity can be controlled to either draw or repel nearby objects.

Read Post

Concurrency in Spark

How does Spark actually execute code, and how can I do concurrent work within a Spark execution pipeline? Spark is complicated, concurrency is complicated, and distributed systems are also complicated. To answer some common questions, I’m going to go over some basic details on what Spark is doing and how you can control it.

Read Post

Ordering with saveToCassandra

Responding to a question on Stack Overflow: http://stackoverflow.com/questions/42020173/savetocassandra-is-there-any-ordering-in-which-the-rows-are-written/42031425#42031425

Read Post