Russell Spitzer's Blog

Some guy's blog

Pruning and Spark DataSources, A Love Story

I was looking through a pile of Jiras and noticed an interesting complaint that DataFrame pruning was broken for the Spark Cassandra Connector. The ticket noted that even when very specific columns were selected, the Connector seemed to be pulling all of the columns from the source Cassandra table. This is surprising, since that particular part of the Connector code has rather heavy test coverage and nobody else has reported this feature not working. Compared to predicate pushdown, pruning is easy, so what went wrong?

Read Post

Debugging Catalyst and Predicate Pushdown with Spark Cassandra Connector


Making sure your code is actually pushing down predicates to C* is slightly confusing. In this post we’ll go over the basics of setting up debugging and how to work around a few common issues.

Read Post

Utilizing Multiple C* Clusters using the Spark Cassandra Connector Part II


Talking to Multiple Clusters, Now with Spark SQL

Read Post

Sbt Assembly (Fat Jars) With Spark


Classpaths are almost always the source of the first error folks run into when writing custom applications for Spark. The difficulty usually centers around the fact that there are many Spark processes, each with its own special class-loaders. Most folks get around these issues by building Fat Jars with sbt assembly, but not everyone needs to do this.

Read Post

Utilizing Multiple C* Clusters using the Spark Cassandra Connector


Most folks don’t know that the Spark Cassandra Connector is actually able to connect to multiple Cassandra clusters at the same time. This allows us to move data between Cassandra clusters or even manage multiple clusters from the same application (or even the Spark shell).

Read Post

Writing to the Driver FileSystem using Spark


Spark loves distributed filesystems, but sometimes you just want to write to wherever the driver is running. You may try to use a file:// path or something of that nature and run into a lot of strange errors or files located in random places. Never fear, there is a simple solution with toLocalIterator.
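As a rough sketch of the idea (not necessarily the code from the post), the snippet below streams an RDD back to the driver with toLocalIterator and writes it with plain Java IO. The object name, the writeOnDriver helper, the sample data, the local[2] master, and the temp-file target are all assumptions made for illustration, and the Spark SQL dependency is assumed to be on the classpath.

```scala
import java.io.PrintWriter
import java.nio.file.{Files, Path}

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object DriverLocalWrite {
  // Hypothetical helper: pulls the RDD to the driver one partition at a
  // time and writes each element as a line of a file on the driver's disk.
  def writeOnDriver(rdd: RDD[String], target: Path): Unit = {
    val writer = new PrintWriter(target.toFile)
    try rdd.toLocalIterator.foreach(line => writer.println(line))
    finally writer.close()
  }

  def main(args: Array[String]): Unit = {
    // local[2] is only an assumption so the sketch runs on one machine.
    val spark = SparkSession.builder()
      .appName("driver-local-write")
      .master("local[2]")
      .getOrCreate()
    try {
      val lines = spark.sparkContext
        .parallelize(1 to 100, numSlices = 4)
        .map(_.toString)
      val target = Files.createTempFile("driver-output", ".txt")
      writeOnDriver(lines, target)
      println(s"Wrote ${Files.readAllLines(target).size} lines to $target")
    } finally spark.stop()
  }
}
```

Because toLocalIterator fetches one partition at a time, the driver only needs enough memory for the largest partition rather than for the whole dataset, which is the main reason to prefer it over collect for this kind of job.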


Read Post

Working with Cassandra UDTs and Spark Dataframes


We just fixed a bug which was stopping DataFrames from being able to write into Cassandra UDTs, but I noticed there aren’t a lot of great documents on how this works. Here is a quick example of how you can make a DataFrame which can insert into a C* UDT.


Read Post

Folding with Spark


I felt the need to write this post after reading a blog post that did a great job of explaining how fold and foldByKey work. The only thing I thought was missing from that rundown was a bit of detail on how these operations differ from their Scala counterparts.
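One commonly noted difference, sketched here without Spark and not necessarily the one the post focuses on: Spark's RDD fold is documented to apply the zero value once per partition and once more when merging the partition results, while a fold on a Scala collection applies it exactly once. The partitionedFold helper below is hypothetical and only imitates that behaviour on nested Seqs.

```scala
object FoldSketch {
  // Hypothetical helper, not part of Spark: folds every "partition" from
  // `zero`, then folds the partial results starting from `zero` again.
  def partitionedFold[A](partitions: Seq[Seq[A]], zero: A)(op: (A, A) => A): A = {
    val partials = partitions.map(_.fold(zero)(op))
    partials.fold(zero)(op)
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(1, 2, 3, 4)

    // Plain Scala fold: the zero value (1) is added a single time.
    val local = data.fold(1)(_ + _)

    // Two partitions: the zero value is added once per partition and
    // once more in the final merge, so three times in total.
    val split = partitionedFold(Seq(Seq(1, 2), Seq(3, 4)), 1)(_ + _)

    println(s"local fold: $local, partitioned fold: $split")
  }
}
```

With a zero value that is a true identity for the operation (0 for addition, for example) both versions agree, which is why a non-neutral zero value is the usual source of surprises.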

Read Post

Exploring Tombstone Behavior with CQL on Cassandra 2.0 and 2.1

Cassandra 2.1, running ./TombstoneExperiment.sh from ~/repos/RussellSpitzer.github.io/ExampleScripts

Read Post

Loading a CassandraRDD into a HiveContext in Spark

Spark is awesome and I love it. SparkSQL is also awesome but unfortunately is not fully mature. Although the folks at Databricks have talked about how it will eventually become a full ANSI SQL language, that time is honestly far off. This means that most folks will want to fall back on HiveQL for their more complicated queries on Spark.

Read Post