Some guy's blog
I had a great time at DataStax Accelerate and got asked a lot of great questions about Cassandra and Spark. I’ll post some of the most common ones and their answers here for posterity.
One of the greatest features of Spark is the DataFrame API. It’s amazing and available in a variety of languages: R, Python, Java, Scala, and there is even work being done in .NET! One of the drawbacks of RDDs was that, depending on the language you chose, your performance could be dramatically different (Scala and Java = fast; Python = slow), but with DataFrames every language will perform nearly identically (with some caveats).
“My output worked in local mode, but now it’s all gone … where is it?”
In DSE 6.0 we bring an exciting new optimization to Apache Spark’s Catalyst engine. Previously, when doing a join against a Cassandra table, Catalyst would be forced in all cases to perform a full table scan. In some cases this can be extremely inefficient compared to doing point lookups against a Cassandra table. In the RDD API we added a function, joinWithCassandraTable, which allows doing this optimized join, but prior to DSE 6.0 there was no way to use it in Catalyst. Now in 6.0 a joinWithCassandraTable is performed automatically in SparkSQL and DataFrames.
Did you know Spark has a 2GB architectural limit on certain memory structures? I didn’t. Then I was helpfully pointed to SPARK-6235, which points out that there are several places in the Spark code which use byte arrays and byte buffers. These objects are sized with an int, which means anything larger than the maximum int value (just under 2GB) will cause failures. In practice this usually means a user running into this issue will need to fix their data’s partitioning.
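To make the arithmetic behind that limit concrete, here is a small back-of-the-envelope sketch in plain Python. It is not Spark code: the helper names and the 128 MiB target partition size are assumptions chosen for illustration, and the only fixed number is the maximum value of a signed 32-bit int.

```python
# Largest value a signed 32-bit int can hold; a byte array sized with
# an int cannot be bigger than this many bytes.
JAVA_INT_MAX = 2**31 - 1


def fits_in_int_sized_buffer(num_bytes):
    """True if a block of this size can be held in a single int-sized byte array."""
    return num_bytes <= JAVA_INT_MAX


def min_partitions(total_bytes, target_partition_bytes=128 * 1024**2):
    """Smallest partition count that keeps each partition at or under the target.

    The 128 MiB default is an illustrative assumption, not a Spark setting.
    """
    if target_partition_bytes > JAVA_INT_MAX:
        raise ValueError("target partition size itself exceeds the int limit")
    # Ceiling division using integers only, to avoid float rounding.
    return max(1, -(-total_bytes // target_partition_bytes))


three_gib = 3 * 1024**3
ten_gib = 10 * 1024**3

too_big = not fits_in_int_sized_buffer(three_gib)  # a single 3 GiB block is over the limit
needed = min_partitions(ten_gib)                   # 10 GiB split into 128 MiB pieces
```

Under these assumptions, a dataset that would otherwise land in a handful of oversized partitions gets spread across enough partitions that no single block approaches the limit.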
Apache Spark is the data processing system that lets you have in-memory analytics and data locality, but how does that work? How does Spark know where the data is? The Spark UI is full of messages talking about NODE_LOCAL, ANY or PROCESS_LOCAL. What are these things and how do they get set? Let’s explore how Spark handles data locality when processing data.
One of the biggest issues with working with Spark and Cassandra is dealing with large partitions. There are several issues we need to overcome before we can really handle the challenge well. I’m going to use this blog post as a way of formalizing my thoughts on the issue. Let’s get into the details.
Spark (SQL) Thrift Server is an excellent tool built on HiveServer2 for allowing multiple remote clients to access Spark. It provides a generic JDBC endpoint that lets any client, including BI tools, connect and access the power of Spark. Let’s talk about how it came to be and why you should use it.
PySpark provides a great wrapper for DataFrame access but does come with a few little quirks. One that you may run into is trying to pass options to a DataFrame which include punctuation in their key names.
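The quirk comes from Python itself: keyword argument names must be valid identifiers, so a key containing a dot cannot be written as `name=value`. The sketch below shows the usual workaround, unpacking a dict with `**`, without needing a Spark session. The `options` function here is a hypothetical stand-in for any method that accepts `**kwargs`, and `some.dotted.key` is a made-up option name.

```python
def options(**kwargs):
    """Hypothetical stand-in for a method that collects keyword options."""
    return dict(kwargs)


# Writing the dotted key directly as a keyword argument is not valid Python.
# compile() lets us show that without crashing the script.
try:
    compile("options(some.dotted.key='64')", "<example>", "eval")
    keyword_form_is_valid = True
except SyntaxError:
    keyword_form_is_valid = False

# Workaround: put the options in a dict and unpack it with **.
# String keys that are not valid identifiers are accepted this way.
conf = {
    "some.dotted.key": "64",  # made-up option name with punctuation
    "table": "tab",           # plain keys work the same way
}
collected = options(**conf)
```

With a real reader, the same dict could be unpacked into `DataFrameReader.options(**conf)`, or each entry could be passed one at a time through `.option(key, value)`, which takes the key as an ordinary string.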