Some guy's blog
“My output worked in local mode, but now it’s all gone … where is it?”
In DSE 6.0 we bring an exciting new optimization to Apache Spark’s Catalyst engine. Previously, when joining against a Cassandra table, Catalyst was forced in all cases to perform a full table scan, which can be extremely inefficient compared to doing point lookups against the table. The RDD API has long offered a function, joinWithCassandraTable, which performs this optimized join, but prior to DSE 6.0 there was no way to use it from Catalyst. Now the joinWithCassandraTable optimization is applied automatically in SparkSQL and DataFrames.
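To make the difference concrete, here is a minimal PySpark sketch of the DataFrame side of such a join; the keyspace, table, and column names are hypothetical, and it assumes the Spark Cassandra Connector is on the classpath:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read a Cassandra table through the connector ("test_ks" and "users" are
# hypothetical names used only for illustration).
users = (spark.read
         .format("org.apache.spark.sql.cassandra")
         .options(keyspace="test_ks", table="users")
         .load())

# A small DataFrame of keys we want to look up.
lookups = spark.createDataFrame([(1,), (2,), (3,)], ["user_id"])

# Before DSE 6.0, this join required a full scan of "users"; with the new
# Catalyst optimization it can instead be executed as point lookups when the
# join keys cover the table's partition key.
lookups.join(users, "user_id").show()
```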
Did you know Spark has a 2GB architectural limit on certain memory structures? I didn’t, until I was helpfully pointed to SPARK-6235, which notes that several places in the Spark code use byte arrays and byte buffers. These objects are sized with an Int, so anything larger than MAX_INT bytes will cause failures. In practice this usually means a user running into this issue will need to fix their data’s partitioning.
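As a sketch of the usual fix, repartitioning spreads the same rows across more, smaller partitions; the counts below are hypothetical, and should be sized so no single partition or shuffle block approaches 2GB:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 10_000_000)  # stand-in for a real, much larger dataset

# Spread the rows over more partitions so each serialized block stays well
# under the 2GB (MAX_INT bytes) ceiling. 400 is a hypothetical count; aim for
# partitions in the tens-to-hundreds of MB.
df = df.repartition(400)

# Shuffles produced by Spark SQL take their partition count from this setting:
spark.conf.set("spark.sql.shuffle.partitions", "400")
```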
Scala 2.11 from here on out!
Apache Spark is the data processing system that lets you have in-memory analytics and data locality, but how does that work? How does Spark know where the data is? The Spark UI is full of messages talking about NODE_LOCAL, ANY or PROCESS_LOCAL. What are these things and how do they get set? Let’s explore how Spark handles data locality when processing data.
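For a taste of the knobs involved, these are the standard locality-wait settings that control how long the scheduler holds a task hoping for a better locality level before falling back; the values shown are Spark's documented defaults:

```python
from pyspark.sql import SparkSession

# How long the scheduler waits for a slot at a given locality level before
# falling back to the next one (PROCESS_LOCAL -> NODE_LOCAL -> RACK_LOCAL -> ANY).
# "3s" is Spark's default; the per-level keys override the base value.
spark = (SparkSession.builder
         .config("spark.locality.wait", "3s")
         .config("spark.locality.wait.process", "3s")
         .config("spark.locality.wait.node", "3s")
         .getOrCreate())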
One of the biggest issues with working with Spark and Cassandra is dealing with large partitions. There are several issues we need to overcome before we can really handle the challenge well. I’m going to use this blog post as a way of formalizing my thoughts on the issue. Let’s get into the details.
Spark (SQL) Thrift Server is an excellent tool built on HiveServer2 for allowing multiple remote clients to access Spark. It provides a generic JDBC endpoint that lets any client, including BI tools, connect and access the power of Spark. Let’s talk about how it came to be and why you should use it.
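As a quick illustration of the “any client” claim, here is a sketch that connects from plain Python using the third-party PyHive package (an assumption: it must be installed separately), against a Thrift Server on its usual default port 10000:

```python
from pyhive import hive  # third-party package: pip install pyhive

# Connect to the Spark Thrift Server over the HiveServer2 protocol.
# Host, port, and username are assumptions for this sketch.
conn = hive.connect(host="localhost", port=10000, username="spark")
cursor = conn.cursor()
cursor.execute("SHOW TABLES")
for row in cursor.fetchall():
    print(row)
conn.close()
```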
Pyspark provides a great wrapper for DataFrame access but does come with a few little quirks. One you may run into is trying to pass options whose key names include punctuation to a DataFrame reader.
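A sketch of the workaround: keys like spark.cassandra.connection.host contain dots and therefore can’t be written as Python keyword arguments, but a dict can be unpacked instead (the host, keyspace, and table names here are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "spark.cassandra.connection.host" is not a valid Python identifier, so
# .options(spark.cassandra.connection.host=...) is a syntax error.
# Unpacking a dict sidesteps the restriction:
df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(**{"keyspace": "test_ks",          # hypothetical names
                  "table": "users",
                  "spark.cassandra.connection.host": "127.0.0.1"})
      .load())

# .option() takes the key as a string, so it works as well:
# .option("spark.cassandra.connection.host", "127.0.0.1")
```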
How does Spark actually execute code, and how can I do concurrent work within a Spark execution pipeline? Spark is complicated, concurrency is complicated, and distributed systems are also complicated. To answer some common questions I’m going to go over some basic details on what Spark is doing and how you can control it.
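As one concrete pattern (a sketch, not the post’s full treatment), mapPartitions lets a task fan its records out to a local thread pool, so I/O-bound work can overlap inside a single Spark task:

```python
from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

def slow_lookup(x):
    # Placeholder for I/O-bound work (an HTTP call, a database read, ...).
    return x * 2

def process_partition(rows):
    # Each Spark task handles one partition; inside it we can still run our
    # own concurrency. 4 is a hypothetical pool size.
    with ThreadPoolExecutor(max_workers=4) as pool:
        yield from pool.map(slow_lookup, rows)

result = sc.parallelize(range(100), numSlices=8).mapPartitions(process_partition)
print(result.take(5))
```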