Russell Spitzer's Blog

Some guy's blog

How Spark Data Locality Works

Apache Spark is a data processing system that offers in-memory analytics and data locality, but how does that work? How does Spark know where the data is? The Spark UI is full of messages mentioning NODE_LOCAL, ANY, or PROCESS_LOCAL. What do these levels mean, and how do they get set? Let’s explore how Spark handles data locality when processing data.

Read Post

Dealing with Large Spark Partitions

One of the biggest challenges when working with Spark and Cassandra is dealing with large partitions. There are several problems we need to overcome before we can handle them well. I’m going to use this blog post as a way of formalizing my thoughts on the issue. Let’s get into the details.

Read Post

Spark Thrift Server Basics and a History

Spark (SQL) Thrift Server is an excellent tool, built on HiveServer2, that allows multiple remote clients to access Spark. It provides a generic JDBC endpoint that lets any client, including BI tools, connect and access the power of Spark. Let’s talk about how it came to be and why you should use it.

Read Post

Passing Spark Cassandra Connector Options in PySpark

PySpark provides a great wrapper for DataFrame access, but it does come with a few quirks. One that you may run into is trying to pass a DataFrame options whose key names include punctuation.
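The quirk can be sketched in plain Python, with no Spark installation required. This is only an illustration under stated assumptions: `options` below is a hypothetical stand-in for a method that accepts `**options` (as PySpark's `DataFrameReader.options` does), and the dotted key name is just an example of a key containing punctuation.

```python
# Hypothetical stand-in for a method that takes **options: it simply
# collects whatever keyword arguments it receives into a dict.
def options(**opts):
    return dict(opts)

# Example key containing dots (illustrative, not taken from the post).
dotted_key = "spark.cassandra.input.split.sizeInMB"

# A dotted name is not a valid Python identifier, so writing it directly
# as a keyword argument fails when the source is compiled.
try:
    compile('options(%s="64")' % dotted_key, "<example>", "eval")
    keyword_form_compiles = True
except SyntaxError:
    keyword_form_compiles = False

# Unpacking a dict with ** passes arbitrary string keys through unchanged.
passed = options(**{dotted_key: "64", "table": "kv", "keyspace": "test"})

print(keyword_form_compiles)
print(passed[dotted_key])
```

With a real reader, the same dict-unpacking form would be written as `spark.read.format(...).options(**my_options)`, where `my_options` is an ordinary dict keyed by the full option names.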

Read Post

Scratch Project - Gravity Chomper

Project walk-through for the Scratch class taught at The Made. A simple character moves around eating floating objects. Gravity can be controlled to either attract or repel nearby objects.

Read Post

Concurrency in Spark

How does Spark actually execute code, and how can I do concurrent work within a Spark execution pipeline? Spark is complicated, concurrency is complicated, and distributed systems are also complicated. To answer some common questions, I’m going to go over some basic details of what Spark is doing and how you can control it.

Read Post

Ordering with saveToCassandra

Responding to a question on Stack Overflow: http://stackoverflow.com/questions/42020173/savetocassandra-is-there-any-ordering-in-which-the-rows-are-written/42031425#42031425

Read Post

Spark Applications are Fat

Spark Submit is great and you should use it.

Read Post

Distributed Locks are Hard

I was recently answering a Stack Overflow question that got me thinking about locking and some of the assumptions made in distributed systems. The question involved what I find to be a pretty common error in distributed systems, particularly with Cassandra.

Read Post

Is the Spark Cassandra Connector Asynchronous / How Spark Works

Are the Spark Cassandra Connector APIs async?

Read Post