Some guy's blog
Making sure your queries are actually pushing predicates down to C* can be slightly confusing. In this post we’ll go over the basics of setting up debug logging and how to work around a few common issues.
Let’s start with a common table schema:
CREATE TABLE test.common (
year int,
birthday timestamp,
userid uuid,
likes text,
name text,
PRIMARY KEY (year, birthday, userid)
);
And I’ll insert a few junk rows
year | birthday | userid | likes | name
------+--------------------------+--------------------------------------+---------+-------------------
1985 | 1985-02-05 00:00:00+0000 | 74cc615c-05b9-11e6-b512-3e1d05defe78 | soccer | Cristiano Ronaldo
1980 | 1980-10-03 00:00:00+0000 | 74cc615c-05b9-11e6-b512-3e1d05defe79 | twitter | Kim Kardashian
This table is perfect for satisfying OLTP lookups for my Favorite Birthday Gifts archive website (although perhaps a bit heavy at a single partition per year). What’s important is that we should be able to determine the likes of anyone whose birthday falls within a given year. With OLAP we should be able to search globally over all years, and we should be able to push down and use the birthday clustering key.
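As a rough mental model of what that means for this particular schema, here is a toy Python sketch, my own illustration and NOT the connector’s actual code (the real rules are more involved, handling IN on partition keys, later clustering columns, and secondary indexes):

```python
# Toy model of predicate pushdown for test.common
# (partition key: year; clustering keys: birthday, userid).
PARTITION = {"year"}
FIRST_CLUSTERING = "birthday"

def where_handled(column, op):
    """Return 'cassandra' if a predicate on `column` could be pushed
    down to C*, else 'spark'. A simplification for illustration only."""
    # Partition-key columns: equality (or IN) lookups only.
    if column in PARTITION and op in ("=", "IN"):
        return "cassandra"
    # The first clustering column also supports range predicates.
    if column == FIRST_CLUSTERING and op in ("=", "<", ">", "<=", ">="):
        return "cassandra"
    # Anything else is filtered by Spark after the scan.
    return "spark"

print(where_handled("birthday", "<"))  # -> cassandra
print(where_handled("likes", "="))     # -> spark (regular, unindexed column)
```

The point is just that a range on birthday is exactly the kind of predicate we should expect to see handed to C* in the logs below.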
To minimize the noise in the console, we will set the root logger to WARN and the connector’s logger to DEBUG.
cp conf/log4j.properties.template conf/log4j.properties
# Edit log4j.properties:
#   change the root logger from INFO to WARN
#   and add
#   log4j.logger.org.apache.spark.sql.cassandra=DEBUG
Let’s start up spark-sql and register this table
./bin/spark-sql \
--packages datastax:spark-cassandra-connector:1.6.0-M2-s_2.10 \
--conf spark.cassandra.connection.host=localhost
CREATE TEMPORARY TABLE common
USING org.apache.spark.sql.cassandra
OPTIONS ( table "common", keyspace "test");
Now when we query, we should get some basic info about how the connector handles pushdowns:
spark-sql> select * from common;
16/04/18 17:13:17 INFO CassandraSourceRelation: Input Predicates: []
16/04/18 17:13:17 DEBUG CassandraSourceRelation: Basic Rules Applied:
C* Filters: []
Spark Filters []
16/04/18 17:13:17 DEBUG CassandraSourceRelation: Final Pushdown filters:
C* Filters: []
Spark Filters []
16/04/18 17:13:17 INFO CassandraSourceRelation: Input Predicates: []
16/04/18 17:13:17 DEBUG CassandraSourceRelation: Basic Rules Applied:
C* Filters: []
Spark Filters []
16/04/18 17:13:17 DEBUG CassandraSourceRelation: Final Pushdown filters:
C* Filters: []
Spark Filters []
1985 1985-02-04 16:00:00 74cc615c-05b9-11e6-b512-3e1d05defe78 soccer Cristiano Ronaldo
1980 1980-10-02 17:00:00 74cc615c-05b9-11e6-b512-3e1d05defe79 twitter Kim Kardashian
These empty brackets [] let us know there were no predicates to be handled by either Spark or C*. For a select * that makes sense, so let’s try adding a predicate to find everyone with a birthday before January 1, 2001:
spark-sql> select * from common where birthday < '2001-1-1';
16/04/18 17:16:14 INFO CassandraSourceRelation: Input Predicates: []
16/04/18 17:16:14 DEBUG CassandraSourceRelation: Basic Rules Applied:
C* Filters: []
Spark Filters []
16/04/18 17:16:14 DEBUG CassandraSourceRelation: Final Pushdown filters:
C* Filters: []
Spark Filters []
16/04/18 17:16:14 INFO CassandraSourceRelation: Input Predicates: []
16/04/18 17:16:14 DEBUG CassandraSourceRelation: Basic Rules Applied:
C* Filters: []
Spark Filters []
16/04/18 17:16:14 DEBUG CassandraSourceRelation: Final Pushdown filters:
C* Filters: []
Spark Filters []
Time taken: 0.56 seconds
Well that’s odd, no predicates … what’s going on here? Let’s use explain to see what Spark’s plan was for executing this query.
explain select * from common where birthday < '2001-1-1';
Filter (cast(birthday#1 as string) < 2001-1-1)
+- Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@3d3e9163[year#0,birthday#1,userid#2,likes#3,name#4]
Time taken: 0.058 seconds, Fetched 3 row(s)
The first line Filter (cast(birthday#1 as string) < 2001-1-1)
tells a sad story. Catalyst took
a look at this query and said
'2001-1-1' looks like a string. I'll make birthday a string too! Then I can compare it lexically.
Everyone will love me for making this great decision!
Since it has to do a cast
operation on the column, Catalyst won’t even let
C* know that there is a possible predicate here to compare with. So how do we let Spark know
that the literal 2001-1-1
is a date and not just a random string?
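To see why that cast matters, here is a plain Python illustration (nothing Spark-specific) of how a lexical comparison against an unpadded date literal goes wrong:

```python
from datetime import datetime

literal = "2001-1-1"       # the unpadded literal from the query above
birthday = "2001-09-01"    # a birthday that is clearly AFTER Jan 1, 2001

# What Catalyst's cast-to-string does: a lexical comparison.
# At the month position '0' sorts before '1', so the string compare
# claims this birthday comes BEFORE the literal.
print(birthday < literal)  # -> True (wrong!)

# What we actually want: a timestamp comparison.
print(datetime.strptime(birthday, "%Y-%m-%d")
      < datetime.strptime(literal, "%Y-%m-%d"))  # -> False (correct)
```

So beyond blocking the pushdown, the string comparison can silently return wrong rows for unpadded literals.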
What we can do is cast the literal to a timestamp! This lets Catalyst know that it doesn’t have to do any casts on the database column. The predicate is now passed down to Cassandra, and no extra filtering has to be done in Spark.
spark-sql> select * from common where birthday < cast('2001-1-1' as timestamp);
16/04/18 17:36:15 INFO CassandraSourceRelation: Input Predicates: [LessThan(birthday,2001-01-01 00:00:00.0)]
16/04/18 17:36:15 DEBUG CassandraSourceRelation: Basic Rules Applied:
C* Filters: [LessThan(birthday,2001-01-01 00:00:00.0)]
Spark Filters []
16/04/18 17:36:15 DEBUG CassandraSourceRelation: Final Pushdown filters:
C* Filters: [LessThan(birthday,2001-01-01 00:00:00.0)]
Spark Filters []
16/04/18 17:36:15 INFO CassandraSourceRelation: Input Predicates: [LessThan(birthday,2001-01-01 00:00:00.0)]
16/04/18 17:36:15 DEBUG CassandraSourceRelation: Basic Rules Applied:
C* Filters: [LessThan(birthday,2001-01-01 00:00:00.0)]
Spark Filters []
16/04/18 17:36:15 DEBUG CassandraSourceRelation: Final Pushdown filters:
C* Filters: [LessThan(birthday,2001-01-01 00:00:00.0)]
Spark Filters []
1985 1985-02-04 16:00:00 74cc615c-05b9-11e6-b512-3e1d05defe78 soccer Cristiano Ronaldo
1980 1980-10-02 17:00:00 74cc615c-05b9-11e6-b512-3e1d05defe79 twitter Kim Kardashian
Time taken: 0.127 seconds, Fetched 2 row(s)
We also see that our C* Filters now have an entry, showing we have correctly pushed the timestamp predicate down to our Cassandra query. We’ll be saving time and avoiding any bad lexical comparisons on timestamps :)