2015-11-07

Spring Data Cassandra vs. Native Driver

Intro

Spring Data with Cassandra has been getting more and more popular for some time now. My main concern with the framework is its performance characteristics when compared to the native CQL driver. After all, with the driver everything is under your control and one can probably squeeze much more juice out of the cluster. OK, I admit it's not always about performance. If that were the case we would all be writing software in C or assembler. But still, I think it's good practice to be aware of the drawbacks.

To be honest, Spring Data Cassandra is relatively new to me. I did the performance comparison at the lowest level, without using repositories and other high-level concepts that come with Spring Data Cassandra. My focus in this post is more on the generics that decode the data that comes out of the driver. To make the comparison I'm going to use a simple Cassandra table (skinny row), then run query after query (5000 and 10000 of them) against Cassandra and decode the results. Once again, the focus of this post is not on the performance characteristics of higher-order functionality like paged queries. I just wanted a rule-of-thumb idea of what I can expect from Spring Data Cassandra.

Setup

    -- simple skinny row
    CREATE TABLE activities (
        activity_id uuid,
        activity_model_id bigint,
        activity_state text,
        asset_id text,
        attrs map<text, text>,
        creation_time timestamp,
        customer_id text,
        end_time timestamp,
        last_modified_time timestamp,
        person_id text,
        poi_id text,
        start_time timestamp,
        PRIMARY KEY (activity_id)
    );

    
To eliminate all other possible effects, I used just a single skinny row:
    activity_id 72b493f0-e59d-11e3-9bd6-0050568317c1
    activity_model_id 66
    activity_state DONE
    asset_id 8400848739855200000
    attrs {
        'businessDrive': '1:1',
        'customer': '4:test_test_test',
        'distance': '3:180', 
        'endLocation': '6:15.7437466839,15.9846853333,0.0000000000',
        'fromAddress': '4:XX1', 
        'locked': '1:0', 
        'reason': '4:Some reason 2', 
        'startLocation': '6:15.7364385831,15.0071729736,0.0000000000',
        'toAddress': '4:YY2'
        }
    creation_time 2014-05-27 14:50:14+0200
    customer_id 8400768435301400000
    end_time 2014-05-27 12:15:40+0200
    last_modified_time 2014-05-29 21:30:44+0200
    person_id 8401111750365200000
    poi_id null
    start_time 2014-05-27 12:13:05+0200
    
This row is fetched every time; to detect differences we'll see how long the iterations take. Network and cluster effects are also out of scope, so everything was tested against a locally running DataStax Cassandra Community (2.0.16) instance.
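The measurement loop itself is simple. As a rough sketch (this is not the code from the gists), timing N sequential iterations in Java could look like the following. `QueryTimer`, `timeMillis` and the `fetchAndDecode` supplier are names made up for illustration; the supplier stands in for "query Cassandra and decode the row", and the placeholder in `main` does not talk to a cluster at all.

```java
import java.util.function.Supplier;

/** Minimal timing harness: runs one operation N times and reports elapsed milliseconds. */
final class QueryTimer {

    /**
     * Calls the fetch-and-decode step `iterations` times, one call after another,
     * and returns the elapsed wall-clock time in milliseconds.
     */
    static <T> long timeMillis(int iterations, Supplier<T> fetchAndDecode) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            T decoded = fetchAndDecode.get();
            if (decoded == null) {
                // Fail loudly so an empty result cannot look like a fast run.
                throw new IllegalStateException("no row decoded at iteration " + i);
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // Assumption: placeholder workload. In a real test this would execute the
        // query and decode the row, either with the driver or with spring-data.
        Supplier<String> placeholder = () -> "72b493f0-e59d-11e3-9bd6-0050568317c1";
        for (int run = 1; run <= 3; run++) {
            System.out.println("run " + run + ": " + timeMillis(5000, placeholder) + " ms");
        }
    }
}
```

Running each candidate through the same loop three times and averaging gives numbers in the shape of the result listings in this post.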

The code

To keep interfering effects apart, I used two separate projects. I once had a situation where I used the old Thrift API together with the CQL driver and it significantly affected performance. It also required additional configuration parameters and so on. The main code snippets are located on gist. This is not the focus here, but if somebody is interested:

spring-data
native-drivers

Results in milliseconds

    3 fields - 5000 items
        spring-data
        5381
        5282
        5385
        avg: 5339

        driver
        4426
        4280
        4469
        avg: 4390

        result: driver faster 21.6%

    3 fields - 10000 items
        spring-data
        8560
        8133
        8144
        avg: 8279

        driver
        6822
        6770
        6875
        avg: 6822
        
        result: driver faster 21.3%

    12 fields - 5000 items
        spring-data
        5911
        5920
        5928
        avg: 5920 - 10.88 % slower than with 3 fields!

        driver
        4687
        4669
        4606
        avg: 4654 - 6 % slower than with 3 fields

        result: driver faster 27%
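For reference, the "driver faster" percentages appear to be (slower - faster) / faster, computed on the averages. Here is a small sketch of that arithmetic using the 12-field runs; `SpeedComparison`, `average` and `percentFaster` are names made up for illustration.

```java
import java.util.Locale;

/** Arithmetic behind the "driver faster N%" lines, under the assumption stated above. */
final class SpeedComparison {

    /** Arithmetic mean of the measured run times, in milliseconds. */
    static double average(long... runsMillis) {
        long sum = 0;
        for (long run : runsMillis) {
            sum += run;
        }
        return (double) sum / runsMillis.length;
    }

    /** How much faster the quicker candidate is, as a percentage of its own time. */
    static double percentFaster(double slowMillis, double fastMillis) {
        return (slowMillis - fastMillis) / fastMillis * 100.0;
    }

    public static void main(String[] args) {
        double springData = average(5911, 5920, 5928); // 12 fields, 5000 items
        double driver = average(4687, 4669, 4606);
        // prints "driver faster 27%"
        System.out.println(String.format(Locale.ROOT, "driver faster %.0f%%",
                percentFaster(springData, driver)));
    }
}
```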

Conclusions

Spring Data Cassandra may be very interesting if you want to learn something new. It might also have a very positive effect on development when prototyping or doing something similar. I didn't test the higher-order functionality like pagination. This was just a rule-of-thumb test to see what to expect. Basically, the bigger the classes you have to decode, the bigger the deserialization cost: going from 3 to 12 fields slowed spring-data by about 11% and the driver by about 6%. At least this is the effect I'm noticing in my basic tests.
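To make the "bigger classes cost more to decode" point a bit more concrete, here is a toy comparison of hand-written decoding and a generic, reflection-based decoder. This is an illustration only and an assumption on my side: it is not how Spring Data Cassandra or the driver is implemented, and I have not measured whether reflection is what causes the gap. `DecodeSketch`, `Activity`, `decodeByHand` and `decodeGeneric` are made-up names, and a plain `Map` stands in for a driver row.

```java
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

/** Toy contrast between explicit decoding and generic, per-field reflective decoding. */
final class DecodeSketch {

    /** Stand-in for a decoded row with three of the table's columns. */
    static final class Activity {
        String activityState;
        String assetId;
        String customerId;
    }

    /** Hand-written decoding: one explicit read per column. */
    static Activity decodeByHand(Map<String, Object> row) {
        Activity a = new Activity();
        a.activityState = (String) row.get("activityState");
        a.assetId = (String) row.get("assetId");
        a.customerId = (String) row.get("customerId");
        return a;
    }

    /** Generic decoding: walks the class reflectively, so the work grows with the field count. */
    static <T> T decodeGeneric(Map<String, Object> row, Class<T> type) {
        try {
            T target = type.getDeclaredConstructor().newInstance();
            for (Field field : type.getDeclaredFields()) {
                if (field.isSynthetic()) {
                    continue; // skip compiler- or tool-generated fields
                }
                field.setAccessible(true);
                field.set(target, row.get(field.getName()));
            }
            return target;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot decode " + type.getName(), e);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("activityState", "DONE");
        row.put("assetId", "8400848739855200000");
        row.put("customerId", "8400768435301400000");
        System.out.println(decodeByHand(row).activityState);
        System.out.println(decodeGeneric(row, Activity.class).activityState);
    }
}
```

Both variants produce the same object; the generic one just does extra bookkeeping per field, which is one plausible place for per-field overhead to come from.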

Follow up with Object Mapping available in Cassandra driver 2.1

There was an interesting follow-up discussion on reddit. Following a proposal from reddit user v_krishna, another candidate was added to the comparison: the Object-mapping API.

Let's see the results:

    3 fields - 5000 items
        spring-data
        5438
        5453
        5576
        avg: 5489

        object-map
        5390
        5299
        5476
        avg: 5388

        driver
        4382
        4410
        4249
        avg: 4347

    conclusion
        - driver 26% faster than spring data
        - object map just under 2% faster than spring data

    3 fields - 10000 items
        spring-data
        8792
        8507
        8473
        avg: 8591

        object-map
        8435
        8494
        8365
        avg: 8431

        driver
        6632
        6760
        6646
        avg: 6679

    conclusion
        - driver 28.6% faster than spring data
        - object mapping just under 2% faster than spring data

    12 fields - 5000 items
        spring-data
        6193
        5999
        5938
        avg: 6043

        object-map
        6062
        5936
        5911
        avg: 5970

        driver
        4910
        4955
        4596
        avg: 4820

    conclusion
        - driver 25% faster than spring data
        - object mapping 1.2% faster than spring data

To keep everything fair: there was some deviation between these test runs and the previous ones. Here are the deviations:

comparison with first run:
    3 fields - 5000 items
        spring-data
        avg1: 5339
        avg2: 5489
        2.7% deviation

        driver
        avg1: 4390
        avg2: 4347
        1% deviation

    3 fields - 10000 items
        spring-data
        avg1: 8279
        avg2: 8591
        3.6% deviation

        driver
        avg1: 6822
        avg2: 6679
        2.1% deviation

    12 fields - 5000 items
        spring-data
        avg1: 5920
        avg2: 6043
        2% deviation

        driver
        avg1: 4654
        avg2: 4820
        3.4% deviation

Object mapping from Spring Data seems to be just a bit slower than the object mapping available in the new driver. I can't wait to see how the two compare in future versions. Initially I was expecting object mapping to perform around 5-10% worse than the plain driver. It surprised me a bit that the difference was more on the level of 25%. So if you are planning on using object mapping capabilities, there is a performance penalty.
