Intro
Spring Data with Cassandra has been getting more and more popular for some time now. My main concern with the framework is how it performs compared to the native CQL driver. After all, with the driver everything is under your control, and you can probably squeeze much more juice out of the cluster. OK, I admit it's not always about performance. If that were the case, we would all be writing software in C or assembler. Still, I think it's good practice to be aware of the drawbacks.
To be honest, Spring Data Cassandra is relatively new to me. I did the performance comparison at the lowest level, without using repositories and the other high-level concepts that come with Spring Data Cassandra. My focus in this post is on the generics that decode the data coming out of the driver. To make the comparison, I use a simple Cassandra table (skinny row), issue query after query (5000 and 10000 of them) against Cassandra, and then decode the results. Once again, this post is not about the performance of higher-order functionality such as paged queries. I just wanted a rule-of-thumb idea of what to expect from Spring Data Cassandra.
Setup
-- simple skinny row
CREATE TABLE activities (
activity_id uuid,
activity_model_id bigint,
activity_state text,
asset_id text,
attrs map<text, text>,
creation_time timestamp,
customer_id text,
end_time timestamp,
last_modified_time timestamp,
person_id text,
poi_id text,
start_time timestamp,
PRIMARY KEY (activity_id)
);
To rule out other possible effects, I used just a single skinny row:
activity_id 72b493f0-e59d-11e3-9bd6-0050568317c1
activity_model_id 66
activity_state DONE
asset_id 8400848739855200000
attrs {
'businessDrive': '1:1',
'customer': '4:test_test_test',
'distance': '3:180',
'endLocation': '6:15.7437466839,15.9846853333,0.0000000000',
'fromAddress': '4:XX1',
'locked': '1:0',
'reason': '4:Some reason 2',
'startLocation': '6:15.7364385831,15.0071729736,0.0000000000',
'toAddress': '4:YY2'
}
creation_time 2014-05-27 14:50:14+0200
customer_id 8400768435301400000
end_time 2014-05-27 12:15:40+0200
last_modified_time 2014-05-29 21:30:44+0200
person_id 8401111750365200000
poi_id null
start_time 2014-05-27 12:13:05+0200
This row is fetched every time. To detect differences, we'll see how long the iterations take. Network and cluster effects are also out of scope, so everything was tested against a locally running DataStax Cassandra Community (2.0.16) instance.
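The measurement idea is simple: run the same single-row query a fixed number of times in sequence and record the wall-clock time for the whole loop. Here is a minimal sketch of such a loop. The class name `IterationTimer` and the `Runnable` standing in for "query one row and decode it" are illustrative assumptions; the code actually used is in the gists.

```java
import java.util.concurrent.TimeUnit;

/** Illustrative timing loop; the names are hypothetical and not taken from the gists. */
final class IterationTimer {

    /** Runs op sequentially the given number of times and returns the elapsed wall-clock milliseconds. */
    static long timeMillis(int iterations, Runnable op) {
        long start = System.nanoTime();      // monotonic clock, suited to measuring intervals
        for (int i = 0; i < iterations; i++) {
            op.run();                        // stand-in for "query one row, then decode it"
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }
}
```

It could be called as `IterationTimer.timeMillis(5000, () -> fetchAndDecode())`, where `fetchAndDecode` is a hypothetical method wrapping either candidate. A single timed loop like this includes JVM warm-up, so the absolute numbers are only a rule of thumb.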
The code
To keep the two candidates from interfering with each other, I used two separate projects. I once had a situation where I used the old Thrift API together with the CQL driver, and it significantly affected performance. It also required additional configuration parameters and so on. The main code
snippets are available as gists. They are not the focus here, but if somebody is interested:
spring-data
native-drivers
Results in milliseconds
3 fields - 5000 items
spring-data
5381
5282
5385
avg: 5339
driver
4426
4280
4469
avg: 4390
result: driver faster 21.6%
3 fields - 10000 items
spring-data
8560
8133
8144
avg: 8279
driver
6822
6770
6875
avg: 6822
result: driver faster 21.3%
12 fields - 5000 items
spring-data
5911
5920
5928
avg: 5920 - 10.88 % slower than with 3 fields!
driver
4687
4669
4606
avg: 4654 - 6 % slower than with 3 fields
result: driver faster 27%
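The "avg" and "driver faster" lines come down to two small calculations: the mean of three runs, and the gap between the two means relative to the driver's time. The sketch below reproduces them for the 12-field run. The class and method names are illustrative, and the percentage formula is inferred from the reported figures rather than taken from the benchmark code.

```java
/** Arithmetic behind the "avg" and "driver faster" lines; the names are illustrative. */
final class BenchStats {

    /** Arithmetic mean of the measured run times in milliseconds. */
    static double mean(long... runsMillis) {
        long sum = 0;
        for (long r : runsMillis) {
            sum += r;
        }
        return (double) sum / runsMillis.length;
    }

    /** How much faster the quicker candidate is, as a percentage of its own time. */
    static double percentFaster(double slowerMillis, double fasterMillis) {
        return (slowerMillis - fasterMillis) / fasterMillis * 100.0;
    }

    public static void main(String[] args) {
        double springData = mean(5911, 5920, 5928); // 12 fields, 5000 items
        double driver = mean(4687, 4669, 4606);
        System.out.printf("spring-data avg: %.0f ms%n", springData);
        System.out.printf("driver avg: %.0f ms%n", driver);
        System.out.printf("driver faster: %.1f%%%n", percentFaster(springData, driver));
    }
}
```

For this run the calculation agrees with the reported 27%.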
Conclusions
Spring Data Cassandra may be very interesting if you want to learn something new. It might also speed up development when prototyping or doing something similar. I didn't test higher-order functionality like pagination. This was just a rule-of-thumb test to see what to expect. Basically, the bigger the classes you have to decode, the bigger the deserialization cost. At least, that is the effect I'm noticing in my basic tests.
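One way to picture why decoding cost grows with class size, and why a mapping layer adds to it: a generic mapper has to find and set every field at run time, while hand-written code assigns each column directly. The sketch below only illustrates that idea. It uses a plain `Map` as a stand-in for a driver row, with keys named after the Java fields rather than the real column names. The `ActivitySummary` class and both decoders are hypothetical, and this is not how Spring Data Cassandra or the driver is actually implemented.

```java
import java.lang.reflect.Field;
import java.util.Map;

/** Hypothetical 3-field view of the activities table. */
class ActivitySummary {
    String activityState;
    String customerId;
    long activityModelId;
}

final class DecodeSketch {

    /** Hand-written decoding: one direct assignment per column. */
    static ActivitySummary decodeByHand(Map<String, Object> row) {
        ActivitySummary a = new ActivitySummary();
        a.activityState = (String) row.get("activityState");
        a.customerId = (String) row.get("customerId");
        a.activityModelId = (Long) row.get("activityModelId");
        return a;
    }

    /** Generic decoding: visits every declared field reflectively and sets it from the row. */
    static <T> T decodeGeneric(Map<String, Object> row, Class<T> type) {
        try {
            T target = type.getDeclaredConstructor().newInstance();
            for (Field f : type.getDeclaredFields()) {
                if (f.isSynthetic()) {
                    continue;                     // skip compiler-generated fields
                }
                f.setAccessible(true);
                f.set(target, row.get(f.getName())); // boxed values are unwrapped for primitive fields
            }
            return target;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot map row to " + type.getName(), e);
        }
    }
}
```

Both decoders do work proportional to the number of fields, and the generic one adds reflective lookups on top of each assignment.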
Follow-up with the object mapping available in Cassandra driver 2.1
There was an interesting follow-up discussion on reddit. At the suggestion of reddit user v_krishna, another candidate was added to the comparison: the driver's object-mapping API.
Let's see the results:
3 fields - 5000 items
spring-data
5438
5453
5576
avg: 5489
object-map
5390
5299
5476
avg: 5388
driver
4382
4410
4249
avg: 4347
conclusion
- driver 26% faster than spring data
- object map just under 2% faster than spring data
3 fields - 10000 items
spring-data
8792
8507
8473
avg: 8591
object-map
8435
8494
8365
avg: 8431
driver
6632
6760
6646
avg: 6679
conclusion
- driver 28.6% faster than spring data
- object mapping just under 2% faster than spring data
12 fields - 5000 items
spring-data
6193
5999
5938
avg: 6043
object-map
6062
5936
5911
avg: 5970
driver
4910
4955
4596
avg: 4820
conclusion
- driver 25% faster than spring data
- object mapping 1.2% faster than spring data
To keep everything fair: there was some deviation between these test runs and the previous ones. Here is the
comparison with the first run:
3 fields - 5000 items
spring-data
avg1: 5339
avg2: 5489
2.7% deviation
driver
avg1: 4390
avg2: 4347
1% deviation
3 fields - 10000 items
spring-data
avg1: 8279
avg2: 8591
3.6% deviation
driver
avg1: 6822
avg2: 6679
2.1% deviation
12 fields - 5000 items
spring-data
avg1: 5920
avg2: 6043
2% deviation
driver
avg1: 4654
avg2: 4820
3.4% deviation
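The deviation percentages above are consistent with taking the absolute difference of the two averages and dividing it by the larger one. That formula is inferred from the numbers and is not stated anywhere in the test code, so treat this sketch (with its hypothetical `RunDeviation` class) as an assumption:

```java
/** Deviation between two test rounds; the formula is inferred from the reported figures. */
final class RunDeviation {

    /** Absolute difference of the two averages as a percentage of the larger one. */
    static double deviationPercent(double avg1, double avg2) {
        return Math.abs(avg1 - avg2) / Math.max(avg1, avg2) * 100.0;
    }
}
```

For example, `RunDeviation.deviationPercent(5339, 5489)` gives roughly 2.7, matching the first spring-data entry.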
Object mapping from Spring Data seems to be just a bit slower than the object mapping available in the new driver. I can't wait
to see how the two compare in future versions. Initially I was expecting object mapping to be around 5-10% slower than
the plain driver. It surprised me a bit that the difference was more on the level of 25%. So if
you are planning on using object mapping capabilities, be aware that there is a performance penalty.