Intro
For some time now, Spring Data Cassandra has been getting more and more popular. My main concern with the framework is its performance characteristics compared to the native CQL driver. After all, with the driver everything is under your control, and one can probably squeeze much more juice out of the cluster. OK, I admit it's not always about performance. If that were the case, we would all be writing software in C or assembler. Still, I think it's good practice to be aware of the drawbacks.
To be honest, Spring Data Cassandra is relatively new to me. I did the performance comparison at the lowest level, without using repositories and the other high-level concepts that come with Spring Data Cassandra. My focus in this post is more on the generics that decode the data coming out of the driver. To make the comparison I'm going to use a simple Cassandra table (skinny row), fire query after query (5000 and 10000 of them) at Cassandra, and then decode the results. Once again, the focus of this post is not the performance of higher-order functionality like paged queries. I just wanted a rule-of-thumb idea of what to expect from Spring Data Cassandra.
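The measurement loop described above can be sketched roughly as follows. This is a minimal sketch and not the code from the gists: `QueryLoopTimer`, `timeMillis`, and the `fetch`/`decode` parameters are hypothetical names, and the fake supplier in `main` only stands in for a real query so the example runs without a cluster.

```java
import java.util.UUID;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical harness: names are illustrative, not taken from the gists.
final class QueryLoopTimer {

    // Runs fetch-then-decode `iterations` times and returns elapsed wall-clock milliseconds.
    static <R, T> long timeMillis(int iterations, Supplier<R> fetch, Function<R, T> decode) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            R raw = fetch.get();            // stand-in for executing the query
            T decoded = decode.apply(raw);  // stand-in for manual or mapper-based decoding
            if (decoded == null) {
                throw new IllegalStateException("decode returned null at iteration " + i);
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // Fake "query" that returns the activity_id as text; decoding parses it into a UUID.
        long elapsed = QueryLoopTimer.<String, UUID>timeMillis(
                5000,
                () -> "72b493f0-e59d-11e3-9bd6-0050568317c1",
                UUID::fromString);
        System.out.println("5000 iterations took " + elapsed + " ms");
    }
}
```

In a real run, `fetch` would wrap the driver or Spring Data call and `decode` the conversion into a domain object, so that both variants are timed by identical loop code.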
Setup
-- simple skinny row
CREATE TABLE activities (
    activity_id uuid,
    activity_model_id bigint,
    activity_state text,
    asset_id text,
    attrs map<text, text>,
    creation_time timestamp,
    customer_id text,
    end_time timestamp,
    last_modified_time timestamp,
    person_id text,
    poi_id text,
    start_time timestamp,
    PRIMARY KEY (activity_id)
);

To eliminate all possible side effects, I used just a single skinny row:
activity_id          72b493f0-e59d-11e3-9bd6-0050568317c1
activity_model_id    66
activity_state       DONE
asset_id             8400848739855200000
attrs                {
                       'businessDrive': '1:1',
                       'customer': '4:test_test_test',
                       'distance': '3:180',
                       'endLocation': '6:15.7437466839,15.9846853333,0.0000000000',
                       'fromAddress': '4:XX1',
                       'locked': '1:0',
                       'reason': '4:Some reason 2',
                       'startLocation': '6:15.7364385831,15.0071729736,0.0000000000',
                       'toAddress': '4:YY2'
                     }
creation_time        2014-05-27 14:50:14+0200
customer_id          8400768435301400000
end_time             2014-05-27 12:15:40+0200
last_modified_time   2014-05-29 21:30:44+0200
person_id            8401111750365200000
poi_id               null
start_time           2014-05-27 12:13:05+0200

This row is fetched every time; to detect differences, we'll see how long the iterations take. Network and cluster effects are also out of scope, so everything was tested on a locally running DataStax Cassandra Community (2.0.16) instance.
The code
To separate all possible interfering effects I used two separate projects. I once had a situation where I used the old Thrift API together with the CQL driver, and it significantly affected performance. It also required additional configuration parameters, etc. The main code snippets are located on gist. This is not the focus here, but if somebody is interested:
spring-data
native-drivers
Results in milliseconds
3 fields - 5000 items
  spring-data   5381   5282   5385   avg: 5339
  driver        4426   4280   4469   avg: 4390
  result: driver faster 21.6%

3 fields - 10000 items
  spring-data   8560   8133   8144   avg: 8279
  driver        6822   6770   6875   avg: 6822
  result: driver faster 21.3%

12 fields - 5000 items
  spring-data   5911   5920   5928   avg: 5920 - 10.88 % slower than with 3 fields!
  driver        4687   4669   4606   avg: 4654 - 6 % slower than with 3 fields
  result: driver faster 27%
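For readers who want to check the arithmetic, the "driver faster" figures appear to be the ratio of the two averages minus one. The sketch below assumes that reading; `SpeedupMath`, `average`, and `percentFaster` are hypothetical helper names, not part of the benchmark projects.

```java
import java.util.Locale;

// Hypothetical helper, assuming "X% faster" means (slower / faster - 1) * 100.
final class SpeedupMath {

    // Arithmetic mean of the measured run times in milliseconds.
    static double average(long... runsMillis) {
        long sum = 0;
        for (long run : runsMillis) {
            sum += run;
        }
        return (double) sum / runsMillis.length;
    }

    // How much longer the slower variant takes, as a percentage of the faster variant's time.
    static double percentFaster(double slowerMillis, double fasterMillis) {
        return (slowerMillis / fasterMillis - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        // Spring Data runs for 3 fields, 10000 items.
        System.out.println("avg: " + average(8560, 8133, 8144));
        // Reported averages for 3 fields, 5000 items.
        System.out.println(String.format(Locale.ROOT,
                "driver faster by %.1f%%", percentFaster(5339, 4390)));
    }
}
```

With the reported averages of 5339 ms and 4390 ms this yields roughly 21.6%, and with 5920 ms and 4654 ms roughly 27%.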
Conclusions
Spring Data Cassandra may be very interesting if you want to learn something new. It might also have very positive effects on development when prototyping or doing something similar. I didn't test the higher-order functionality like pagination. This was just a rule-of-thumb test to see what to expect. Basically, the bigger the classes you have to decode, the bigger the deserialization cost. At least this is the effect I'm noticing in my basic tests.
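One way to picture why decoding cost can grow with class size is to compare hand-written decoding with a generic, reflection-based one. The sketch below is a toy illustration only: it is not how Spring Data Cassandra or the driver is implemented, a plain `Map` stands in for a driver row, and `DecodeSketch`, `Activity`, `decodeManually`, and `decodeReflectively` are hypothetical names.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.HashMap;
import java.util.Map;

// Toy illustration only: NOT the implementation of any real mapping framework.
final class DecodeSketch {

    // Hypothetical entity holding three text columns of the activities table.
    static final class Activity {
        String activityState;
        String assetId;
        String customerId;
    }

    // Hand-written decoding: one direct lookup and assignment per column.
    static Activity decodeManually(Map<String, Object> row) {
        Activity a = new Activity();
        a.activityState = (String) row.get("activityState");
        a.assetId = (String) row.get("assetId");
        a.customerId = (String) row.get("customerId");
        return a;
    }

    // Generic decoding: discovers fields at runtime and sets each one reflectively,
    // so the extra work per row grows with the number of fields.
    static <T> T decodeReflectively(Map<String, Object> row, Class<T> type)
            throws ReflectiveOperationException {
        T target = type.getDeclaredConstructor().newInstance();
        for (Field field : type.getDeclaredFields()) {
            if (field.isSynthetic() || Modifier.isStatic(field.getModifiers())) {
                continue; // skip generated and static fields
            }
            field.setAccessible(true);
            field.set(target, row.get(field.getName()));
        }
        return target;
    }

    public static void main(String[] args) throws ReflectiveOperationException {
        // The map is keyed by field name for simplicity; real mappers translate column names.
        Map<String, Object> row = new HashMap<>();
        row.put("activityState", "DONE");
        row.put("assetId", "8400848739855200000");
        row.put("customerId", "8400768435301400000");

        Activity manual = decodeManually(row);
        Activity generic = decodeReflectively(row, Activity.class);
        System.out.println(manual.activityState + " / " + generic.activityState);
    }
}
```

Both methods produce the same object. The generic one simply does more bookkeeping per field, which is the kind of overhead that adds up as the class gets wider.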
Follow up with Object Mapping available in Cassandra driver 2.1
There was an interesting follow-up discussion on reddit. At the suggestion of reddit user v_krishna, another candidate was added to the comparison: the object-mapping API.
Let's see the results:
3 fields - 5000 items
  spring-data   5438   5453   5576   avg: 5489
  object-map    5390   5299   5476   avg: 5388
  driver        4382   4410   4249   avg: 4347
  conclusion
  - driver 26% faster than spring data
  - object map just under 2% faster than spring data

3 fields - 10000 items
  spring-data   8792   8507   8473   avg: 8591
  object-map    8435   8494   8365   avg: 8431
  driver        6632   6760   6646   avg: 6679
  conclusion
  - driver faster 28.6% than spring data
  - object mapping just under 2% faster than spring data

12 fields - 5000 items
  spring-data   6193   5999   5938   avg: 6043
  object-map    6062   5936   5911   avg: 5970
  driver        4910   4955   4596   avg: 4820
  conclusion
  - driver 25% faster than spring data
  - object mapping 1.2% faster than spring data
To keep everything fair: there was some deviation between these test runs and the previous ones. Here are the deviations:
comparison with first run:

3 fields - 5000 items
  spring-data   avg1: 5339   avg2: 5489   2.7% deviation
  driver        avg1: 4390   avg2: 4347   1% deviation

3 fields - 10000 items
  spring-data   avg1: 8279   avg2: 8591   3.6% deviation
  driver        avg1: 6822   avg2: 6679   2.1% deviation

12 fields - 5000 items
  spring-data   avg1: 5920   avg2: 6043   2% deviation
  driver        avg1: 4654   avg2: 4820   3.4% deviation

Object mapping from Spring Data seems to be just a bit slower than the object mapping available in the new driver. I can't wait to see a comparison of the two in future versions. Initially I was expecting object mapping to perform around 5-10% worse than the plain driver. It surprised me a bit that the difference was more on the level of 25%. So if you are planning on using object mapping capabilities, there is a performance penalty.
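The deviation figures in the comparison with the first run appear consistent with taking the absolute difference of the two averages relative to the second run's average. The sketch below assumes that formula; `RunDeviation` and `deviationPercent` are hypothetical names.

```java
import java.util.Locale;

// Hypothetical helper, assuming deviation = |avg2 - avg1| / avg2 * 100.
final class RunDeviation {

    // Relative difference between two averages, as a percentage of the second run.
    static double deviationPercent(double avg1, double avg2) {
        return Math.abs(avg2 - avg1) / avg2 * 100.0;
    }

    public static void main(String[] args) {
        // Spring Data averages for 3 fields, 5000 items from both test rounds.
        System.out.println(String.format(Locale.ROOT,
                "%.1f%% deviation", deviationPercent(5339, 5489)));
    }
}
```

For the Spring Data averages of 5339 ms and 5489 ms this gives about 2.7%, matching the figure listed for that row.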