14 Apache Cassandra best Practices for developers & Application Teams
Updated: Nov 6, 2020
No Load balancers in front of cassandra
Cassandra distributes the data across the nodes and most of the cassandra drivers have the algorithm built in to direct requests appropriately . Adding load balancer introduces an additional layer , potentially breaks intelligent algorithms used by driver and also introduces a single point of failure where there is none in cassandra world.
Avoid secondary index whenever possible
In Cassandra secondary index are local. Hence queries using secondary index can cause performance issues in cluster if multiple nodes are accessed . Using secondary index occasionally on a low cardinality column (for e.g. a column with few hundreds of unique state names) is okay but on general try to avoid it. Do not use it for high cardinality columns .
Avoid full table scans
Cassandra distributes the partitions among all the nodes in the cluster. In a large cluster it can be billion of rows .Hence a full table scan involving all records can cause network bottlenecks and extreme heap pressure. Instead tweak your data model such that you do not need to do full table scans.
Keep partition sizes within 100MB
Maximum practical limit is two billion cells per partition . However , you should try to keep the size of your partitions within 100MB . Very large partitions create considerable pressure on heap and can cause slow down of compaction as well.
Do not use batch for bulk loading
Batches are to be used when you want to keep a set of denormalized set of tables in sync. Do not use batch for bulk loading (especially when multiple partition keys are involved) because it can put significant pressure on the coordinator node and detrimental for performance.
Take advantage of prepared statements when possible
Use prepared statement when you are executing a query with same structure multiple times . Cassandra will parse the query string and cache the result . Next time you want the query you can just bind the variables with cached prepared statements . It helps in increasing the performance by skipping the parsing phase for each and every query .
Avoid using IN clause queries with large number of values for multiple partitions
Using 'in' clause queries with large numbers for multiple partitions puts significant pressure on the coordinator node. Also if the coordinator node fails to process that query due to excessive load , the whole thing have to be retried . Instead prefer to use separate queries in these cases to avoid single failure points and overheating the coordinator node.
Do not use SimpleReplicationStrategy in multi-datacenter environments
Simple strategy for replication is used for single datacenter environments .It places the replica in the next nodes clockwise without considering the rack or datacenter location . Hence it should not be used for multi-datacenter environments.
Prefer to use local consistency levels in multi-datacenter environments
If possible and the use case permits always prefer to use local consistency levels in multi-datacenter environments. With local consistency level , local replicas are consulted for preparing a response and this avoids latency of inter-datacenter communication .
Avoid queue like data models
These kind of data models generate a lot of tombstones . A slice query which scans through a lot of tombstones for finding a match is sub optimal . It causes latency issues and also increases heap pressure because it scans through a lot of grabage data for finding a small amount of required data .
Avoid reading before writing pattern
Reads for uncached data may need multiple disk hits . Hence this decreases the throughput of writes considerably which does sequential i/o.
Try to keep the total number of tables in a cluster within a reasonable limit
Large number of tables inside a cluster can cause excessive heap and memory pressure . So try to keep the number of tables in a cluster within a reasonable limit . Due to multiple factors involved with tables its difficult to find a good number but from many tests it has been established that you should try to keep the number of tables within 200 (warning level) and you absolutely should not cross 500 tables(failure level).
Choose Leveled compaction Strategy for read heavy work load if sufficient i/o is available
Assuming nearly uniform rows, leveled compaction strategy guarantees that 90% of all reads are satisfied from a single sstable. Hence it is great for read heavy and latency sensitive use cases . Of course it causes more compaction and hence requires more i/o during compaction.
Note: Its always good to use leveled compaction strategy while creating the table itself . Once the table itself is created , it becomes bit tricky to change it later . Although it can be changed later , do it with caution as it can overload your node with lot of i/o .
Keep your multiple-partition batch size within 5KB
Large batches can cause siginificant performance penalty by overloading the coordinator node . So try to keep your batch size as 5KB per batch when using multiple-partition batch.