27 Cassandra Best Practices for Administrators / DBA team
Updated: Nov 6, 2020
Decrease the default read ahead values in unix/linux systems
Most of the modern unix/linux systems use read ahead so that subsequent reads will be cached in to memory . This provides a performance benefit for systems that read data sequentially (e.g. Kafka) .However, in cassandra most normal read operations are not sequential and hence read ahead wastes a lot of memory and i/o . Hence it should be decreased considerably . 8 KB is a safe value to use.
Prefer local storage
Always prefer to use local storage when it is available to you . Some of the cloud providers are coming up with dedicated iops EBS blocks which is okay to use but do extreme testing in those conditions to make sure that your latency requirements are met . Anytime the storage is not locally attached it adds extra latency and can quickly wipe off any benefits of tunings. Do not use SAN/NAS.
Use ntp to sync the times in the servers
Cassandra associates a timestamp with a cell when it is written/updated/deleted. Because the latest update wins during a read , syncing the time across the cassandra servers is crucial. Install ntp in your servers for absolute time synchronization. Also check your ntp.conf file to make sure that the nodes are sync to the server you wanted them to sync with . With a default installation they are are usaully synced with a public pool of servers . A handy way to check the offset and jitter is to use the command ntpq -p
Do not use simple snitch in multi datacenter production deployments
Simple snitch is the default snitch . It should not be used when doing multi-datacenter production deployments as it does not recognize the datacenter and rack information .5
Make sure that the default user name and password are changed for superuser
The default super user name and password combination should be changed to prevent any security exploitation
Always have the auto snapshot feature enabled in production environment
In order to prevent data loss have the auto snapshot feature as default . During the keyspace truncation or drop operations it takes a snapshot automatically and hence strongly advised in order to prevent any impact from unintended drop or truncate operation.
Enable JNA (Java Native access)
JNA can improve the memory usage and native disk access . Hence should be installed and enabled if not already by your installed package
For production cluster make effort to store the opscenter metrics data in a separate cluster
OpsCenter stores the metrics as cassandra keyspace and tables. Hence the metrics data has to go through the same life-cycle as the application data and hence competes with application data for all the resources. Hence in order to prevent any unintended impact on the application data , opscenter metrics data should be stored separately. Refer OpsCenterMetricsData-BestPractices for more details.
Provide sufficient amount of disk overhead for compaction
As a generalized rule, about 50% free disk space and 10% free disk space should be allowed for STCS and LCS respectively. The concept here is that during a compaction both the old sstable and new sstable can co-exist. Hence you need 50% free disk space in worst case scenario.In reality its rare that all the sstable compactions happen at once . Hence consider 50% free disk space as a soft target (not a hard target).
Start your cassandra cluster with at least three replicas
Minimum three nodes in a datacenter are recommended for a cassandra cluster . With less than three nodes you will loose many of the benefits cassandra offers. For e.g. There is no advantage of using quorum consistency level because quorum of two is two ( quorum is calculated as RF/2 + 1 rounded down to nearest whole number ). Essentially this means that cassandra can't tolerate any node failures.
Repair your cluster at least once within the gc grace period
Repairing the cluster once within the gc grace period helps in propagating the deletes to all the replicas . This prevents data resurrection in cassandra in case a unresponsive node was not able to process the delete at the time of operation. Repair also helps if fixing inconsistency issues (if there are any).
Add at least two seed nodes per data center in a multi-datacenter environment
More than one seed node in a local datacenter enables the new nodes to contact a local node for cluster topology discovery in case one of the seed nodes is down in the same datacenter.
Do not use -Xmn flag while using G1 gc
If Xmn flag is set then it will impact the adjusting mechanisms of G1 gc based on the load . G1 gc will no longer able to expand/contract the young generation size and also will not respect the pause time goals .
Disable swap memory
Swapping can cause high latency in cassandra . Hence disable it to avoid latency spike . If the system is low on memory ever then its better for it to get killed rather than serving request with latency degrdation. Because cassandra has multiple replicas available the requests will be sent to the other healthy replicas. In addition cassandra already puts pressure on disk i/o . So we do not want additional pressure during swap operation.
Increase the replication factor for the system_auth keyspace
It provides fault tolerance and prevents inaccessibility to the system in case node containing the security tables is down.
Change the TCP settings to suit cassandra and make them persist after the reboot
Below are the modified settings those should be put in sysctl.conf file . Once added you can reload the settings using 'sudo sysctl -p /etc/sysctl.conf' . These settings mostly help in handling a lot of concurrent connections and prevent idle connection timeouts between nodes during low traffic time.
net.ipv4.tcp_keepalive_time=60 #interval between last data packet sent and first keep alive probe net.ipv4.tcp_keepalive_probes=3 # number of unacknowledged probe to send before considering the connection dead net.ipv4.tcp_keepalive_intvl=10 # interval between subsequent keepalive probes net.core.rmem_max=16777216 # Maximum operating system receive buffer size for connections net.core.wmem_max=16777216 # Maximum operating system send buffer size for connections net.core.rmem_default=16777216 # Default operating system receive buffer size for connections net.core.wmem_default=16777216 # Default operating system send buffer size for connections net.core.optmem_max=40960 # Option memory buffers net.ipv4.tcp_rmem=4096 87380 16777216 # Minimum receive buffer for each TCP connection, Default receive buffer allocated for each TCP scoket, Maximum receive buffer for a TCP socket net.ipv4.tcp_wmem=4096 65536 16777216 # Minimum send buffer for each TCP connection, Default send buffer allocated for each TCP scoket, Maximum send buffer for a TCP socket
Try to keep the total number of tables in a cluster within a reasonable limit
Large number of tables inside a cluster can cause excessive heap and memory pressure . So try to keep the number of tables in a cluster within a reasonable limit . Due to multiple factors involved with tables its difficult to find a good number but from many tests it has been established that you should try to keep the number of tables within 200 (warning level) and you absolutely should not cross 500 tables(failure level).
Set specific user resource limits for cassandra
The user resource limits should be relaxed for cassandra , because we almost always use dedicated instances for cassandra. Below configurations can be modified/added in cassandra-limits.conf or limits.conf file under /etc/security/
<cassandra_user> - memlock unlimited # Unlimited amount of locked-in memory address space for cassandra user <cassandra_user> - nofile 100000 # Increase value for maximum number of open file descriptors for cassandra user <cassandra_user> - nproc 32768 # Increase the limit for maximum number for processes for cassandra user <cassandra_user> - as unlimited # Unlimited address space limit for cassandra user
Keep commitlog in a separate drive
If not using SSD , then keep the data and commilog in separate disk drives . This helps in faster write performance by minimizing the disk head movement.
Prefer a network of bandwidth 1 gigabit or more bandwidth
At minimum, use a network with 1 gigabit capacity . Cassandra can easily saturate a 1 gigabit network during write and repair operations . For high perfomance larger clusters use a 10 gigabit network.
Disable CPU frequency scaling
On linux systems CPU frequency scaling allows dynamic adjustment of clock speeds based on the load on the server. This is not good for cassandra as this can cap the throughput . Hence disable CPU frequency scaling by using the performance governor . It helps in locking the frequcny at maximum possible level.
Do not use byte order partitioner
The byte order partitioner can provide ordered partitioning . Hence it allows ordered scans by primary key . Hence it can be tempting to have this feature available . However its not recommened to use this because it can create hot spots while writing sequentially . It also causes load balancing problems.
In NUMA systems ,zone_reclaim_mode allows different approaches to reclaim memory when a zone runs out of memory. Its recommended that this should be turned off in cassandra set up, as kernel has shown some inconsistency in handling this thus by causing performance problems like random CPU spikes, programs hanging etc.
Disable defragmentation for huge pages
During defragmentation of huge pages memory need to be locked. Most of the times it becomes visible to the program while the memory is being moved for defragmentation and hence causes performance issues for the program . Hence its recommended that defragmentation should be turned off for huge pages.
Do not use G1 GC when on Java 7
Do not use G1 GC in combination with java 7 . G1 gc was first implemented in java 7 and can be considered bit immature in that version . There are known issues like issues with class unloading when G1 gc used with java 7 . Hence to be on safe side do not use G1 gc if you are on java 7 . Upgrade java first(to JDK 8u40) if you want to switch to G1 gc.
Optimize your available SSDs
Set the SysFS rotational flag to false and set the IO scheduler to either deadline or noop. Setting SysFS to false ensures that the drive is considered SSD by the OS . Choosing deadline or noop helps in IO optimization . Use noop when using an array of SSDs with high-end IO controller. Use deadline as a safe option when in doubt.28
Keep the maximum and minimum heap size same
Set the minimum and maximum heap size to same value in order to avoid heap resize as that can cause more pauses than normal due to resize activity.