Troubleshoot Apache Cassandra timeout issue in Linux
Updated: Nov 6
Cassandra timeout during read query at consistency LocalQuorum (1 replica(s) responded over 2 required)
What does a timeout mean in a Cassandra cluster?
A read request reached the coordinator, which initially believed that there were enough live replicas to process it. But, for some reason, one or several replicas were too slow to answer within the predefined timeout (read_request_timeout_in_ms in cassandra.yaml), and the coordinator replied to the client with a READ_TIMEOUT error.
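The "2 required" in the error message follows from the quorum rule: a quorum is more than half of the replicas. Here is a minimal sketch of that arithmetic. It assumes a replication factor of 3 in the local datacenter, which the error message does not state, and the function names are illustrative only.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that must respond for a quorum read or write."""
    return replication_factor // 2 + 1


def read_times_out(replication_factor: int, responded: int) -> bool:
    """True if too few replicas answered before the timeout expired."""
    return responded < quorum(replication_factor)


# Assumed replication factor of 3 in the local datacenter:
# 2 replicas are required, so a single response is not enough.
required = quorum(3)
timed_out = read_times_out(3, responded=1)
```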
Replicas may respond slowly because they are temporarily overloaded, or because they have failed or been shut down. To minimize internal network traffic, Cassandra does not request the full data from every replica during a read; some replicas are asked only for a checksum of the data. A read timeout can therefore occur even when enough replicas responded to satisfy the consistency level, if only checksum responses were received. In the driver's retry policy, the dataRetrieved parameter of the read-timeout method lets you check whether you are in that situation.
If the retry policy rethrows the error, the client code receives a ReadTimeoutException.
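The role of dataRetrieved can be illustrated with a small decision function. This is a sketch in plain Python, not the driver's API: the function name, the Decision enum, and the retry-once rule are assumptions modeled on how driver default retry policies are commonly described.

```python
from enum import Enum


class Decision(Enum):
    RETRY = "retry"      # send the same read again
    RETHROW = "rethrow"  # surface the timeout to the client code


def on_read_timeout(required: int, received: int,
                    data_retrieved: bool, retry_num: int) -> Decision:
    """Decide what to do after a read timeout (hypothetical policy)."""
    # Never retry more than once, to avoid piling load on slow replicas.
    if retry_num != 0:
        return Decision.RETHROW
    # Enough replicas answered, but only with checksums: the replica
    # asked for the data was the slow one, so one retry is likely to succeed.
    if received >= required and not data_retrieved:
        return Decision.RETRY
    # Too few replicas answered: retrying is unlikely to help.
    return Decision.RETHROW


# The error in this post: 1 replica responded, 2 were required.
decision = on_read_timeout(required=2, received=1,
                           data_retrieved=False, retry_num=0)
```

With these rules, the timeout from the error message above is rethrown, because a retry cannot help while a replica is unreachable.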
How to fix the timeout issue:
We hit this problem in our environment, and it was mostly network-related. While the issue was occurring, we saw ping packet loss, which showed that the interconnect network was affected. That is why the Cassandra node reported read request timeouts at that time. If more nodes had had network problems, the entire Cassandra cluster would have been affected.
During the same period, we also noticed that TCP retransmissions increased sharply, which again indicated a problem at the network level.
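One way to quantify the retransmissions on Linux is to sample the kernel's TCP counters in /proc/net/snmp twice and compare the change in RetransSegs with the change in OutSegs. The sketch below is not the tooling we used; the function names are hypothetical, and the parser takes the file content as a string so that it can be tried on saved samples.

```python
def parse_tcp_counters(snmp_text: str) -> dict:
    """Parse the two 'Tcp:' lines (names, then values) of /proc/net/snmp."""
    rows = [line.split() for line in snmp_text.splitlines()
            if line.startswith("Tcp:")]
    if len(rows) < 2:
        raise ValueError("no Tcp section found")
    names, values = rows[0][1:], rows[1][1:]
    return {name: int(value) for name, value in zip(names, values)}


def retransmit_percent(before: dict, after: dict) -> float:
    """Percentage of segments sent between two samples that were retransmits."""
    sent = after["OutSegs"] - before["OutSegs"]
    retrans = after["RetransSegs"] - before["RetransSegs"]
    return 100.0 * retrans / sent if sent > 0 else 0.0


def read_tcp_counters(path: str = "/proc/net/snmp") -> dict:
    """Read the current counters from a live Linux host."""
    with open(path) as f:
        return parse_tcp_counters(f.read())
```

Calling read_tcp_counters() a few seconds apart and passing both results to retransmit_percent() gives a value that can be graphed or alerted on.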
The issue actually occurred many times, but each occurrence was very brief. We investigated it repeatedly with the network team, and it turned out that STP on the switch side was causing the problem.
To make our Cassandra environment more robust, we set up additional monitoring based on this problem. We use Telegraf + InfluxDB + Grafana to monitor our Cassandra cluster, including:
read timeout/write timeout count
percent of ping packet loss/ping latency
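As an illustration of the ping metrics, the following sketch extracts the packet loss percentage and average latency from the summary printed by the Linux ping command. It assumes the iputils output format ("20% packet loss", "rtt min/avg/max/mdev = ..."); the function and type names are hypothetical and are not part of our monitoring setup.

```python
import re
from typing import NamedTuple, Optional


class PingStats(NamedTuple):
    loss_percent: float
    avg_latency_ms: Optional[float]  # None when no reply was received


def parse_ping_summary(output: str) -> PingStats:
    """Extract packet loss and average round-trip time from ping output."""
    loss = re.search(r"(\d+(?:\.\d+)?)% packet loss", output)
    if loss is None:
        raise ValueError("no packet loss summary found")
    # The rtt line is absent when every packet was lost.
    rtt = re.search(r"=\s*[\d.]+/([\d.]+)/[\d.]+/[\d.]+\s*ms", output)
    avg = float(rtt.group(1)) if rtt else None
    return PingStats(float(loss.group(1)), avg)
```

The input text could come from something like subprocess.run(["ping", "-c", "5", host], capture_output=True, text=True).stdout, run between Cassandra nodes on the interconnect network.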