Cassandra timeouts can be hard to diagnose. Today we will review what a read timeout means in Cassandra and how to fix it.
What does a read timeout mean in a Cassandra cluster?
A read request reached the coordinator, which initially believed that there were enough live replicas to process it. But, for some reason, one or several replicas were too slow to answer within the predefined timeout (read_request_timeout_in_ms in cassandra.yaml), and the coordinator replied to the client with a READ_TIMEOUT error.
This could be due to temporary overloading of these replicas, or because they failed or were shut down. To minimize internal network traffic, Cassandra doesn’t request the full data from every replica during a read; instead, some replicas are only asked for a checksum of the data. A read timeout may therefore occur even if enough replicas responded to fulfill the consistency level, because only checksum responses were received (in the DataStax Java driver, the dataRetrieved parameter of the retry policy’s onReadTimeout method allows you to check whether you’re in that situation).
If the retry policy rethrows the error, the user code will get a ReadTimeoutException.
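To make the checksum-only case concrete, here is a minimal Python sketch of one reasonable retry rule: retry once when enough replicas answered but none of them returned the data itself, and otherwise rethrow. It is a model of the decision only, not the driver's API; ReadTimeoutInfo and should_retry are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReadTimeoutInfo:
    """Hypothetical container for the details reported with a READ_TIMEOUT."""
    required_responses: int  # replicas needed to satisfy the consistency level
    received_responses: int  # replicas that answered before the timeout
    data_retrieved: bool     # True if one of the answers carried the actual data


def should_retry(info: ReadTimeoutInfo, retries_so_far: int) -> bool:
    """Retry at most once, and only when enough replicas answered
    but every answer was a checksum (no data was retrieved)."""
    if retries_so_far > 0:
        return False
    enough_replicas = info.received_responses >= info.required_responses
    return enough_replicas and not info.data_retrieved


# QUORUM on 3 replicas needs 2 responses. Two checksum-only answers: retry.
checksum_only = ReadTimeoutInfo(required_responses=2, received_responses=2, data_retrieved=False)
# Only one replica answered at all: a retry is unlikely to help, so rethrow.
slow_replicas = ReadTimeoutInfo(required_responses=2, received_responses=1, data_retrieved=True)
```

With these two examples, should_retry(checksum_only, 0) returns True, while should_retry(slow_replicas, 0) returns False, which corresponds to the case where the client sees the ReadTimeoutException.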
How to fix a Cassandra read timeout issue
We hit this problem in our environment, and it was related to the network. During the incident we saw ping packet loss, which showed that the interconnect network was affected. The Cassandra node reported read timeouts at the same time.
TCP retransmissions also increased sharply during the incident, which again indicated that something was wrong at the network level.
It turned out that one broken SFP on the switch side caused the problem.
We use Telegraf + InfluxDB + Grafana to monitor our Cassandra cluster.
To catch this kind of problem earlier, we added the following metrics to our monitoring:
- read timeout / write timeout count
- ping packet loss percentage and ping latency
- TCP retransmissions
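As a rough illustration of what the two network metrics measure, the Python sketch below derives a TCP retransmission ratio from text in the Linux /proc/net/snmp format and a packet loss percentage from ping counts. It assumes a Linux host; the function names and the sample numbers are made up for illustration, and this is not how our Telegraf setup is actually configured.

```python
def tcp_retransmit_ratio(snmp_text: str) -> float:
    """Return RetransSegs / OutSegs from text in /proc/net/snmp format.

    The file holds a "Tcp:" header line followed by a "Tcp:" value line.
    """
    tcp_lines = [line.split() for line in snmp_text.splitlines()
                 if line.startswith("Tcp:")]
    if len(tcp_lines) < 2:
        raise ValueError("expected a Tcp header line and a Tcp value line")
    names, values = tcp_lines[0][1:], tcp_lines[1][1:]
    counters = dict(zip(names, (int(v) for v in values)))
    out_segs = counters["OutSegs"]
    return counters["RetransSegs"] / out_segs if out_segs else 0.0


def packet_loss_percent(sent: int, received: int) -> float:
    """Return the percentage of ping packets that got no reply."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return 100.0 * (sent - received) / sent


# Made-up sample in the same layout as /proc/net/snmp (Tcp lines only).
SAMPLE_SNMP = (
    "Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens "
    "AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs "
    "InErrs OutRsts InCsumErrors\n"
    "Tcp: 1 200 120000 -1 120 80 3 5 12 50000 40000 800 0 10 0\n"
)
```

For the sample above, tcp_retransmit_ratio(SAMPLE_SNMP) gives 0.02 (800 retransmitted out of 40000 sent segments), and packet_loss_percent(20, 19) gives 5.0. The kernel counters are cumulative since boot, so for alerting it is more useful to compute the ratio over the difference between two consecutive samples than over the raw totals.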