In Cassandra, data in a replica can become inconsistent with other replicas due to the distributed nature of the database. Node repair corrects the inconsistencies so that eventually all nodes have the same and most up-to-date data. It is an important part of regular maintenance for every Cassandra cluster.
Cassandra Anti-Entropy Repairs
Anti-entropy repair in Cassandra has two distinct phases. To run successful, performant repairs, it is important to understand both of them.
Merkle Tree calculations: This computes the differences between the nodes and their replicas.
Data streaming: Based on the outcome of the Merkle Tree calculations, data is scheduled to be streamed from one node to another. This is an attempt to synchronize the data between replicas.
How to Repair Data in Cassandra?
Cassandra provides the nodetool repair tool to ensure data consistency across replicas; it compares the data across all replicas and then updates the data to the most recent version.
Default Repair Option
$ nodetool repair
This command will repair the current node's primary token range (i.e. range which it owns) along with the replicas of other token ranges it has in all tables and all keyspaces on the current node.
For e.g. If we have replication factor of 3 then total of 5 nodes will be involved in repair: 2 nodes will be fixing 1 partition range 2 nodes will be fixing 2 partition ranges 1 node will be fixing 3 partition ranges. (Command was run on this node)
Repair Primary Token Range
This command repairs only the primary token range of the node in all tables and all keyspaces on the current node:
$ nodetool repair -pr
Repair only the local Data Center on which the node resides
$ nodetool repair -pr -local
Repair only the primary range for all replicas in all tables and all keyspaces on the current node, only by streaming from the listed nodes:
$ nodetool repair -pr -hosts 192.168.0.2, 192.168.0.3, 192.168.0.4
Repair only the primary range for all replicas in the example_ks keyspace on the current node:
$ nodetool repair -pr -- example_ks
Repair only the primary range for all replicas in the test_users table of the example_ks keyspace on the current node.
$ nodetool repair -pr -- example_ks test_users
How to Check Repair Status?
We can check for the first phase of repair (Merkle Tree calculations) by checking nodetool compactionstats.
We can check for repair streams using nodetool netstats. Repair streams will also be visible in our logs. We can grep for them in your system logs like this:
$ grep Entropy system.log INFO [AntiEntropyStage:1] 2016-07-25 07:32:47,077 RepairSession.java (line 164) [repair #70c35af0-526e-11e6-8646-8102d8573519] Received merkle tree for test_users from /192.168.14.3 INFO [AntiEntropyStage:1] 2016-07-25 07:32:47,081 RepairSession.java (line 164) [repair #70c35af0-526e-11e6-8646-8102d8573519] Received merkle tree for test_users from /192.168.16.5 INFO [AntiEntropyStage:1] 2016-07-25 07:32:47,091 RepairSession.java (line 221) [repair #70c35af0-526e-11e6-8646-8102d8573519] test_users is fully synced INFO [AntiEntropySessions:4] 2016-07-25 07:32:47,091 RepairSession.java (line 282) [repair #70c35af0-526e-11e6-8646-8102d8573519] session completed successfully
Active repair streams can also be monitored with this (Bash) command:
$ while true; do date; diff <(nodetool -h 192.168.0.1 netstats) <(sleep 5 && nodetool -h 192.168.0.1 netstats); done
How to troubleshoot Repair in Cassandra?
On each node, we can monitor this with nodetool tpstats, and check for anything "blocked" on the "AntiEntropy" lines.
$ nodetool tpstats Pool Name Active Pending Completed Blocked All time blocked ... AntiEntropyStage 0 0 854866 0 0 ... AntiEntropySessions 0 0 2576 0 0 ...
Understanding Primary Range
When thinking about repair in Apache Cassandra, we need to think in terms of token ranges not tokens. Lets start with the simple case of having one data center with 10 nodes and 100 tokens. Where node N0 is assigned token 0, N1 is assigned token 10, and so on.
With a replication factor of 3, N3 will own data for tokens 1–30. And if we look at where that data will be replicated, we get range 1–10 shared by N1, N2, N3; 11–20 shared by N2, N3, N4; 21–30 shared by N3, N4, N5.
So in this case, the primary range for N3 is 21-30. Rang 1-10 and Rang 11-20 are all replicas.
When should we Run Repair
If we use “-pr” on just one node, or just the nodes in one data center, we will only repair a subset of the data on those nodes. We can use this way to repair the entire cluster to run it on EVERY node in EVERY data center. This can help us avoid repair the same data multiple times.
When running repair to fix a problem, like a node being down for longer than the hint windows, we need to repair the entire token range of that node. So we can’t just run “nodetool repair -pr” on it. We need to initiate a full “nodetool repair” on it, or do a full cluster repair with “nodetool repair -pr”.
Nodetool Repair Steps
Repeat for every table:
1) RepairJob.java Request merkle trees
2) RepairSession.java Receive merkle trees
3) Differencer.java Check for inconsistencies
4) OptionalStreamingRepairTask.java Stream differences--if any
5) RepairSession.java is fully synced
6) StorageService.java Repair session for range (,] finished