How to back up a Cassandra database?
Updated: Dec 24, 2020
Apache Cassandra supports two kinds of backup strategies: snapshots and incremental backups.
A snapshot is a copy of a table’s SSTable files at a given point in time, created via hard links. The DDL to create the table is stored as well. Snapshots may be taken by a user or created automatically. The snapshot_before_compaction setting in cassandra.yaml determines whether a snapshot is created before each compaction; it defaults to false. Snapshots are also created automatically before keyspace truncation or the dropping of a table when auto_snapshot is set to true (the default) in cassandra.yaml. Auto snapshots can delay truncates, so another setting in cassandra.yaml determines how long the coordinator waits for truncates to complete. By default that wait is 60 seconds.
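To see why a hard-link snapshot is cheap and still survives the removal of the live file, here is a minimal Python sketch. It does not touch Cassandra at all: the temporary directory, the file name nb-1-big-Data.db and the helper snapshot_by_hard_link are illustrative assumptions that only mimic the mechanism.

```python
import os
import tempfile


def snapshot_by_hard_link(table_dir, tag):
    """Hard-link every regular file in table_dir into table_dir/snapshots/<tag>."""
    snap_dir = os.path.join(table_dir, "snapshots", tag)
    os.makedirs(snap_dir)
    for name in os.listdir(table_dir):
        src = os.path.join(table_dir, name)
        if os.path.isfile(src):  # skips the snapshots/ directory itself
            os.link(src, os.path.join(snap_dir, name))
    return snap_dir


with tempfile.TemporaryDirectory() as table_dir:
    # Stand-in for a live SSTable file (hypothetical name and content).
    live = os.path.join(table_dir, "nb-1-big-Data.db")
    with open(live, "wb") as f:
        f.write(b"row data")

    snap_dir = snapshot_by_hard_link(table_dir, "demo")
    linked = os.path.join(snap_dir, "nb-1-big-Data.db")

    # Both names refer to the same file on disk, so no data was copied.
    same_inode = os.stat(live).st_ino == os.stat(linked).st_ino

    # Removing the live name (as a compaction eventually would) leaves
    # the snapshot's link, and therefore the data, intact.
    os.remove(live)
    with open(linked, "rb") as f:
        survived = f.read()
```

The same property explains why old snapshots hold on to disk space: the data blocks are freed only when the last link to them is removed.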
An incremental backup is a copy of a table’s SSTable files created by a hard link when memtables are flushed to disk as SSTables. Incremental backups are typically paired with snapshots to reduce both backup time and disk usage. They are not enabled by default and must be enabled explicitly, either in cassandra.yaml (with the incremental_backups setting) or with nodetool (the enablebackup command). Once enabled, Cassandra creates a hard link to each SSTable flushed or streamed locally in a backups/ subdirectory of the keyspace data. Incremental backups of system tables are also created.
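The three backup-related settings named above and their defaults can be summarized in a small Python sketch. The helper backup_settings and the sample fragment are illustrative assumptions; the parser only understands simple top-level "key: value" lines and is not a general YAML reader.

```python
# Defaults as described in the text above.
DEFAULTS = {
    "snapshot_before_compaction": False,
    "auto_snapshot": True,
    "incremental_backups": False,
}


def backup_settings(yaml_text):
    """Return the effective backup settings for a cassandra.yaml-like text."""
    settings = dict(DEFAULTS)
    for raw in yaml_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if key in settings:
            settings[key] = value.lower() == "true"
    return settings


# Illustrative fragment, not a complete cassandra.yaml.
sample = """
incremental_backups: true
auto_snapshot: true  # default
"""
result = backup_settings(sample)
print(result)
```

With this fragment only incremental_backups differs from its default, so snapshots before compaction stay off and auto snapshots stay on.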
Data Directory Structure
Running the nodetool snapshot command on a node creates a new folder under the ../snapshots/ folder of every live data directory on that node, for each table included in the snapshot. The new folder is named with a Unix timestamp by default, but you can supply a tag with the -t option of the command. The tag is useful for organizing the snapshots.
Inside that folder, hard links to the SSTable files are created. This snapshot subfolder therefore contains a complete image of the persisted state of the data at the point in time of the snapshot.
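As a rough illustration of this layout, the following Python sketch locates every snapshot folder with a given tag. It assumes a data directory organized as <data_dir>/<keyspace>/<table folder>/snapshots/<tag>; the function find_snapshots and all keyspace, table and tag names in the demo are made up for the example.

```python
import tempfile
from pathlib import Path


def find_snapshots(data_dir, tag):
    """Return (keyspace, table_folder, path) for each snapshot folder named tag."""
    found = []
    for keyspace in sorted(p for p in Path(data_dir).iterdir() if p.is_dir()):
        for table in sorted(p for p in keyspace.iterdir() if p.is_dir()):
            snap = table / "snapshots" / tag
            if snap.is_dir():
                found.append((keyspace.name, table.name, snap))
    return found


with tempfile.TemporaryDirectory() as data_dir:
    # Build a fake data directory with hypothetical keyspaces and tables.
    for rel in ("ks1/users-abc/snapshots/nightly",
                "ks1/orders-def/snapshots/other",
                "ks2/events-123/snapshots/nightly"):
        (Path(data_dir) / rel).mkdir(parents=True)

    names = [(ks, table) for ks, table, _ in find_snapshots(data_dir, "nightly")]
    print(names)
```

Because every table keeps its own snapshots folder, a single tag usually maps to many folders, one per table.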
If you want the snapshot to capture everything written up to that point, you can first run nodetool flush to force a flush on one or more tables, which writes any data still held only in memory and the commitlogs out to SSTables. The snapshot folders, or the SSTables in them, can then be copied to remote storage with your favorite copy tool, such as rsync or scp.
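The flush, snapshot and copy sequence can be sketched as a list of commands. This Python example only builds the argument lists and prints them; nothing is executed. The keyspace, tag, paths and host are hypothetical, and the table folder name is simplified for readability.

```python
def backup_commands(keyspace, tag, snapshot_dir, remote_dir):
    """Return the argument lists for flush, snapshot, and copy, in order."""
    return [
        # Force memtables for the keyspace to disk first.
        ["nodetool", "flush", keyspace],
        # Take a tagged snapshot of the keyspace.
        ["nodetool", "snapshot", "-t", tag, keyspace],
        # Copy one snapshot folder's contents to remote storage.
        ["rsync", "-a", snapshot_dir.rstrip("/") + "/", remote_dir],
    ]


commands = backup_commands(
    keyspace="shop",
    tag="nightly",
    snapshot_dir="/var/lib/cassandra/data/shop/orders/snapshots/nightly",
    remote_dir="backup-host:/backups/node1/shop/orders/nightly/",
)
for cmd in commands:
    print(" ".join(cmd))
```

To run the commands for real, each list could be passed to subprocess.run(cmd, check=True) on the node, once per snapshot folder to copy.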
Make sure that snapshots from different nodes are stored separately. A good way to keep them apart is a structure in remote storage with one parent folder per node and, inside each node’s folder, branches for keyspaces and then tables.
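One possible way to express such a layout is sketched below in Python. The function remote_snapshot_path, the /backups root and the extra tag level at the bottom are assumptions made for illustration, not a prescribed structure.

```python
from pathlib import PurePosixPath


def remote_snapshot_path(root, node, keyspace, table, tag):
    """Build <root>/<node>/<keyspace>/<table>/<tag> for remote storage."""
    return PurePosixPath(root) / node / keyspace / table / tag


# The same table and tag on two nodes end up in separate folders.
a = remote_snapshot_path("/backups", "node1", "shop", "orders", "nightly")
b = remote_snapshot_path("/backups", "node2", "shop", "orders", "nightly")
print(a)
print(b)
```

Putting the node name at the top of the path means files with identical names from different nodes can never overwrite each other.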
Taking incremental backups
Incremental backups can be turned on in Cassandra. Once they are enabled, whenever a memtable is flushed, through normal operations or through a command, and a new SSTable appears in the live data directory for a table, a hard link to that SSTable is created in the ../<keyspace>/backups/ folder for each keyspace. As with snapshots, you must maintain and clean up these backups yourself, since the database does not track them.
Incremental backups and snapshots are both hard links to the live data SSTables. So once you have a snapshot of a keyspace, you can clear the contents of the incremental backups folder for that keyspace: every live file in it is also in the snapshot. The folder might also hold older files if you don’t clean it out regularly.
In other words, you only need to retain the incrementals until your next snapshot is taken. Again, this is not a retention recommendation, only a practical note about the redundancy of the backup layers.
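The redundancy between the two layers can be checked mechanically, because a hard link in backups/ and a hard link in a snapshot point at the same underlying file. The Python sketch below lists which incremental files are already covered by a snapshot. It only reports names and deletes nothing; the directory layout, file names and the helper redundant_incrementals are illustrative assumptions.

```python
import os
import tempfile


def redundant_incrementals(backups_dir, snapshot_dir):
    """Names in backups_dir that are the same file as an entry in snapshot_dir."""
    def file_id(path):
        st = os.stat(path)
        return (st.st_dev, st.st_ino)  # identifies the underlying file

    snap_ids = {file_id(os.path.join(snapshot_dir, n))
                for n in os.listdir(snapshot_dir)}
    return sorted(n for n in os.listdir(backups_dir)
                  if file_id(os.path.join(backups_dir, n)) in snap_ids)


with tempfile.TemporaryDirectory() as table_dir:
    backups = os.path.join(table_dir, "backups")
    snapshot = os.path.join(table_dir, "snapshots", "nightly")
    os.makedirs(backups)
    os.makedirs(snapshot)

    # Two fake SSTables, each hard-linked into backups/ as on flush.
    for name in ("old-Data.db", "new-Data.db"):
        live = os.path.join(table_dir, name)
        with open(live, "wb") as f:
            f.write(name.encode())
        os.link(live, os.path.join(backups, name))

    # Only the older file existed when the snapshot was taken.
    os.link(os.path.join(table_dir, "old-Data.db"),
            os.path.join(snapshot, "old-Data.db"))

    removable = redundant_incrementals(backups, snapshot)
    print(removable)
```

Files flushed after the snapshot are not reported, so they stay in place until the next snapshot covers them. Older files that no snapshot contains are not reported either and need a separate decision.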