How does data distribution work in Cassandra?

Cassandra uses data partitions to distribute data to each node. Partitions are based on the partition keys. Partition function(partition key) = token number.

Table of Contents

Apache Cassandra Data Partitions

Apache Cassandra, a NoSQL database, belongs to the big data family of applications and operates as a distributed system, and uses the principle of data partitioning as explained above. Data partitioning is performed using a partitioning algorithm which is configured at the cluster level while the partition key is configured at the table level.

The Cassandra Query Language (CQL) is designed on SQL terminologies of table, rows and columns. A table is configured with the ‘partition key’ as a component of its primary key. Let’s take a deeper look at the usage of Primary key in the context of a Cassandra cluster.

Get Your Free Linux training!

Join our free Linux training and discover the power of open-source technology. Enhance your skills and boost your career! Start Learning Linux today - Free!

Primary Key = Partition Key + [Clustering Columns]

A primary key in Cassandra represents a unique data partition and data arrangement within a partition. The optional clustering columns handle the data arrangement part. A unique partition key represents a set of rows in a table which are managed within a server (including all servers managing its replicas).

A primary key has the following CQL syntax representations:

DEFINITION 1:

CREATE TABLE server_logs(

log_hour timestamp PRIMARYKEY,

log_level text,

message text,

server text

)

partition key: log_hour

clustering columns: none

DEFINITION 2:

CREATE TABLE server_logs(

log_hour timestamp,

log_level text,

message text,

server text,

PRIMARY KEY (log_hour, log_level)

)

partition key: log_hour

clustering columns: log_level

DEFINITION 3:

CREATE TABLE server_logs(

log_hour timestamp,

log_level text,

message text,

server text,

PRIMARY KEY ((log_hour, server))

)

partition key: log_hour, server

clustering columns: none

DEFINITION 4:

CREATE TABLE server_logs(

log_hour timestamp,

log_level text,

message text,

server text,

PRIMARY KEY ((log_hour, server),log_level)

)WITH CLUSTERING ORDER BY (column3 DESC);

partition key: log_hour, server

clustering columns: log_level

This set of rows is generally referred to as a ‘partition’.

Definition1 has all the rows sharing a ‘log_hour’ as a single partition.
Definition2 has the same partition key as Definition1, but all rows in each partition are arranged with the ascending order ‘log_level’.
Definition3 has all the rows sharing a ‘log_hour’ for each distinct ‘server’ as a single partition.
Definition4 has the same partition as Definition3, but it arranges the rows with descending order of ‘log_level’ within the partition.

Cassandra read and write operations are performed using a partition key on a table. Cassandra uses ‘tokens’ (a long value out of range -2^63 to +2^63 -1) for data distribution and indexing. The tokens are mapped to the partition keys using a ‘partitioner’. The partitioner applies a partitioning function to convert any given partition key to a token. Each node in a Cassandra cluster owns a set of data partitions using this token mechanism. The data is then indexed on each node with the help of the partition key. The takeaway here is, Cassandra uses partition key to determine which node store data on and where to find data when it’s needed. Partition function(partition key) = token number.

*This is a simple representation of tokens, the actual implementation uses Vnodes.

Impacts of data partition on Cassandra clusters

Controlling the size of the data stored in each partition is essential to ensure even distribution of data across the cluster and to get good I/O performance. Below are the impacts Partitioning has on some of the different aspects of a Cassandra cluster:

Read Performance: Cassandra maintains caches, indexes and index summaries to locate partitions within SSTables files on disk. Large partitions cause inefficiency in maintaining these data structures and result in performance degradation. The Cassandra project has made several improvements in this area, especially in version 3.6 where the engine was restructured to be more performant for large partitions and more resilient against memory issues and crashing.
Memory Usage: The partition size directly impacts on the JVM heap size and garbage collection mechanism. Large partitions increase pressure on the JVM heap and make garbage collection inefficient.
Cassandra Repairs: Repair is a maintenance operation to make data consistent. It involves scanning data and comparing with other data replicas followed by data streaming if required. Large partition sizes make it hard to repair data.
Tombstones Eviction: Cassandra uses unique markers called ‘tombstones’ to mark data deletion. Large partitions can contribute to difficulties in tombstone eviction if data deletion pattern and compaction strategy are not appropriately implemented.

Being aware of these impacts helps in an optimal partition key design while deploying Cassandra. It might be tempting to design the partition key to having only one row or a few rows per partition. However, a few other factors might influence the design decision, primarily, the data access pattern and an ideal partition size.

Partitioning Key design

Now let’s look into designing the partitioning key that leads to an ‘ideal partition size’. The practical limit on the size of a partition is two billion cells, but it is not ideal to have such large partitions. The maximum partition size in Cassandra should be under 100MB and ideally less than 10MB. Application workload and its schema design haves an effect on the optimal partition value. However, a maximum of 100MB is a rule of thumb. A ‘large/wide partition’ is hence defined in the context of the standard mean and maximum values.

In the versions after 3.6, it may be possible to operate with larger partition sizes. However, thorough testing and benchmarking for each specific workload is required to ensure there is no impact of your partition key design on the cluster performance.

Below are some best practices to consider when designing an optimal partition key:

A partition key for a table should be designed to satisfy its access pattern and with the ideal amount of data to fit into partitions.
A partition key should not allow ‘unbounded partitions’. An unbounded partition grows indefinitely in size as time passes.

In the server_logs table example, if the server column is used as a partition key it will create unbounded partitions as logs for a server will increase with time. The time attribute of log_hour, in this case, puts a bound on each partition to accommodate an hour worth of data.

A partition key should not create partition skew, in order to avoid uneven partitions and hotspots. A partition skew is a condition in which there is more data assigned to a partition as compared to other partitions and the partition grows indefinitely over time.

In the server_logs table example, suppose the partition key is server and if one server generates way more logs than other servers, it will create a skew.

Partition skew can be avoid
ed by introducing some other attribute from the table in the partition key so that all partitions get even data. If it is not feasible to use a real attribute to remove skew, a dummy column can be created and introduced to the partition key. The dummy column then distinguishes partitions and it can be controlled from an application without disturbing the data semantics.

In the skew example above, consider a dummy column partition smallint is introduced and the partition key is altered to server, partition. Now the application logic sets the partition attribute to 1 until there are enough rows in a partition and then it sets partition to 2 for the same server.

Time Series data can be partitioned using a time element in the partition key along with other attributes. This helps in multiple ways –
it works as a safeguard against unbounded partitions
access patterns can use the time attribute to query specific data
data deletion can be performed for a time-bound etc.

In the server_logs table, all four definitions use the time attribute log_hour. All four definitions are good examples of bounded partitions by the hour value.

Summary

The important elements of the Cassandra partition key discussion are summarized below:

Each Cassandra table has a partition key which can be standalone or composite.
The partition key determines data locality through indexing in Cassandra.
The partition size is a crucial attribute for Cassandra performance and maintenance.
The ideal size of a Cassandra partition is equal to or lower than 10MB with a maximum of 100MB.
The partition key should be designed carefully to create bounded partitions with size in the ideal range.
It is essential to understand your data demographics and consider partition size and data distribution when designing your schema. There are several tools to test, analyse and monitor Cassandra partitions.
The Cassandra version 3.6 and above incorporates significant improvements in the storage engine which provides much better partition handling.

Reference:

http://instaclustr.com/cassandra-data-partitioning/