Leader/Follower relationship with Primary/Replicas

In most distributed datastore systems, there are many technical terms that describe the system's behavior. While these terms, like “Leader”, “Follower”, “Replication”, “Consistency”, etc., are widely used and helpful, what I find missing are the details about the internal relationships between these terms.
By analogy: while a map of a field is great, it is also important to understand how the soil, water, and sunlight interact to help the plants grow.

In this blog post, I would like to explore the relationship between “Leader-Follower” and “Primary-Replica” in distributed datastore systems.

  1. Mental model of distributed datastore systems:
  2. Leader-Follower vs Primary-Replica:
    1. When is it safe to say Leader = Primary and Follower = Replica?
    2. When is Leader != Primary and Follower != Replica?
  3. Conclusion:

Mental model of distributed datastore systems:

Let's first build a mental model of distributed datastore systems. In general, these systems are designed to store and manage data across multiple nodes or servers. At a very high level, they can be grouped into three models:

  • Leader-Follower model
  • Multi-Leader model
  • Leaderless model

As a quick overview:

  • In the Leader-Follower model, one node is designated as the “Leader” (or “Primary”) and is responsible for handling all write operations. The other nodes, called “Followers” (or “Replicas”), replicate the data from the Leader and handle read operations. The important thing to note here is that all writes always go to the Leader.
  • In the Multi-Leader model, multiple nodes can act as Leaders and handle write operations. Each Leader replicates its data to the other nodes, which can also act as Followers. This model allows for higher availability and fault tolerance, but it can also lead to conflicts if multiple Leaders try to write to the same data simultaneously.
  • In the Leaderless model, there is no designated Leader node. Instead, all nodes are equal and can handle both read and write operations. Data is typically replicated across multiple nodes to ensure availability and fault tolerance. This model can be more complex to manage, as it requires mechanisms to handle conflicts and ensure consistency across all nodes.

There are definitely more nuances to each of these models, but for the purpose of this blog post, we will focus on them conceptually.
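The difference in where writes land under each model can be sketched in a few lines of Python. This is a toy illustration with hypothetical node names, not any real datastore's API:

```python
# Toy sketch: where a write may land under each model.
# Node names and function names are hypothetical.

NODES = ["n1", "n2", "n3"]

def leader_follower_target(key, leader="n1"):
    # Single-leader: every write goes to the one designated Leader.
    return [leader]

def multi_leader_targets(key, leaders=("n1", "n2")):
    # Multi-leader: the client may write to any of several Leaders.
    return list(leaders)

def leaderless_targets(key, replication_factor=3):
    # Leaderless: any node can accept the write; it is then
    # replicated to N nodes.
    return NODES[:replication_factor]

print(leader_follower_target("k"))  # ['n1']
print(multi_leader_targets("k"))    # ['n1', 'n2']
print(leaderless_targets("k"))      # ['n1', 'n2', 'n3']
```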

Leader-Follower vs Primary-Replica:

You will see above that “leader” and “follower” are often used interchangeably with “primary” and “replica”. However, there are some subtle differences between these terms that are important to understand.
Is it safe to say that “Leader” is always “Primary” and “Follower” is always “Replica”? Not necessarily.

  • “Leader” refers to the role of a node in the context of write operations. The Leader is responsible for handling all write requests and coordinating the replication of data to Followers.
  • “Primary” refers to the role of a node in the context of data storage. The Primary is the node that holds the authoritative copy of the data.

When is it safe to say Leader = Primary and Follower = Replica?

To put it more simply, look at it through the lens of the “unit of data” handled in the system. In a Leader-Follower model (like MongoDB), a single unit of data goes to a single node. A Leader-Follower cluster like MongoDB would look something like below: there are shards, and each shard has a single Leader (Primary) and multiple Followers (Replicas). The important thing to note here is that the “unit of data” maps to a single node.

So Leader-Follower systems generally have a one-to-one relationship between Leader and Primary, and between Follower and Replica. The main factor here is that the “unit of data” is mapped to a single node.
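As a rough sketch (with hypothetical shard and node names), the one-to-one mapping looks like this:

```python
# Hypothetical sketch of a MongoDB-style sharded cluster: each shard
# (the "unit of data") has exactly one Primary, so Leader == Primary.

cluster = {
    "shard-A": {"primary": "n1", "replicas": ["n2", "n3"]},
    "shard-B": {"primary": "n4", "replicas": ["n5", "n6"]},
}

def write_target(shard):
    # All writes for a shard go to that shard's single Primary (= Leader).
    return cluster[shard]["primary"]

print(write_target("shard-A"))  # n1
```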

When is Leader != Primary and Follower != Replica?

However, in some distributed datastore systems, the relationship between Leader-Follower and Primary-Replica is more complex. Let's take the leaderless model. The idea of leaderless systems is that all writes don't have to go to a single node. Leaderless systems can be further classified as:

  • Truly leaderless: writes can go to any node, as in Cassandra.
  • Semi-leaderless: writes go to the Primary Elasticsearch shard or Kafka partition. Those Primaries are spread across nodes (not a single host). Unlike truly leaderless systems, writes cannot go to any node; they have to go to wherever the Primary of the “unit of data” (shard/partition) lives.

In both these cases, the relationship between Leader-Follower and Primary-Replica can be many-to-many. For example, in Cassandra, a single write operation can be coordinated by any node, and each node holds the Primary copy for some units of data and Replica copies for others. Similarly, in Kafka, each partition has a single Leader, but the Leaders and Replicas of different partitions are spread across the nodes.
As shown in the image below for Kafka, a single node can be the Leader for one partition (Primary for that unit of data) and a Follower for another partition (Replica for that unit of data). So, in this case, the “unit of data” is not mapped to a single node.
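A toy sketch of such an assignment (hypothetical partition and node names) shows one node wearing both hats:

```python
# Hypothetical Kafka-style partition assignment: the "unit of data" is a
# partition, and each partition's Leader/Follower roles are spread
# across the nodes of the cluster.

partitions = {
    "p0": {"leader": "n1", "followers": ["n2", "n3"]},
    "p1": {"leader": "n2", "followers": ["n1", "n3"]},
}

def roles_of(node):
    """List the role this node plays for each partition."""
    roles = []
    for part, assign in partitions.items():
        if assign["leader"] == node:
            roles.append((part, "leader"))
        elif node in assign["followers"]:
            roles.append((part, "follower"))
    return roles

print(roles_of("n1"))  # [('p0', 'leader'), ('p1', 'follower')]
```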


Conclusion:

The way to digest this is by merging two ideas:

  1. Understand the “unit of data” in the system, and how it maps to nodes
  2. Understand the model of the distributed datastore system (Leader-Follower, Multi-Leader, Leaderless)
  • In Leader-Follower systems (like MongoDB), the “unit of data” maps to a single node, so Leader = Primary and Follower = Replica.
  • In Leaderless systems (like Cassandra, Kafka), the “unit of data” can be spread across multiple nodes, so Leader != Primary and Follower != Replica.

By understanding these relationships, you can better design and manage distributed datastore systems to meet your specific needs.

Message delivery in Distributed Systems

In distributed systems, the principle of message passing between nodes is a core concept. But this leads to an inevitable question: How can we ensure that a message was successfully delivered to its destination?

To address this, there are three types of delivery semantics commonly employed:

At Most Once

At Least Once

Exactly Once

Each of these offers different guarantees and trade-offs when it comes to message delivery. Let’s break down each one:

1. At Most Once

This semantic guarantees that a message will be delivered at most once, with no retries in case of failure. The risk? Potential data loss: if the message fails to reach its destination, it is not retried.

2. At Least Once

Here, the message is guaranteed to be delivered at least once: the sender retries on failure, which can lead to duplicate messages. The system must be designed to handle such duplicates.

3. Exactly Once

This ideal semantic ensures that the message is delivered exactly once. No duplicates, no data loss. While it’s the most reliable, it’s also the most complex to implement, as the system must track and manage message states carefully.
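The three semantics can be contrasted with a small, deterministic toy simulation. All names here are hypothetical (not any real messaging API); the `ack_lost` flag models the failure mode that forces a sender to retry:

```python
# Deterministic toy simulation of the three delivery semantics.
# "ack_lost" models a lost acknowledgment, which is what triggers
# a sender-side retry in practice.

def deliver(msg, inbox, semantics, ack_lost=True):
    if semantics == "at-most-once":
        inbox.append(msg)        # send once; if it were lost, it stays lost
    elif semantics == "at-least-once":
        inbox.append(msg)        # first send actually succeeds...
        if ack_lost:             # ...but the ACK never comes back,
            inbox.append(msg)    # so the sender retries -> duplicate
    elif semantics == "exactly-once":
        if msg not in inbox:     # receiver-side dedup makes retries safe
            inbox.append(msg)

for s in ("at-most-once", "at-least-once", "exactly-once"):
    inbox = []
    deliver("m1", inbox, s)
    print(s, inbox)
# at-most-once ['m1']
# at-least-once ['m1', 'm1']
# exactly-once ['m1']
```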


Achieving the Desired Delivery Semantics

To ensure these semantics are adhered to, we rely on specific approaches. Let’s examine two of the most important ones:

Idempotent Operations Approach

Idempotency ensures that even if a message is delivered multiple times, the result remains unchanged. A simple example is adding a value to a set. Regardless of how many times the message is received, the set will contain the same value.

This approach works well as long as no other operations interfere with the data. If, for example, a value can be removed from the set, idempotency may fail when a retry re-adds the value, altering the result.

Idempotency is closer to the philosophy of statelessness. Each message is treated independently, without caring whether it is new or a repeat. If the signature of the message is the same, it will generate the same output.
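The set example above can be sketched in a few lines (with a hypothetical message value), including the interference case where the guarantee breaks:

```python
# Adding to a set is idempotent: handling the same message repeatedly
# leaves the state unchanged.

state = set()

def handle(msg):
    state.add(msg)   # re-adding an existing element is a no-op

for _ in range(3):   # the same message delivered three times
    handle("user-42")

print(state)  # {'user-42'}

# But if another operation interferes between retries, the result changes:
state.discard("user-42")  # a remove sneaks in...
handle("user-42")         # ...and a late retry re-adds the value
```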

Deduplication Approach

When idempotency isn’t an option, deduplication can help. By assigning a unique identifier to each message, the receiver can track and ignore duplicates. If a message is retried, it will carry the same ID, and the receiver can check whether it has already been processed.

Deduplication generally requires aggressive state tracking: checking the request ID (in a database or cache) before processing every item. The focus of the implementation is that duplicate messages never reach the processing stage at all.

However, there are several challenges to consider:

• How and where to store message IDs (often in a database)

• How long to store the IDs to account for retries

• Handling crashes: What happens if the receiver loses track of message IDs during a failure?
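One way to approach the first two questions is sketched below. An in-memory dict stands in for the database or cache, and a TTL bounds how long IDs are kept; all names are hypothetical:

```python
import time

# Sketch of receiver-side deduplication: track processed message IDs,
# expire them after a TTL so the store doesn't grow forever.

SEEN_TTL_SECONDS = 3600
seen = {}         # message_id -> time it was processed
processed = []    # the actual side effects

def process_once(message_id, payload, now=None):
    now = time.time() if now is None else now
    # Drop IDs older than the TTL.
    for mid in [m for m, t in seen.items() if now - t > SEEN_TTL_SECONDS]:
        del seen[mid]
    if message_id in seen:
        return False               # duplicate: never reaches processing
    seen[message_id] = now
    processed.append(payload)      # the non-idempotent work happens here
    return True

print(process_once("id-1", "hello"))  # True
print(process_once("id-1", "hello"))  # False -- retry carries the same ID
```

Note that this sketch does not solve the crash problem: if the receiver dies between `seen[message_id] = now` and the actual work, the ID is lost along with the in-memory state, which is why real systems persist it.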

My Preference: Idempotent Systems

In my experience, idempotent systems are simpler and less complex than deduplication-based approaches. Idempotency avoids the need to track messages and is easier to scale, making it the preferred choice for most systems, unless the application logic specifically demands something more complex.

Exactly Once Semantics: Delivery vs. Processing

When we talk about “exactly once” semantics, we need to distinguish between delivery and processing:

Delivery: Ensuring that the message arrives at the destination node at the hardware level.

Processing: Ensuring the message is processed exactly once at the software level, without reprocessing due to retries.

Understanding this distinction is essential when designing systems, as different types of nodes—compute vs. storage—may require different approaches to achieve “exactly once” semantics.

Delivery Semantics by Node Type

The role of the node often determines which semantics to prioritize:

Compute Nodes: For these nodes, processing semantics are crucial. We want to ensure that the message is processed only once, even if it arrives multiple times.

Storage Nodes: For storage systems, delivery semantics are more important. It’s critical that the message is stored once and only once, especially when dealing with large amounts of data.


In distributed system design, the delivery semantics of a message are critical. Deciding between “at most once,” “at least once,” or “exactly once” delivery semantics depends on your application’s needs. Idempotent operations and deduplication offer solutions to the challenges of message retries, each with its own trade-offs.

Ultimately, simplicity should be prioritized where possible. Idempotent systems are generally the easiest to manage and scale, while more complex systems can leverage deduplication or exactly once semantics when necessary.