Message delivery in Distributed Systems

In distributed systems, the principle of message passing between nodes is a core concept. But this leads to an inevitable question: How can we ensure that a message was successfully delivered to its destination?

To address this, there are three types of delivery semantics commonly employed:

At Most Once

At Least Once

Exactly Once

Each of these offers different guarantees and trade-offs when it comes to message delivery. Let’s break down each one:

1. At Most Once

This semantic guarantees that a message will be delivered at most once, without retries in case of failure. The risk? Potential data loss. If the message fails to reach its destination, it’s not retried.

2. At Least Once

Here, the message is guaranteed to be delivered at least once: failures are retried, which can lead to duplicate messages. The system must be designed to handle such duplicates.

3. Exactly Once

This ideal semantic ensures that the message is delivered exactly once. No duplicates, no data loss. While it’s the most reliable, it’s also the most complex to implement, as the system must track and manage message states carefully.


Achieving the Desired Delivery Semantics

To ensure these semantics are adhered to, we rely on specific approaches. Let’s examine two of the most important ones:

Idempotent Operations Approach

Idempotency ensures that even if a message is delivered multiple times, the result remains unchanged. A simple example is adding a value to a set. Regardless of how many times the message is received, the set will contain the same value.
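To make the set example concrete, here is a minimal Python sketch (the message tag and in-memory set are made up purely for illustration):

processed_tags = set()

def handle_message(tag):
    # Adding to a set is idempotent: delivering the same message again
    # leaves the set – and therefore the result – unchanged.
    processed_tags.add(tag)

handle_message("order-42:shipped")
handle_message("order-42:shipped")   # duplicate delivery, same end state
print(processed_tags)                # {'order-42:shipped'}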

This approach works well as long as no other operations interfere with the data. If, for example, a value can be removed from the set, idempotency may fail when a retry re-adds the value, altering the result.

Idempotency is closer to a stateless philosophy: each message is handled independently, without tracking whether it has been seen before. If the content of the message is the same, processing it produces the same output.

Deduplication Approach

When idempotency isn’t an option, deduplication can help. By assigning a unique identifier to each message, the receiver can track and ignore duplicates. If a message is retried, it will carry the same ID, and the receiver can check whether it has already been processed.

Deduplication generally requires aggressive state tracking: checking the request ID (in a database or cache) before processing every item. The focus of the implementation is that duplicate messages never reach the processing stage at all.
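A rough sketch of that flow, with an in-memory set standing in for the database or cache lookup on the request ID (all names here are illustrative):

seen_request_ids = set()   # stand-in for a DB/cache of processed request IDs

def process(message):
    print("processing", message["request_id"])

def receive(message):
    request_id = message["request_id"]
    if request_id in seen_request_ids:
        return                       # duplicate: drop before it reaches processing
    seen_request_ids.add(request_id)
    process(message)

receive({"request_id": "req-1", "payload": "create user"})
receive({"request_id": "req-1", "payload": "create user"})   # ignored as a duplicate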

However, there are several challenges to consider:

• How and where to store message IDs (often in a database)

• How long to store the IDs to account for retries

• Handling crashes: What happens if the receiver loses track of message IDs during a failure?

My Preference: Idempotent Systems

In my experience, idempotent systems are simpler than deduplication-based approaches. Idempotency avoids the need to track messages and is easier to scale, making it the preferred choice for most systems, unless the application logic specifically demands something more complex.

Exactly Once Semantics: Delivery vs. Processing

When we talk about “exactly once” semantics, we need to distinguish between delivery and processing:

Delivery: Ensuring that the message arrives at the destination node at the hardware level.

Processing: Ensuring the message is processed exactly once at the software level, without reprocessing due to retries.

Understanding this distinction is essential when designing systems, as different types of nodes—compute vs. storage—may require different approaches to achieve “exactly once” semantics.

Delivery Semantics by Node Type

The role of the node often determines which semantics to prioritize:

Compute Nodes: For these nodes, processing semantics are crucial. We want to ensure that the message is processed only once, even if it arrives multiple times.

Storage Nodes: For storage systems, delivery semantics are more important. It’s critical that the message is stored once and only once, especially when dealing with large amounts of data.


In distributed system design, the delivery semantics of a message are critical. Deciding between “at most once,” “at least once,” or “exactly once” delivery semantics depends on your application’s needs. Idempotent operations and deduplication offer solutions to the challenges of message retries, each with its own trade-offs.

Ultimately, simplicity should be prioritized where possible. Idempotent systems are generally the easiest to manage and scale, while more complex systems can leverage deduplication or exactly once semantics when necessary.

Personal “FinOps” with Ledger cli

This post is a geek-out journey this festive season on finding the right tool for my personal finance management.


Where it all started:

Recently, I spoke at the Smarsh Tech Summit on “Cost as an Architectural Pillar,” where I emphasized the importance of considering cost as a first-class citizen in the software development cycle.

However, when I later looked through my personal finances and tried to apply the same principle, I wasn’t very happy with what I found.
I was using one of the apps for finance management, and it was all over the place. Since it was festive time off at work, I started looking around for the best way to fix this and track my personal finances the right way.

When I started looking for tools, I had a set of criteria:

  • Can the tool follow a “local first” approach? I don’t want to share all my financial data with a third-party tool.
  • Can it work well in the terminal? (I spend too much of my time at my laptop – not my phone.)
  • Can I query it via the CLI and get only what I need?

Ledger cli:

While this led me to a few options, nothing came close to what ledger-cli can do.
https://ledger-cli.org/doc/ledger3.html

Managing a full-fledged ledger book for personal finance looked daunting at first sight, but the CLI capabilities kept me hooked, and it has been well worth the time and effort.
Note: If you decide to take this route, I would highly recommend reading through Basics of accounting with ledger
https://ledger-cli.org/doc/ledger3.html#Principles-of-Accounting-with-Ledger

Let’s start with a few examples first on how I use it:

  • Install the ledger CLI. For Mac, from here
  • At the end of this post, you will find an example ledger file. Save it as a “transactions.ledger” file. It has all dummy values. Let’s see a few queries on it first.
  • What is my actual net worth right now?
    • ledger -f transactions.ledger bal assets liabilities
  • What do my expenses look like and how do they tally against source accounts?
    • ledger -f transactions.ledger bal
  • What are my top expenses, sorted by amount spent?
    • ledger -f transactions.ledger reg Expenses -S amount
  • A few other interesting queries:
    • How much did I spend and earn this month? – “ledger bal ^Expenses ^Income --invert”
    • How much did I spend over the course of three days? – “ledger reg -b 01/25 -e 01/27 --subtotal”
  • You can even create a monthly budget and stick within it.

Now that we know what a ledger can do, here are a few features of it:

  • Simple but Powerful Double-Entry Accounting: Ledger CLI follows double-entry bookkeeping, which helps track every asset and liability in a systematic way. It’s not just a checkbook register; it’s a full-fledged accounting system that works in plain text. I write down my expenses, income, and transfers, and it keeps everything balanced.
  • Assets and Liabilities Management: Managing assets like savings accounts or liabilities like credit cards is straightforward. You simply create accounts and keep track of every inflow and outflow. For me, categorizing my finances into different buckets like “Bank”, “Credit Card”, and “Investments” helps give me a full picture.
  • Automation and CLI Integration: One of the best parts of using Ledger CLI is the ease of automation. With simple bash scripts, I’ve automated some repetitive tasks—like importing my bank statements or tallying up expenses at the end of the week. Using cron jobs, I’ve even set up scheduled jobs to summarize my financial status, directly in my terminal, every Sunday (a rough sketch of such a summary script follows this list).
  • Customization with Neovim: Since Ledger CLI is just text, it means I can edit everything directly in Neovim. With some custom syntax highlighting and autocompletion settings, it’s easy to track and categorize transactions quickly. The whole experience is tailored exactly to my taste—simple, keyboard-driven, and powerful.
  • Obsidian plugin for Ledger CLI: I use Obsidian for all my note-taking. Having the Ledger plugin within Obsidian is very convenient when I want to plot expense graphs.
    https://github.com/tgrosinger/ledger-obsidian
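For instance, a weekly summary of the kind mentioned in the automation bullet above could look something like this Python sketch; the file name and period are assumptions, and it simply shells out to the ledger CLI:

import subprocess

LEDGER_FILE = "transactions.ledger"   # assumed path for this sketch

def ledger(*args):
    # Run a ledger CLI command against the file and return its output.
    result = subprocess.run(["ledger", "-f", LEDGER_FILE, *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

print("=== Expenses this week ===")
print(ledger("bal", "Expenses", "-p", "this week"))
print("=== Net worth ===")
print(ledger("bal", "Assets", "Liabilities"))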

The fact that I can manage all my finances from the terminal, keep only a local/git copy of it, and have native Obsidian integrations is working well for ledger-cli and me.

And by the way, you still have to enter the expense entries on your own. There is potential for automating this by importing a statement file, but I am happy maintaining it manually for now.


PS:

Below is what a ledger file looks like (all dummy values) – if you want to play around or use it as a reference template:

  • The first part of the file manages the Assets, Liabilities, Expenses and Income aliases.
  • Starting Balances: this section records the assets and liabilities on day 0.
  • The third part of the file shows the expense entries. Each entry has two parts: the form of expense and the account the expense came from.
alias a=Assets
alias b=Assets:Banking
alias br=Assets:Banking:RD
alias bfd=Assets:Banking:FD
alias c=Liabilities:Credit
alias l=Liabilities
alias e=Expenses
alias i=Income

; Lines starting with a semicolon are comments and will not be parsed.

; This is an example of what a transaction looks like.
; Every transaction must balance to 0 if you add up all the lines.
; If the last line is left empty, it will automatically balance the transaction.

; Starting Balances
; Add a line for each bank account or investment account
2024-09-30 * Starting Balances
  b:HDFC                    ₹150000.00
  b:SBI                     ₹40000.00
  bfd:AxisBank              ₹80000.00
  a:Investments:MutualFunds ₹30000.00
  StartingBalance           ; Leave this line alone

2024-10-01 Gym Membership Payment
  e:Fitness               ₹600.00 ; To this account
  c:HDFCCredit                 ; From this account

2024-10-03 Grocery Shopping at BigBazaar
  e:Groceries            ₹1200.00
  b:SBI

2024-10-04 Netflix Subscription
  e:Entertainment        ₹450.00
  c:AxisCredit

2024-10-06 Restaurant - Dinner with Friends
  e:Dining               ₹1800.00
  b:HDFC

2024-10-08 Salary for October
  b:HDFC                ₹65000.00 ; To this account
  i:JobIncome                     ; Income balances negative in double-entry

2024-10-10 Rent Payment
  e:Rent                 ₹14000.00
  b:HDFC

2024-10-12 Medical Bills
  e:Medical              ₹2000.00
  c:HDFCCredit

2024-10-14 Bike Maintenance
  e:Transport            ₹700.00
  b:SBI

2024-10-15 Investing in Fixed Deposit
  bfd:AxisBank          ₹8000.00
  b:HDFC

2024-10-16 Online Shopping (Amazon)
  e:Shopping             ₹3500.00
  c:AxisCredit

2024-10-18 Electricity Bill Payment
  e:Utilities            ₹1200.00
  b:HDFC

2024-10-20 Travel - Weekend Getaway
  e:Travel               ₹2500.00
  b:HDFC

2024-10-22 Dining - Coffee with Colleagues
  e:Dining               ₹250.00
  b:SBI

2024-10-24 Monthly SIP Investment
  a:Investments:MutualFunds  ₹800.00
  b:HDFC

2024-10-26 Mobile Bill
  e:Communication        ₹350.00
  c:AxisCredit

2024-10-28 Gift for Friend's Birthday
  e:Gift                 ₹1300.00
  b:HDFC

2024-10-30 Savings Transfer to Recurring Deposit
  br:HDFC                ₹4000.00
  b:HDFC

Knowledge management with Obsidian

This is a brain dump on how taking notes and Obsidian as a tool has helped me.

Knowledge management?

As one progresses further into one’s career, knowledge management becomes just as important as finance management. Knowledge accumulation is a non-linear trajectory; most of the time it is compounding in nature. If you don’t organise it, you are always at the mercy of “I had solved this once before but don’t remember how”.

This write-up is a walk-through of Obsidian, which I use for all my notes (work and life).

First things first: Obsidian is just a tool. It only shines when one is in the habit of regularly taking notes. I obsessively write everything down because I don’t trust my mind; I can forget anything and everything. Moreover, it is much more peaceful when I am not compelled to remember everything. I can rely on my notes, which I can always get back to.

I am a fan of Tiago Forte. While he has a book on building a second brain, here is a quick overview where he explains the importance of taking notes: YouTube – 6 mins.

While I do have a case for why Obsidian is better than other tools, it is true that it has a larger initial learning curve. I have tried them all: Evernote, OneNote, Notion. They all have their place and I love Notion in particular. But as the number of notes grows, that is when Obsidian shines. It connects them all and shows the interlinking between them. I had 1200+ notes when I migrated from Notion to Obsidian, and below is how my notes graph looks.

A zoomed in section of it:

But again, why Obsidian?

  • Markdown and simple notes first approach: Obsidian shines at doing one thing and that one thing well – taking notes. It doesn’t focus on Databases and fancy visuals.
  • Connecting the dots: As you collect and write more of your notes, Obsidian ties it all together. You might have written a self-note about ElasticSearch and completely forgotten about it. But Obsidian shows it in your knowledge graph with all potential matches. Tags, auto-linking, Graph view – all work like a charm.
  • Git backup: I was a Notion user before I moved to Obsidian. I used to be paranoid that Notion might block my account for whatever reason. I had to take a weekly backup – just in case. With Obsidian there is more than one way to back up; I use GitHub. More on Obsidian plugins here.
  • Ease with terminal: Since Obsidian deals with markdown files (.md), you don’t have to leave the terminal if you are a terminal person. Nvim has loads of plugins which will make you fall in love with Obsidian, like telescope, vim-markdown, treesitter, etc.
  • Community driven: r/ObsidianMd subreddit has some of the kindest folks. They always help and there are new plugins available almost every week for any fancy stuff.
  • Sync between devices: While other tools like Notion and OneNote shine at this (out of the box), for Obsidian I use the git backup for sync. For my phone I use Syncthing, a file-sync setup for low-latency syncing of all my notes. It works like a charm.
  • Full ownership of your data: One of my favorite features of Obsidian is its use of plain text files by default, which offers several advantages: notes can be accessed offline, edited with any text editor, viewed with various readers, easily synced through services like iCloud, Dropbox, or git, and remain yours forever. Obsidian’s CEO, Steph Ango, elaborated on this philosophy in a blog post here.

Use-case where Obsidian helped:

  • I was recently invited to a weekend geek-out session. The only criterion was to speak about a “not-so-technical” topic.
  • I picked the topic of “Thinking well” and dived into my Obsidian vault. – YouTube link – 20 mins.
  • As I geeked out, to my surprise, three unrelated books connected with each other on the theme of thinking.
  • A fiction, a philosophy, and a non-fiction book. Thanks to my obsessive amount of notes, I was able to link them all and tie them up – out of the box in Obsidian. More importantly, I could see the parallels between the books myself without intentionally thinking about them.
Obsidian notes helping link three books from different streams.

Some resources on Obsidian that I have found useful:

While this write-up is just a brain dump, I don’t intend to say Obsidian is the only way. There may be better ways – it’s just that Obsidian is working well for me right now.

Below are some resources to dive deeper into Obsidian:

Let me know what you folks use for notes! Cheers.

Bloom Filter and Search Optimisation

This writeup is an outcome of a side quest while geeking out on System Design.
In the book “Designing Data-Intensive Applications,” Bloom filters are briefly mentioned in the context of datastores, highlighting their significance in preventing database slowness caused by lookups for nonexistent items.

Below is a curious set of questions on Bloom filters and how they work.

What is the use case for a Bloom filter?

Imagine you are maintaining a datastore which has millions of records. You want to search for an item in the datastore, while you are not sure that the item exists in the first place.

Below is the path for data retrieval, at a very high level (without a bloom filter) on a datastore:

A few points to note here:

  • The item is first looked up in the row cache. If the row cache contains it from a recent access, it is returned.
  • If the row cache doesn’t have the item, the key cache is checked. It contains the positions of row keys within SSTables for recently accessed items. If the item’s key is found, the item can be retrieved directly. Cassandra uses this. Reference link
  • If the above cache layers don’t have the item, it is looked up in the index for the datastore table.
    • While indexes are meant to be fast, the “primary key” we are searching with should have an index in the first place.
    • Even if the index is present, it can have millions of entries. I have seen indexes which are 100+ GB in size.
  • If the item’s location is found from the index, do an SSTable lookup to retrieve the item with a disk seek (if the SSTable is not in memory).

All the above points describe the path where the item exists in the datastore. The worst case is that all the above paths are traversed only to eventually find that the item doesn’t exist.
Is there a way to NOT do an O(n) scan over the stored items to know for sure that an item doesn’t exist in the datastore?

That is where a Bloom filter comes in handy.
The primary use case for a Bloom filter is to make sure that most lookups for non-existent rows or columns do not need to touch disk.

How does a Bloom Filter work?

A Bloom filter is a data structure that helps answer whether an element is a member of a set. Meaning, if you want to know whether an item exists in a datastore, a Bloom filter should help answer it – without scanning the whole datastore.

A Bloom filter does allow for false positives. However, it never produces false negatives. So it may tell you an item is in the store when it doesn’t exist, but if it says an item is definitely not in the set, you can trust it.

At the core of a Bloom filter implementation (a minimal sketch in code follows this list):

  • It contains a long bit array (0s and 1s), initialised to 0. This array is our Bloom filter.
  • A Bloom filter makes use of multiple hash functions (k). These hash functions take an input item and map it to a position on the bit array.
  • When we want to add an element to the Bloom filter:
    • pass the element through the k hash functions (k=3 in the case below)
    • each hash function maps the element to a position on the bit array
    • set the bits at the mapped positions to 1.
  • To check if an element is present in the Bloom filter:
    • pass the element through the k hash functions
    • each hash function maps the element to a position on the bit array
    • if all those bits are 1, the element is probably in the datastore (there might be false positives)
    • if any bit is 0, the element is definitely not in the datastore.
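Here is a minimal, illustrative Python sketch of those two operations (the array size, k=3 and the salt-based hash functions are simplifications for the sketch, not how production datastores implement it):

import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m = m              # size of the bit array
        self.k = k              # number of hash functions
        self.bits = [0] * m     # the bit array, initialised to 0

    def _positions(self, item):
        # Derive k positions by hashing the item with k different salts.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True  -> probably present (false positives possible)
        # False -> definitely not present
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("user:251")
print(bf.might_contain("user:251"))   # True – probably in the datastore
print(bf.might_contain("user:999"))   # almost certainly False – definitely not present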

With a Bloom filter added, if we redraw the data retrieval path for an element in the datastore, it looks like below:

A few other notes:

  • Not all databases/datastores have a Bloom filter built in. Traditional relational databases don’t have one built in. However, it can be implemented at the application layer.
  • NoSQL databases like Cassandra and HBase have Bloom filters built in.
  • Some datastores like DynamoDB use other techniques, like secondary indexes and partitioning, to solve the same use case.

Resource Usage due to Bloom Filters:

  • Bloom filters are kept in memory to give a fast answer to whether an item is NOT present.
  • The size and memory usage of a Bloom filter depend on factors like:
    • Number of Items (n): The number of elements you expect to store.
    • False Positive Rate (p): The acceptable probability of false positives.
    • Number of Hash Functions (k): Typically derived from the desired false positive rate and the size of the bit array.
    • Size of the Bit Array (m): The total number of bits in the Bloom filter.
  • Mathematically deriving the size of a Bloom filter is beyond the scope here, but say we want to store 1 million IDs with a false positive rate of 1% – it would take less than 2 MB of memory for the Bloom filter. A quick back-of-the-envelope check follows below.
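As a sanity check of that estimate, the standard (well-known) Bloom filter sizing formulas can be evaluated directly with a small Python calculation:

import math

n = 1_000_000   # expected number of items
p = 0.01        # acceptable false positive rate

m = -n * math.log(p) / (math.log(2) ** 2)   # optimal bit-array size, in bits
k = (m / n) * math.log(2)                   # optimal number of hash functions

print(f"bit array : {m:,.0f} bits (~{m / (8 * 1024 * 1024):.2f} MB)")
print(f"hash funcs: {k:.1f}")
# bit array : 9,585,059 bits (~1.14 MB) – comfortably under 2 MB
# hash funcs: 6.6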


In summary, Bloom filters prevent unnecessary item lookups and disk seeks for elements that do not exist in datastores. Without the use of Bloom filters (or similar implementations), it is easy for performance to degrade in any datastore due to frequent searches for nonexistent items.


References:

The above explanation is only at the conceptual level. A deeper dive into the math behind the predictions and the implementation in different databases is available in the references below:

  • Paper: Bloom Filter Original Paper – “Space/Time Trade-offs in Hash Coding with Allowable Errors” – link
  • Paper: “Scalable Bloom filters” paper – link
  • Paper: “Bigtable: A Distributed Storage System for Structured Data” – link
    • Side track (non-technical): You will definitely enjoy the story of how Sanjay and Jeff solved early-day issues at Google – “The Friendship That Made Google Huge” – link
  • Cassandra documentation on Bloom Filter usage – link
  • HBase documentation on Bloom Filter usage – link

Weekly Bullet #42 – Summary for the week

Here are a bunch of Technical / Non-Technical topics that I came across recently and found them very resourceful.

Technical :

  • The cost crisis in the observability space is a real problem. Here is an article that describes the issue: – link here
  • How many conferences are too many? Here is an exhaustive list of all the popular talks on Kubernetes from 2023: – link here
  • OpenTelemetry is the industry standard in observability. Here is a list of anti-patterns with observability to avoid:- link here
  • Here is a write-up on a framework for data sharing between microservices – link here
  • GoDaddy Data Platform optimizations: 60% cost reduction and 50% performance improvement – link here
  • I have been deeply exploring Kafka lately, and below are some of the great resources I have found:
    • Effectively Once Delivery and Exactly Once Processing in Kafka – link here
    • Segments, Rolling and Retention in Kafka – link here
    • Kafka internals – link here
  • [Most recommended] “Learn Programming In 10 Years” – A video by ThePrimeTime on Youtube link here (23mins).

Non-Technical :

  • [YouTube] “How Future Billionaires Get Sh*t Done” (20mins): Discusses the importance of having a larger chunk of non-disruptive time to accomplish tasks: – link here
  • Currently, 94% of the universe’s volume is beyond our reach. – How quickly is the Universe disappearing from our reach?
  • You are only as good as your worst days – FS blog explains it the best – link here
  • An extract from a book:

“Wisdom cannot be imparted. Wisdom that a wise man attempts to impart always sounds like foolishness to someone else… Knowledge can be communicated, but not wisdom. One can find it, live it, do wonders through it, but one cannot communicate and teach it.”

“Siddhartha” by Hermann Hesse

Cheers until next time !

Kafka – an efficient transient messaging system

Over the past few years, I have worked on different multi-cluster distributed datastores and messaging systems like – ElasticSearch, MongoDB, Kafka etc.

From the Platform Engineering/SRE perspective, I have seen multiple incidents with different distributed datastores/messaging systems. Typical ones being:

  • uneven node densities (ElasticSearch – how are you creating shards?)
  • client node issues (client/router saturation is a real thing, and they need to be HA)
  • replicas falling behind masters (Mongo – I see you)
  • scaling patterns with cost in mind (higher per-node density without affecting SLOs)
  • Consistency vs Availability – CAP (for delete decisions in the application, you better rely on consistency)
  • network saturation (it happens)
    and more

But Kafka, in my experience, has stood the test of time. It has troubled me the least. And mind you, we didn’t treat it with any kindness.

This led me to try to understand Kafka a little better a few years ago. This write-up is just a dump of all the information I have collected.


  • The white paper – Kafka: a Distributed Messaging System for Log Processing. – link
    • This is one of the initial papers on Kafka from 2011.
    • Kafka has changed/expanded quite a bit since, but this gives good grounding in the design philosophy of Kafka.
  • What does Kafka solve? Design philosophy.
    • Along with being a distributed messaging queue, the design philosophy is around being fast (achieving high throughput) and efficient.
    • Pull-based model for message consumption. Applications can read and consume at the rate they can sustain – built-in rate limiting for the application without a gateway.
    • No explicit caching – it relies on the system page cache instead. I have seen a node with 4 TB of data work just fine with 8 GB of memory.
    • Publisher/consumer model for topics. The smallest unit of scale for a consumer is a partition of a topic.
    • Stateless broker: unlike other messaging services, the information about how much each consumer has consumed is not maintained by the broker but by the consumer itself.
    • A consumer can rewind to an old offset and re-consume data.
    • The above-mentioned white paper has great insights into the design philosophy.
  • Kafka doesn’t have client/router nodes. How does it load balance?
    • Kafka doesn’t need a load balancer. A Kafka cluster just has the broker nodes.
    • A stream of messages of a particular type is configured to go to a topic in Kafka.
    • A topic can have multiple partitions, and each partition has a leader and followers. (The number of followers is based on the replication factor – for HA.)
    • Leaders of the partitions (for a topic) are evenly distributed across brokers, and writes to a topic from a producer always go to the leader partitions.
    • So the load will be evenly distributed across Kafka brokers as long as new messages written to a topic are spread evenly across the partitions of that topic.
    • Spreading messages evenly between partitions is a function of the partition key configured, like in any other distributed system. If no key is set, round robin is used. (A toy sketch of this partition selection follows after this list.)
    • Below is the visualization of 1 topic with 3 partitions and the replication factor set to 3. There are also 3 brokers in the Kafka cluster.
source – https://sookocheff.com/post/kafka/kafka-in-a-nutshell/
  • What makes Kafka efficient?
    • Consumption of messages (events) from Kafka partitions is sequential. Meaning, a consumer always consumes messages in the same order they were written to the partition. There are no random seeks for data, like you might see in a database or other datastores.
    • This data access pattern of reading sequentially rather than randomly makes it faster by orders of magnitude.
    • The idea of using the system page cache rather than building its own cache. This avoids double buffering. Additionally, a warm cache is retained even when the broker restarts. Since the data is always read sequentially, the need for an in-process cache is limited.
    • Very little garbage-collection overhead, since Kafka doesn’t cache messages in-process.
    • Optimized network access for consumers.
      • A typical approach to send any file from local to remote socket involves 4 steps:
        (1) read data from storage in to OS page cache
        (2) copy data from page cache into application buffer
        (3) make another copy to kernel buffer
        (4) send kernel buffer via socket
      • Kafka uses the Unix sendfile API and sends the stored data directly from the OS page cache to the socket. That avoids two copies of data and one system call: steps (2) and (3) are skipped.
      • Since Kafka doesn’t maintain any cache/index of its own (as discussed above), it can skip those two copies.
  • Replication and Leader/Follower in Kafka:
    • Replication is configured at the topic level, and the unit of replication is the partition (as seen in the image attached above).
    • Replication factor 1 in Kafka means there is no replica set, just the source copy. Replication factor 2 means there is one source and one replica.
    • Reads and writes all go to the leader – in the latest versions you can configure reads to go to followers. But since replication is at the partition level, all reads from a consumer of a topic are already spread across more than one node.
    • Also note that the partition count can be higher than the number of brokers. A topic can have 16 (or more – I have seen up to 256) partitions and still have only 3 brokers.
    • The Kafka brokers which host the replicas of a partition and are in sync with the leader form the in-sync replica set (ISR).
    • A follower can be dropped from the in-sync replica set under two conditions:
      • if the leader stops receiving fetch requests/heartbeats from it
      • if the follower falls too far behind the leader – the replica.lag.time.max.ms configuration specifies when replicas are considered stuck or lagging.
  • How is fault tolerance handled in Kafka?
    • Fault tolerance is directly dependent on how we maintain and handle the leader/replicas. There are two ways of doing this in a datastore:
      • Primary-backup replication – {Kafka uses this}
        • The secondaries simply stay in sync with the primary’s data.
        • So for the partition to stay up: if we have F replicas, F-1 can fail.
        • Even if acks=all is set, a producer write will still succeed with F-1 failures, because the leader removes the unhealthy replicas from the ISR. As long as at least one replica is present, the producer moves ahead. — more here
      • Quorum-based replication
        • This is based on a majority-vote algorithm.
        • For F nodes to be allowed to fail, we need 2F+1 brokers in the cluster.
        • A majority-vote quorum is required when selecting a new leader while the cluster goes through rebalancing and partitioning.
        • For a cluster to pick a new leader under a quorum-based setting while tolerating 2 nodes going down, the cluster should have at least 5 nodes (2F+1).
        • More on this – here
    • For the overall cluster state, and for when a message is considered acknowledged by Kafka, the acks setting of the producer is important. Details on that here
  • CAP theorem and Kafka – what to do when replicas go down?
    • Partitions are bound to fail, so it is just a question of consistency vs availability.
    • Consistency: if we want highly consistent data, the cluster can hold reads/writes until all ISRs come back in sync. This adversely affects availability.
    • Availability: if we allow a replica which is not fully caught up to take over when a leader goes down, transactions can still proceed.
    • By default, Kafka from v0.11 favours consistency, but this can be overridden with the unclean.leader.election.enable flag.
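To make the load-balancing point above concrete, here is a toy Python sketch of how a producer can pick a partition: hash the key if one is present, otherwise round robin. This is not Kafka’s actual client code (the real default partitioner uses murmur2 hashing and, in newer versions, sticky partitioning for unkeyed messages); it only illustrates the idea.

import zlib
from itertools import count

class SimplePartitioner:
    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._round_robin = count()

    def partition(self, key=None):
        if key is not None:
            # Same key -> same partition, which also preserves per-key ordering.
            return zlib.crc32(key) % self.num_partitions
        # No key -> spread messages evenly across all partitions.
        return next(self._round_robin) % self.num_partitions

p = SimplePartitioner(num_partitions=3)
print(p.partition(b"user-42"))             # always the same partition for this key
print([p.partition() for _ in range(6)])   # [0, 1, 2, 0, 1, 2]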

Some links and references:

  • The white paper – Kafka: a Distributed Messaging System for Log Processing. – link
  • Kafka basic concepts – link
  • Consumer groups and why they are required in Kafka – link
  • How does Kafka store data in memory – link
  • Different message delivery guarantees and semantics – link
  • Kafka ADR – Insync replicas – link

Weekly Bullet #41 – Summary for the week

Here are a bunch of Technical / Non-Technical topics that I came across recently and found them very resourceful.

Technical :

  • System Design – Designing a Ticket Booking Site Like Ticketmaster is the most common system design question – link
  • UberEngineering blog on Anomaly detection and alerting system – link
  • P99 CONF 2023 | Always-on Profiling of All Linux Threads by Tanel Poder – YouTube link
  • On choosing Golang as a programming language at American Express- link
  • [Gold] : Best Papers awarded in Computer Science – sorted Yearly and Topic wise – link
  • Writing an Engineering Strategy – link
    • “If there’s no written decision, the decision is risky or a trap-door decision, and it’s unclear who the owner is, then you should escalate!”
  • ThePrimeTime is a popular youtuber who is a SWE at Netflix. Here is his take on Leetcode – YouTube Link
  • LWN is one of the few sane newsletters left out there. Here is their take on 2023 – link
  • The Hacker News Top 40 books of 2023 – link (60+ books)

Non-Technical :

  • HBR – 5 Generic reasons why people get laid off – link
  • Paul Graham is one of the most clear thinkers. Here are his most recommended books – link
  • Geeks being geeks – Why does a remote car key work when held to your head/body? – Detailed analysis link
  • Book extract

“The more we want it to be true, the more careful we have to be. No witness’s say-so is good enough. People make mistakes. People play practical jokes. People stretch the truth for money or attention or fame. People occasionally misunderstand what they’re seeing. People sometimes even see things that aren’t there.”

The Demon-Haunted World: Science as a Candle in the Dark – Carl Sagan

See you next time!

[Kubernetes]: CPU and Memory Request/Limits for Pods

In this write-up, we will explore how to make the most of the resources in a K8s cluster for the pods running on it.

Resource Types:

When it comes to resources on a Kubernetes cluster, they can be fairly divided into two categories:

  • compressible:
    • If the usage of this resource by an application goes beyond the max, it can be throttled without directly killing the application/process.
    • Example: CPU – if a container consumes too much of a compressible resource, it is throttled.
  • non-compressible:
    • If the usage of this resource goes beyond the max, it cannot be throttled. It might lead to the process being killed.
    • Example: memory – if a container consumes too much of a non-compressible resource, it is killed.

For each pod on a K8s cluster, there are mainly 4 types of resources which need tuning and management based on the application running:
CPU, memory, ephemeral-storage, hugepages-<size>

Each of the above resources can be managed at the provisioning level and the usage-cap level in K8s. That is where requests/limits in K8s come in handy.

Request/Limits:

Requests and limits are the important parts of resource management for pods and containers.

Requests: where you define how much of a resource your pod needs when it is getting scheduled on a worker node.
Limits: where you define the max value that the resource can stretch to when being consumed on the worker node.

Let’s consider the deployment YAML for an application which has requests/limits defined for CPU and memory.
It is important to note that when a pod is provisioned on a worker node by the Kubernetes scheduler, the value mentioned in requests is taken into consideration. The worker node needs to have the amount of resource described in the requests field for the pod to be scheduled successfully.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  template:
    spec:
      containers:
        - name: app1
          resources:
            requests:
              memory: "64Mi"
              cpu: "250m"
            limits:
              memory: "500Mi"
              cpu: "800m"
    metadata:
      annotations:
        link.argocd.argoproj.io/external-link: app.argo.com/main/


At a high level, the concept of requests/limits is similar to soft/hard limits for resource consumption (much like -Xms/-Xmx in Java). These values are generally defined in the deployment file for the pod.
It is an option to either set requests/limits individually or skip them altogether (based on the kind of resource). If the requests and limits are set incorrectly, this could lead to various issues like:

  • pod instability
  • worker nodes being underused
  • incorrect configuration for compressible and non-compressible resources
  • worker nodes being over-committed
  • directly affecting the quality of service class for a pod (Burstable, Best-Effort, Guaranteed)

Now let’s try to fit different request/limit settings for the CPU and memory resources of an application deployed on a K8s cluster.

CPU :

  • CPU is a compressible resource – it can be throttled.
  • It is an option to NOT set a limit on CPU. In that case, if there is more CPU available and unused on the worker node, the pods without limits can over-commit and use the available CPU.
  • Not setting a limit is an option for compressible resources, because they can be throttled when the worker needs the resource back.
  • If your application needs the Guaranteed quality of service, then set request == limit.
  • Below is a general plot of requests/limits for CPU.

Memory :

  • Memory is a non-compressible resource – it cannot be throttled. If a container uses more memory than allowed, it will be killed by the kubelet.
  • You cannot ignore limits like with the CPU resource, because when the memory needs of the app increase, it will over-commit and affect the worker node.
  • Values for limits and requests should be based on the application’s needs and tuned using production feedback for the container.
  • If your application needs the Guaranteed quality of service, then set request == limit.
  • Below is a general plot of requests/limits for memory.

Resources for further reading:

  • Request/Limits – Kubernetes docs : here
  • Quality of Services classes for pods in Kubernetes – docs : here
  • Resource types in Kubernetes docs : here

[DDIA Book] : Data Models and Query Languages

[Self Notes and Review]:

This is the second write-up in the series of reading the DDIA book and publishing my notes from it.
The first one can be found here.

This particular article is from the second chapter of the book. Again, these are just my self-notes/extracts, so treat this more as an overview/summary. The best way is to read the book itself.

This chapter delves into the details of the format in which we write data to databases and the mechanism by which we read it back.


First – a few terminologies:

  • Relational database – has rows and columns and a schema for all the data.
    • Eg: any SQL database
  • Non-relational database – also known as the document model, NoSQL, etc., targeting the use case where data comes in self-contained documents and relationships between one document and another are rare.
    • Eg: MongoDB, where the data is stored as a single entity, such as a JSON-like document
  • Graph database – where all the data is stored as vertices and edges, targeting the use case where anything is potentially related to everything.
    • Eg: Neo4j, Titan, and InfiniteGraph
  • Imperative language – in an imperative language, like a general programming language, you tell the computer what to do and how to do it. For example: get the data and go over the loop twice in a particular order.
  • Declarative language – in a declarative query language like SQL, used for retrieving data from a database, you only tell it what you want, and how to do it is decided by the query optimizer.

Relational Model(RDBMS) vs Document Model

  • SQL is the best-known relational query language and has lasted for over 30 years.
  • NoSQL is more of an opposite of RDBMS. Sadly, the name “NoSQL” doesn’t actually refer to any particular technology; it is more of a blanket term for all non-relational databases.
  • Advantages of document (NoSQL) over relational (RDBMS) databases:
    • Ease of scaling out in NoSQL stores like Mongo, where you can add more shards – whereas SQL-type (relational) databases are designed more to scale vertically.
    • The ability to store unstructured, semi-structured or structured data in NoSQL, while in an RDBMS you can store only structured data.
    • Ease of updating the schema in NoSQL – in Mongo you can insert docs with a new field and it will work just fine.
    • You can do a blue-green deployment in NoSQL by updating one cluster at a time, but in an RDBMS you may have to take the system down for a schema migration.
    • https://www.mongodb.com/nosql-explained/advantages
  • Disadvantages of document (NoSQL) over relational (RDBMS) databases:
    • You cannot directly pick a value from a nested JSON document in a document db (you need nested references). In a relational db, you can pick a specific value by its column and row criteria.
    • The poor support for joins in document databases may or may not be a problem, depending on the application.

[Use-case]: Relational vs Document models for implementing a LinkedIn profile design:

Source : DDIA
  • Relational Model
    • In a relational model like SQL, user_id can be used as a unique identifier across multiple tables.
    • Regions and industries are common tables which can be reused for different users.
    • IMPORTANT: in the above example, the users table has region_id and industry_id – i.e., it stores an ID and not free text.
      • This helps maintain consistency and avoid ambiguity/duplication. "Greater Boston Area" will have a single id, and the same id will be used for all the profiles that match it.
      • This also helps with updates: you only have to update one place (the regions table) and the change takes effect for all users.
      • The advantage of using an ID is that because it has no meaning to humans, it never needs to change: the ID can remain the same, even if the information it identifies changes. Anything that is meaningful to humans may need to change sometime in the future.
      • “Unfortunately, normalizing this data requires many-to-one relationships (many people live in one particular region, many people work in one particular industry), which don’t fit nicely into the document model. In relational databases, it’s normal to refer to rows in other tables by ID, because joins are easy. In document databases, joins are not needed for one-to-many tree structures, and support for joins is often weak.”
      • In a document db (NoSQL), joins are not strongly supported; you have to pull all the data into the application and do the join as post-processing there. This can sometimes be expensive.
  • Document model
{
  "user_id":     251,
  "first_name":  "Bill",
  "last_name":   "Gates",
  "summary":     "Co-chair of the Bill & Melinda Gates... Active blogger.",
  "region_id":   "us:91",
  "industry_id": 131,
  "photo_url":   "/p/7/000/253/05b/308dd6e.jpg",
  "positions": [
    {"job_title": "Co-chair", "organization": "Bill & Melinda Gates Foundation"},
    {"job_title": "Co-founder, Chairman", "organization": "Microsoft"}
  ],
  "education": [
    {"school_name": "Harvard University",       "start": 1973, "end": 1975},
    {"school_name": "Lakeside School, Seattle", "start": null, "end": null}
  ],
  "contact_info": {
    "blog":    "https://www.gatesnotes.com/",
    "twitter": "https://twitter.com/BillGates"
  }
}
  • Details on the document model:
    • A self-contained document is created in JSON format for the same schema detailed in the above section and stored as a single entity.
    • The lack of an enforced schema in the document model makes it easy to handle data in the application layer.
    • A document db follows a one-to-many relational model for the data of a user, where all the details of the user are present in the same object, locally, in a tree structure.
    • In a document db, an object is read completely at once. If the size of each object is very large, this becomes counterproductive, so it is recommended to keep objects small and avoid writes that keep growing the same object.

Query Optimizer:

  • Query optimizer: when you fire a query which has multiple parts – a where clause, a from clause, etc. – the query optimizer decides which part to execute first in the most optimized way. These choices are called “access paths”, and they are decided by the query optimizer. A developer does not have to worry about the access path, as it is chosen automatically. When a new index is introduced, the query optimizer decides whether using it will be helpful, and takes that path automatically.
  • SQL doesn’t guarantee results in any particular order. “The fact that SQL is more limited in functionality gives the database much more room for automatic optimizations.” (A tiny imperative-vs-declarative contrast is sketched after this list.)
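A tiny contrast of the two styles, using Python with an in-memory SQLite table (the table and data are made up; the point is only who decides the access path):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE animals (name TEXT, family TEXT)")
conn.executemany("INSERT INTO animals VALUES (?, ?)",
                 [("shark", "Sharks"), ("tuna", "Fish"), ("dolphin", "Mammals")])

# Imperative: we spell out HOW – fetch every row, loop, compare, collect.
sharks = [name for name, family in conn.execute("SELECT name, family FROM animals")
          if family == "Sharks"]

# Declarative: we state WHAT we want; the query optimizer picks the access path.
also_sharks = [row[0] for row in
               conn.execute("SELECT name FROM animals WHERE family = 'Sharks'")]

print(sharks, also_sharks)   # ['shark'] ['shark']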

Schema Flexibility:

  • Although document dbs are called schemaless, that only means there is an implicit schema for the data; it is just not enforced by the db.
  • It is more like schema-on-read rather than schema-on-write – meaning that when you read data from a document db, you expect a certain structure and certain fields to exist on it.
  • When the format of the data changes – for example, a full name has to be split into first name and last name – it is much easier in a document DB, where the old data stays as is and new data gets the new field. In a relational database, you would have to perform a schema migration for the pre-existing data. (A small sketch of handling both shapes on read follows this list.)
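A small sketch of what that looks like in application code (field names are illustrative, in the spirit of the book’s first-name/last-name example):

def display_first_name(user):
    # Schema-on-read: new documents already carry first_name,
    # old documents only have the combined "name" field.
    if "first_name" in user:
        return user["first_name"]
    # Old-style document written before the schema change.
    return user.get("name", "").split(" ")[0]

print(display_first_name({"name": "Bill Gates"}))                        # old document
print(display_first_name({"first_name": "Bill", "last_name": "Gates"}))  # new document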

Graph like Data-models:

  • Disclaimer: I have just skimmed through this section, as I have not directly worked with graph-model dbs.
  • When the data has many-to-many relationships, modeling the data in the form of a graph makes more sense.
  • Typical examples of graph-modeling use cases:
    • Social media – linking people together
    • Rail networks
    • Web pages linked to each other
  • Structuring the data of a couple in a graph-like model
Source: DDIA book

Summary:

  • Historically, data started out being represented as one big tree
  • Then engineers found that much of the data is related to other data with many-to-many relationships, so the relational model (SQL) was invented.
  • More recently, developers found that some applications don’t fit well in the relational model either. New nonrelational “NoSQL” datastores have diverged in two main directions:
    • Document databases target use cases where data comes in self-contained documents and relationships between one document and another are rare.
    • Graph databases go in the opposite direction, targeting use cases where anything is potentially related to everything
  • All three models (document, relational, and graph) are widely used today, and each is good in its respective domain.
  • One thing that document and graph databases have in common is that they typically don’t enforce a schema for the data they store, which can make it easier to adapt applications to changing requirements
  • Each data model comes with its own query language or framework. Examples: SQL, MapReduce, MongoDB’s aggregation pipeline, Cypher, SPARQL, and Datalog

Weekly Bullet #40 – Summary for the week

Here are a bunch of Technical / Non-Technical topics that I came across recently and found them very resourceful.

Technical :

  • The shortest and comprehensive System Design Template for any new service – link here
  • Kafka is one of the most efficiently built transient datastores. This article explains the compute and storage layers of Kafka — link here
  • Consistent hashing has helped solve distributed-system problems such as:
    • even shard distribution across nodes in cluster
    • minimum data movement on adding/removing nodes from the cluster
    • A great explanation of consistent hashing at the link here
  • Picking a database is a long-term commitment. Below is a very high-level guide-post. Please take it with a pinch of salt.
Source: ByteByteGo – Big archive
  • I have been geeking out on rate limiting and how it is implemented on large scale systems. Below are a few interesting references for the same:
    • Stripe rate limiter : Scaling your API with rate limits — link here
    • AWS : Throttle API requests for better throughput– link here
    • Rate limiters set at Twitter — link here
    • Out of all the available rate-limiting algorithms (Token Bucket, Leaking Bucket, Fixed Window, Sliding Window, etc.), Sliding Window is the most comprehensive for handling burst load — Sliding Window explained here

Non-Technical :

  • Elon Musk’s biography by Walter Isaacson is out this week. Guess the first sentence in the book? — Amazon link here. Also one of the reviews here
    “I re-invented electric cars and am sending people to mars…did you think I was also going to be a chill, normal dude?”
  • The Project Gutenberg Open Audiobook Collection – you can listen to those audiobooks on Spotify — link here
  • The difference between Measuring and Evaluating — link here
  • Quote from a book:

Did the person take 10 minutes to do their homework? Are they minding the details? If not, don’t encourage more incompetence by rewarding it. Those who are sloppy during the honeymoon (at the beginning) only get worse later

Tools of Titans

Cheers until next time !