Encoding: From the POV of Dataflow paths

When studying Chapter 4 of Designing Data-Intensive Applications (Encoding and Evolution), I quickly encountered a level of granularity that seems mechanical: binary formats, schema evolution, and serialization techniques. Yet behind this technical scaffolding lies something conceptually deeper. Encoding is not merely a process of serialization; it is the very grammar through which distributed systems express and interpret meaning. It is the act that allows a system’s internal thoughts — the data in memory — to be externalized into a communicable form. Without it, a database, an API, or a Kafka stream would be nothing but incomprehensible noise.

But why should engineers care about encoding? In distributed systems, encoding preserves meaning as information crosses process boundaries. It ensures independent systems communicate coherently. Poor encoding causes brittle integrations, incompatibilities, and data corruption. Engineers who grasp encoding design systems for interoperability, evolution, and longevity.

This writeup reframes encoding as a semantic bridge between systems by overlaying it with two mental models: the Dataflow Model, which describes how data traverses through software, and the OSI Model, which explains how those flows are layered and transmitted across networks. When examined together, these frameworks reveal encoding as the connective tissue that binds computation, communication, and storage.

  1. So, What is Encoding?
  2. The Dataflow Model: Where Encoding Occurs
    1. Application to Database Communication:
    2. Application to Application Communication:
  3. The OSI Model: Layers of Translation
  4. Example: Workflow
  5. Mental Models:
  6. Other Artifacts:

So, What is Encoding?

All computation deals with data in two representations: the in-memory form, which is rich with pointers, structures, and types meaningful only within a program’s runtime, and the external form (stored on disk / sent over network), which reduces those abstractions into bytes. The act of transforming one into the other is encoding; its inverse, decoding, restores those bytes into something the program can reason about again.

This translation is omnipresent. A database write, an HTTP call, a message on a stream — all are expressions of the same principle: in-memory meaning must be serialized before it can cross a boundary. These boundaries define the seams of distributed systems, and it is at those seams where encoding performs its essential work.

Some of the general encoding formats used across programming languages are JSON, XML, and various binary formats (BSON, MessagePack, Avro, Thrift, etc.).
– When an application sends data to a DB, the encoded format is generally a binary variant (e.g., BSON in MongoDB).
– When Service 1 sends data to Service 2 via an API payload, the data could be encoded as JSON within the request body.
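To make this concrete, here is a minimal Python sketch (assuming the third-party msgpack package as a stand-in for the binary variants; the field values are made up) that encodes the same in-memory object as JSON text and as binary MessagePack:

import json
import msgpack  # third-party; pip install msgpack

record = {"user_id": 251, "first_name": "Bill", "last_name": "Gates"}

json_bytes = json.dumps(record).encode("utf-8")   # textual encoding
binary_bytes = msgpack.packb(record)              # binary encoding

print(len(json_bytes), len(binary_bytes))         # the binary form is typically smaller

# Decoding restores the in-memory structure on the other side of the boundary.
assert json.loads(json_bytes) == msgpack.unpackb(binary_bytes) == record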


The Dataflow Model: Where Encoding Occurs

From the perspective of a dataflow, encoding appears at every point where one process hands information to another. In modern systems, these flows take three canonical forms:

  1. Application to Database – An application writes structured data into a persistent store. The database driver encodes in-memory objects into a format the database can understand — BSON for MongoDB, Avro for columnar systems, or binary for relational storage.
  2. Application to Application (REST or RPC) – One service communicates with another, encoding its data as JSON or Protobuf over HTTP. The receiver decodes the request body into a native object model.
  3. Application via Message Bus (Kafka or Pub/Sub) – A producer emits a serialized message, often governed by a schema registry, which ensures that consumers can decode it reliably.
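To make the third flow concrete, here is a minimal sketch assuming the third-party kafka-python package, a broker at localhost:9092, and a topic named "user-events" (all hypothetical). Production setups often use Avro with a schema registry; plain JSON keeps the sketch simple.

import json
from kafka import KafkaProducer, KafkaConsumer

# Producer: encode the in-memory dict into bytes before it crosses the boundary.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda obj: json.dumps(obj).encode("utf-8"),
)
producer.send("user-events", {"user_id": 251, "event": "profile_updated"})
producer.flush()

# Consumer: decode the bytes back into a native object on the other side.
consumer = KafkaConsumer(
    "user-events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw),
)
for message in consumer:
    print(message.value)  # back to an in-memory dict
    break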

In all these flows, encoding happens at the application boundary. Everything beneath — the network stack, the transport layer, even encryption — concerns itself only with delivery, not meaning. As DDIA succinctly puts it: “Meaningful encoding happens at Layer 7.”

With those details in place, let's expand a little on two dataflow paths and see how encoding happens in each:
(1) Application to Database
(2) Application to Application

Application to Database Communication:

In the case of application-to-database communication, encoding operates as a translator between the in-memory world of the application and the on-disk structures of the database. When an application issues a write, it first transforms its in-memory representation of data into a database-friendly format through the database driver. The driver is the actual component that handles the encoding process. For instance, when a Python or Java program writes to MongoDB, the driver converts objects into BSON—a binary representation of JSON—before transmitting it over the network to the MongoDB server. When the database returns data, the driver reverses the process by decoding BSON back into language-native objects. This process ensures that the semantics of the data remain consistent even as it moves between memory, wire, and storage.

Encoding at this layer, though often hidden from us, is critical for maintaining schema compatibility between the application’s data model and the database schema. It allows databases to be agnostic of programming language details while providing efficient on-disk representation. Each read or write is therefore an act of translation: from structured programmatic state to persistent binary form, and back.

Diagram: encoding dataflow path – Application → DB
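A minimal sketch of that write/read cycle, assuming the pymongo driver, a local MongoDB instance, and hypothetical database/collection names:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
users = client["app_db"]["users"]

# The driver encodes this Python dict into BSON before sending it to the server.
users.insert_one({"user_id": 251, "first_name": "Bill", "last_name": "Gates"})

# On reads, the driver decodes BSON back into a language-native dict.
doc = users.find_one({"user_id": 251})
print(doc["first_name"])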

Application to Application Communication:

When two applications exchange data, encoding ensures that both sides share a consistent understanding of structure and semantics. In HTTP-based systems, Service A (client) serializes data into JSON and sends it as the body of a POST or PUT request. The server (Service B) decodes this payload back into an internal data structure for processing. The HTTP protocol itself is merely the courier—the JSON payload is the encoded meaning riding inside the request. This pattern promotes interoperability because nearly every platform can parse JSON.

  • S1 serializes the payload → JSON text (this is the encoding part)
  • HTTP sends that text as the request body (this is the important part I had missed earlier)
  • S2’s HTTP server framework reads it and parses it into native objects
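A minimal sketch of both sides of this exchange, assuming the requests package on the client and Flask on the server; the URL and route are hypothetical, and in reality the two halves live in separate services.

# --- S1 (client): serialize to JSON and send it as the request body ---
import requests

payload = {"user_id": 251, "action": "follow"}
resp = requests.post("http://service-b.local/api/actions", json=payload)  # json= encodes the body and sets Content-Type

# --- S2 (server): the framework parses the body back into native objects ---
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.post("/api/actions")
def handle_action():
    data = request.get_json()  # decode the JSON body into a Python dict
    return jsonify({"status": "ok", "user_id": data["user_id"]})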

In contrast, systems employing gRPC communicate using Protocol Buffers, a binary schema-based format. gRPC compiles the shared .proto file into stubs for both client and server, ensuring a strong contract between them. When Service A invokes a method defined in this schema, the gRPC library encodes the message into a compact binary stream, transmits it via HTTP/2, and Service B decodes it according to the same schema. The encoding format—textual JSON for REST or binary Protobuf for gRPC—defines not only the data structure but also the performance characteristics and coupling between services.
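As a contrast with the JSON path, here is a minimal Protocol Buffers sketch. It assumes a hypothetical user.proto (message User { int64 user_id = 1; string first_name = 2; }) already compiled with protoc into a user_pb2 module; the gRPC transport wiring is omitted so only the encode/decode step is visible.

import user_pb2  # hypothetical module generated by protoc from user.proto

msg = user_pb2.User(user_id=251, first_name="Bill")
wire_bytes = msg.SerializeToString()   # compact binary encoding, governed by the shared schema

received = user_pb2.User()
received.ParseFromString(wire_bytes)   # decoded on the other side using the same schema
print(received.first_name)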

The OSI Model: Layers of Translation

If you note the details in the section above, most of the encoding we discuss happens at Layer 7 (the application layer). Hence the protocols we talk about are all L7 protocols – HTTP, gRPC, etc.

With that point in mind, I tried to overlay the OSI networking model on top of the encoding model, to understand it better and stitch the two together.

While most of the translation of data during encoding happens at L7, the other layers within the OSI model also do their own form of encoding. Each layer wraps the one above it like a nested envelope, performing its own encoding and decoding. But while Layers 1–6 ensure reliable delivery, only Layer 7 encodes meaning. A JSON document or a Protobuf message exists entirely at this level, where software systems express intent and structure.

Layer 7  Application   → HTTP + JSON / gRPC + Protobuf
Layer 6  Presentation  → TLS encryption, UTF‑8 conversion
Layer 5  Session       → Connection management
Layer 4  Transport     → TCP segmentation, reliability
Layer 3  Network       → IP addressing, routing
Layer 2  Data Link     → Frame delivery, MAC addressing
Layer 1  Physical      → Bits on wire, voltage, light

Example: Workflow

With the above details, let's walk through a use case of encoding between two services (S1 and S2) that talk to each other RESTfully via APIs.
Let's plot the flow of encoding with the OSI model overlaid on top of it.
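Here is a rough Python sketch of that S1 → S2 call with the OSI layers annotated as comments. Only Layer 7 is visible to application code; everything below it is handled by the TLS/TCP/IP stack. The URL is hypothetical.

import requests

profile = {"user_id": 251, "summary": "Co-chair of the foundation"}

# L7 (Application): encode meaning as JSON and wrap it in an HTTP request
resp = requests.post("https://service-b.example.com/profiles", json=profile)

# L6 (Presentation): TLS encrypts the HTTP bytes; text is UTF-8 encoded
# L5 (Session):      the TLS/TCP session is established and managed
# L4 (Transport):    TCP segments the bytes and guarantees ordered delivery
# L3 (Network):      IP routes the packets to service-b's address
# L2 (Data Link):    frames hop across each local network segment
# L1 (Physical):     bits travel as electrical/optical/radio signals

# Back at L7 on S2, the framework decodes the JSON body into native objects;
# the response below makes the same journey in reverse.
print(resp.status_code, resp.json())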

Mental Models:

To conclude, below are the mental models to think through, when considering Encoding:

  • Different dataflow patterns (app → DB, app → app, app → Kafka → app)
  • Encoding at different OSI layers (L7 all the way till L1)

Other Artifacts:

[DDIA Book]: Data Models and Query Languages

[Self Notes and Review]:

This is the second writeup in the series where I read the DDIA book and publish my notes from it.
The first one can be found here

This particular article is from the second chapter of the book. Again, these are just my self notes/extracts, so treat this more as an overview/summary. The best way is to read the book itself.

This chapter delves into the details of the format in which we write data to databases and the mechanism by which we read it back.


First – a few terms:

  • Relational Database – has rows and columns and a schema for all the data
    • Eg: SQL databases such as MySQL and PostgreSQL
  • Non-Relational Database – also known as the document model, NoSQL, etc., targeting the use case where data comes in self-contained documents and relationships between one document and another are rare
    • Eg: MongoDB, where the data is stored as a single entity, like a JSON object
  • Graph Database – where all the data is stored as vertices and edges, targeting the use case where anything is potentially related to everything
    • Eg: Neo4j, Titan, and InfiniteGraph
  • Imperative language – in an imperative language, like a general-purpose programming language, you tell the computer what to do and how to do it – e.g., get the data and loop over it twice in a particular order
  • Declarative language – in a declarative query language like SQL, used for retrieving data from a database, you tell it only what you want – how to do it is decided by the query optimizer (see the sketch below)
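To illustrate the contrast, here is a small sketch (the data and query are made up, echoing the book's sharks example): the imperative version spells out how to loop, while the declarative SQL states only what is wanted and leaves the how to the query optimizer.

animals = [
    {"name": "Great White", "family": "Sharks"},
    {"name": "Lion", "family": "Cats"},
]

# Imperative: explicit loop and control flow decide *how* the result is produced
sharks = []
for animal in animals:
    if animal["family"] == "Sharks":
        sharks.append(animal)

# Declarative: describe *what* you want; the optimizer decides the access path
query = "SELECT * FROM animals WHERE family = 'Sharks';"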

Relational Model (RDBMS) vs Document Model

  • SQL is the best-known face of the relational model (RDBMS) and has lasted for over 30 years.
  • NoSQL is more of an opposite of RDBMS. Sadly, the name “NoSQL” doesn’t actually refer to any particular technology; it is more of a blanket term for all non-relational databases.
  • Advantages of Document (NoSQL) over Relational (RDBMS) databases:
    • Ease of scaling out in NoSQL databases like MongoDB, where you can add more shards – whereas SQL-type (relational) databases are designed more to scale vertically.
    • Ability to store unstructured, semi-structured, or structured data in NoSQL – while in an RDBMS you can store only structured data.
    • Ease of updating the schema in NoSQL – e.g., in MongoDB you can insert docs with a new field and it will work just fine.
    • You can do a blue-green-style rollout in NoSQL by updating one cluster at a time, whereas in a relational database a schema migration may require taking the system down.
    • https://www.mongodb.com/nosql-explained/advantages
  • Disadvantages of Document (NoSQL) over Relational (RDBMS) databases:
    • You cannot directly pick a value from deep inside a nested JSON document in a document DB (you need nested references). In a relational DB, you can pick a specific value by its column and row criteria.
    • The poor support for joins in document databases may or may not be a problem, depending on the application.

[Use-case]: Relational vs Document models for implementing a LinkedIn profile:

Figure: representing a LinkedIn profile using a relational schema (Source: DDIA)
  • Relational Model
    • In a relational model like SQL, a user_id can be used as a unique identifier across multiple tables.
    • regions and industries are common lookup tables that can be shared across different users.
    • IMPORTANT: in the above example, the users table has region_id and industry_id – i.e., it stores an ID and not free text.
      • This helps maintain consistency and avoid ambiguity/duplication. “Greater Boston Area” will have a single ID, and the same ID will be used for all the profiles that match it.
      • This also helps with updates: you only have to update one place (the regions table) and the change takes effect for all users.
      • The advantage of using an ID is that because it has no meaning to humans, it never needs to change: the ID can remain the same, even if the information it identifies changes. Anything that is meaningful to humans may need to change sometime in the future.
      • “Unfortunately, normalizing this data requires many-to-one relationships (many people live in one particular region, many people work in one particular industry), which don’t fit nicely into the document model. In relational databases, it’s normal to refer to rows in other tables by ID, because joins are easy. In document databases, joins are not needed for one-to-many tree structures, and support for joins is often weak.”
      • In a document DB (NoSQL), joins are not well supported; you have to pull the data into the application and do the join there as post-processing, which can be expensive at times.
  • Document model
{
  "user_id":     251,
  "first_name":  "Bill",
  "last_name":   "Gates",
  "summary":     "Co-chair of the Bill & Melinda Gates... Active blogger.",
  "region_id":   "us:91",
  "industry_id": 131,
  "photo_url":   "/p/7/000/253/05b/308dd6e.jpg",
  "positions": [
    {"job_title": "Co-chair", "organization": "Bill & Melinda Gates Foundation"},
    {"job_title": "Co-founder, Chairman", "organization": "Microsoft"}
  ],
  "education": [
    {"school_name": "Harvard University",       "start": 1973, "end": 1975},
    {"school_name": "Lakeside School, Seattle", "start": null, "end": null}
  ],
  "contact_info": {
    "blog":    "https://www.gatesnotes.com/",
    "twitter": "https://twitter.com/BillGates"
  }
}
  • Details on Document model:
    • A self-contained document, created in JSON format for the same profile detailed in the section above, and stored as a single entity.
    • The lack of an enforced schema in the document model makes it easy to handle data in the application layer.
    • A document DB follows a one-to-many relational model for a user’s data – all the details of the user are present in the same object, locally, in a tree structure.
    • In a document DB, an object is read completely at once. If each object is very large, this becomes counterproductive, so it is recommended to keep documents small and avoid writes that keep growing the same document.

Query Optimizer:

  • Query Optimizer: when you fire a query that has multiple parts – a WHERE clause, a FROM clause, etc. – the query optimizer decides which part to execute first in the most optimized way. These choices are called “access paths” and are decided by the query optimizer. A developer does not have to worry about the access path, as it is chosen automatically. When a new index is introduced, the query optimizer decides whether using it will be helpful and takes that path automatically.
  • SQL doesn’t guarantee results in any particular order. “The fact that SQL is more limited in functionality gives the database much more room for automatic optimizations.”

Schema Flexibility:

  • In the case of document DBs, although they are called schemaless, that only means there is an implicit schema for the data that is not enforced by the DB.
  • It is more that schema-on-read is maintained rather than schema-on-write – meaning that when you read data from a document DB, you expect some kind of structure and certain fields to exist on it.
  • When the format of the data changes – for example, a full name has to be split into first name and last name – it is much easier in a document DB: the old documents stay as they are, and new documents carry the new fields. In a relational database, you have to perform a schema migration for pre-existing data. A small sketch of this follows.
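A minimal schema-on-read sketch of that full-name split (the field names are illustrative): old documents still carry "name", newer documents carry "first_name"/"last_name", and the application handles both shapes at read time.

def first_name_of(user_doc: dict) -> str:
    if "first_name" in user_doc:                      # new-style document
        return user_doc["first_name"]
    return user_doc.get("name", "").split(" ")[0]     # old-style: derive the value on read

print(first_name_of({"name": "Bill Gates"}))                        # old document
print(first_name_of({"first_name": "Bill", "last_name": "Gates"}))  # new document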

Graph-like Data Models:

  • Disclaimer: I have just skimmed through this section, as I have not worked directly with graph-model DBs.
  • When the data involves many-to-many relationships, modeling it as a graph makes more sense.
  • Typical examples of graph-modeling use cases:
    • Social media – linking people together.
    • Rail networks
    • Web pages linked to each other
  • Structuring the data of a couple in a graph-like model (figure in the book; Source: DDIA)

Summary:

  • Historically, data started out being represented as one big tree
  • Then engineers found that a lot of data involves many-to-many relationships, which don’t fit well into a tree. So the relational model (SQL) was invented.
  • More recently, developers found that some applications don’t fit well in the relational model either. New nonrelational “NoSQL” datastores have diverged in two main directions:
    • Document databases target use cases where data comes in self-contained documents and relationships between one document and another are rare.
    • Graph databases go in the opposite direction, targeting use cases where anything is potentially related to everything
  • All three models (document, relational, and graph) are widely used today, and each is good in its respective domain.
  • One thing that document and graph databases have in common is that they typically don’t enforce a schema for the data they store, which can make it easier to adapt applications to changing requirements
  • Each data model comes with its own query language or framework. Examples: SQL, MapReduce, MongoDB’s aggregation pipeline, Cypher, SPARQL, and Datalog

[DDIA Book]: Reliable, Scalable and Maintainable Applications

[Self Notes and Review]:

This is a new series where I publish my self notes/extracts from reading the very famous book – DDIA (Designing Data-Intensive Applications) by Martin Kleppmann.

This particular article is from the first chapter of the book. Again, these are just my self notes/extracts, so treat this more as an overview/summary. The best way is to read the book itself.

Side note: I am a terribly slow and repetitive reader. The update between chapters might take weeks.

Reliable, Scalable and Maintainable Applications

  • CPU is not a constraint anymore in computing. CPUs these days are inexpensive and powerful.
  • The general problems these days are the complexity of data, the amount of data, and the rate at which the data changes.
  • Below are the common functionalities of a data intensive application
    • Store data so that they, or another application, can find it again later (databases)
    • Remember the result of an expensive operation, to speed up reads (caches)
    • Allow users to search data by keyword or filter it in various ways (search indexes)
    • Send a message to another process, to be handled asynchronously (stream processing)
    • Periodically crunch a large amount of accumulated data (batch processing)
  • [not imp but note]: “Although a database and a message queue have some superficial similarity—both store data for some time—they have very different access patterns, which means different performance characteristics, and thus very different implementations.”
    • “there are datastores that are also used as message queues (Redis), and there are message queues with database-like durability guarantees (Apache Kafka). The boundaries between the categories are becoming blurred.”

Reliability

  • “The system should continue to work correctly (performing the correct function at the desired level of performance) even in the face of adversity (hardware or software faults, and even human error).”
  • “The things that can go wrong are called faults, and systems that anticipate faults and can cope with them are called fault-tolerant or resilient.”
  • “Note that a fault is not the same as a failure [2]. A fault is usually defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user”
  • Every software system must be designed to tolerate some kinds of faults rather than prevent every one of them – though some kinds of failures are better prevented, e.g., security-related failures.
  • Hardware Faults
    • Hard disk crashes, faulty RAM, power grid failures, someone unplugging the wrong network cable.
    • “Hard disks are reported as having a mean time to failure (MTTF) of about 10 to 50 years [5, 6]. Thus, on a storage cluster with 10,000 disks, we should expect on average one disk to die per day.” (A quick sanity check: an MTTF of ~27 years is about 10,000 days, so each disk has roughly a 1-in-10,000 chance of dying on any given day; with 10,000 disks that works out to about one expected failure per day.)
    • For hardware failures, the first solution is to build in redundancy so that the failure of one component can be handled – e.g., having replicas. The software should also be tolerant of hardware failures – e.g., make the system read-only when more than 2/3 of the nodes are down.
    • “On AWS it is fairly common for virtual machine instances to become unavailable without warning [7], as the platforms are designed to prioritize flexibility and elasticity over single-machine reliability.”
  • Software Faults
    • There can be systemic errors that cause all the nodes of a cluster to go down as a ripple effect. Example: one node of a DB cluster dies, and all the heavy queries that killed node 1 are shifted to node 2. The cluster now has one less node but has to deal with all the load, leading to failure of the remaining nodes.
    • “The bugs that cause these kinds of software faults often lie dormant for a long time until they are triggered by an unusual set of circumstances.”
    • “There is no quick solution to the problem of systematic faults in software. Lots of small things can help: carefully thinking about assumptions and interactions in the system; thorough testing; process isolation; allowing processes to crash and restart; measuring, monitoring, and analyzing system behavior in production.”
  • Human Errors
    • “Even when they have the best intentions, humans are known to be unreliable”
    • One study found that configuration errors by operators were a leading cause of outages, whereas hardware faults played a role in only 10–25% of outages.
    • Some ways to consider in design
      • “Design systems in a way that minimizes opportunities for error. For example, well-designed abstractions, APIs, and admin interfaces make it easy to do “the right thing” and discourage “the wrong thing.” However, if the interfaces are too restrictive people will work around them, negating their benefit, so this is a tricky balance to get right.”
      • A staging environment for people to try, explore, and fail safely
      • Testing deeply
      • Make recovery easy – rollback should always be fast
      • “Set up detailed and clear monitoring, such as performance metrics and error rates. In other engineering disciplines this is referred to as telemetry.”

Scalability

  • “As the system grows (in data volume, traffic volume, or complexity), there should be reasonable ways of dealing with that growth.”
  • One common reason for degraded performance is load higher than the system was designed for – more users, or the application handling more data than it did before.
  • Questions to consider during the design of a scalable application: “If the system grows in a particular way, what are our options for coping with the growth?” and “How can we add computing resources to handle the additional load?”
  • Consider Twitter system design solution for scalability
    • Twitter has two main operations –
      • (1) Post a Tweet – (4.6k requests/sec on average, over 12k requests/sec at peak)
      • (2) Pull the timeline – (300k requests/sec).
    • So, most of the operations are around pulling the timeline – i.e., reading tweets. Twitter’s challenge is not handling the number of people who tweet, but the number of people who read and pull those tweets on their timelines.
    • There are two ways to implement the solution.
      1. Every time someone tweets, write it to a DB. When followers pull their timelines, fetch the tweet from the DB.
      2. Every time someone tweets, deliver it to all of their followers, more like mail – keep it in each follower’s timeline cache. When followers pull their timelines, the tweets come from their cache instantly.
    • Option 2 is more effective because the number of people who tweet is far smaller than the number who pull timelines. But it means more work at tweet time – a second-order effect.
      • Let’s say I have 10 million followers. When I tweet, I have to update the caches of 10 million followers so that the tweets are ready when they pull their timelines.
      • To avoid that, a hybrid model can be followed: if a user has more than, let’s say, 5 million followers, keep their tweets in a common store instead of fanning out. When a follower pulls the timeline, use both option 1 and option 2 and merge the results based on whom they follow.
  • Average response times – and why you should avoid “average” in general.
    • Averages are misleading: outliers skew the reported number, and an average doesn’t tell you how many users actually experienced the delay.
    • The average is just the arithmetic mean (given n values, add them all up and divide by n).
    • “Usually it is better to use percentiles. If you take your list of response times and sort it from fastest to slowest, then the median is the halfway point: for example, if your median response time is 200 ms, that means half your requests return in less than 200 ms, and half your requests take longer than that” (a small sketch of computing percentiles follows after this list)
      • note: the median is the same as p50 – the 50th percentile
    • “if the 95th percentile response time is 1.5 seconds, that means 95 out of 100 requests take less than 1.5 seconds, and 5 out of 100 requests take 1.5 seconds or more”
    • also, Latency and Response time are not the same. Response time is what the client sees(processing time+network time+client render time). Latency is however the time spent by the request waiting to be served. Latent – awaiting service.
    • more on percentile – here
  • Two imp questions to answer during performance testing
    • If I increase the load without increasing the system resources, how is the performance of the system affected? Is it usable at all?
    • How much of the resources and what all services have to be scaled when the load increases, so that the performance of the application is not degraded?
  • “a system that is designed to handle 100,000 requests per second, each 1 kB in size, looks very different from a system that is designed for 3 requests per minute, each 2 GB in size—even though the two systems have the same data throughput.”
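A small sketch of computing the median/p95 from a list of response times (the millisecond values are made up). The nearest-rank method below is one simple way to compute percentiles; real monitoring systems usually use streaming approximations such as histograms.

def percentile(samples, p):
    ordered = sorted(samples)
    index = max(0, int(round(p / 100 * len(ordered))) - 1)  # nearest-rank
    return ordered[index]

response_times_ms = [92, 110, 87, 130, 101, 95, 1500, 99, 105, 88]

print("p50 (median):", percentile(response_times_ms, 50))       # ~99 ms
print("p95:", percentile(response_times_ms, 95))                 # 1500 ms, driven by the outlier
print("mean:", sum(response_times_ms) / len(response_times_ms))  # 240.7 ms, skewed upward by the outlier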

Maintainability

  • I thought this section made a lot of obvious commentary. Skimmed and skipped most.
  • “Over time, many different people will work on the system and they should be able to work on it productively”
  • Has three parts to it. Operability, Simplicity, Evolvability
  • Operability
    • Monitoring, tracking, keeping software up to date, SOPs, security updates, following best practices, documentation
  • Simplicity
    • reduce accidental complexity by introducing good abstraction.
  • Evolvability
    • Services should be created to be independent
    • Follow microservice and 12-Factor App principles to keep them truly independent

See you in the next chapter.