[DDIA Book]: Reliable, Scalable and Maintainable Applications

[Self Notes and Review]:

This is a new series of posts where I am publishing my own notes and extracts from reading the very famous book – DDIA (Designing Data-Intensive Applications) by Martin Kleppmann.

This particular article is from the first chapter of the book. Again, these are just my notes and extracts, so treat this more like an overview/summary. The best way is to read the book itself.

Side note: I am a terribly slow and repetitive reader, so updates between chapters might take weeks.

Reliable, Scalable and Maintainable Applications

  • CPU power is not a constraint any more in computing. CPUs these days are inexpensive and powerful.
  • General problems these days are complexity of data, amount of data and rate at which the data changes.
  • Below are the common functionalities of a data intensive application
    • Store data so that they, or another application, can find it again later (databases)
    • Remember the result of an expensive operation, to speed up reads (caches)
    • Allow users to search data by keyword or filter it in various ways (search indexes)
    • Send a message to another process, to be handled asynchronously (stream processing)
    • Periodically crunch a large amount of accumulated data (batch processing)
  • [not imp but note]: “Although a database and a message queue have some superficial similarity—both store data for some time—they have very different access patterns, which means different performance characteristics, and thus very different implementations.”
    • “there are datastores that are also used as message queues (Redis), and there are message queues with database-like durability guarantees (Apache Kafka). The boundaries between the categories are becoming blurred.”

Reliability

  • “The system should continue to work correctly (performing the correct function at the desired level of performance) even in the face of adversity (hardware or software faults, and even human error).”
  • “The things that can go wrong are called faults, and systems that anticipate faults and can cope with them are called fault-tolerant or resilient.”
  • “Note that a fault is not the same as a failure [2]. A fault is usually defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user”
  • Every software system must be designed to tolerate certain kinds of faults rather than trying to prevent every one of them – but some kinds of failures are better prevented, for example security-related failures.
  • Hardware Faults
    • hard disk crashes, faulty RAM, power grid failures, someone unplugging the wrong network cable
    • “Hard disks are reported as having a mean time to failure (MTTF) of about 10 to 50 years [5, 6]. Thus, on a storage cluster with 10,000 disks, we should expect on average one disk to die per day.” (A quick back-of-the-envelope check of this math follows this list.)
    • For hardware faults, the first solution is to build in redundancy so that the failure of one hardware component can be handled – e.g., by having replicas. Beyond that, software can be made tolerant of hardware failures, for example by making the system read-only when more than 2/3 of the nodes are down.
    • “On AWS it is fairly common for virtual machine instances to become unavailable without warning [7], as the platforms are designed to prioritize flexibility and elasticity over single-machine reliability.”
  • Software Faults
    • There can be systematic errors in the system that cause all the nodes of a cluster to go down as a ripple effect. Example: node 1 of a DB cluster dies, and all the heavy queries that killed node 1 are now shifted to node 2. The cluster now has one node less but has to deal with all the load – leading to the failure of other nodes too.
    • “The bugs that cause these kinds of software faults often lie dormant for a long time until they are triggered by an unusual set of circumstances.”
    • “There is no quick solution to the problem of systematic faults in software. Lots of small things can help: carefully thinking about assumptions and interactions in the system; thorough testing; process isolation; allowing processes to crash and restart; measuring, monitoring, and analyzing system behavior in production.”
  • Human Errors
    • “Even when they have the best intentions, humans are known to be unreliable”
    • One study found that configuration errors by operators were the leading cause of outages, whereas hardware faults played a role in only 10–25% of outages.
    • Some ways to consider in design
      • “Design systems in a way that minimizes opportunities for error. For example, well-designed abstractions, APIs, and admin interfaces make it easy to do “the right thing” and discourage “the wrong thing.” However, if the interfaces are too restrictive people will work around them, negating their benefit, so this is a tricky balance to get right.”
      • A staging environment where people can try things, explore, and fail safely
      • Test thoroughly
      • Make recovery easy – rolling back should always be fast
      • “Set up detailed and clear monitoring, such as performance metrics and error rates. In other engineering disciplines this is referred to as telemetry.”
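
A quick back-of-the-envelope check of the disk-failure math quoted above. The MTTF figures come from the book's quote; the little script is just my own illustration:

```python
# Expected disk failures per day in a cluster, assuming each disk has a
# mean time to failure (MTTF) of `mttf_years` and failures are independent
# and spread roughly evenly over time.

def expected_failures_per_day(num_disks: int, mttf_years: float) -> float:
    return num_disks / (mttf_years * 365)

for mttf in (10, 50):
    rate = expected_failures_per_day(10_000, mttf)
    print(f"MTTF {mttf} years -> ~{rate:.2f} disk failures/day")

# MTTF 10 years -> ~2.74 disk failures/day
# MTTF 50 years -> ~0.55 disk failures/day
# i.e. roughly "one disk a day" for a 10,000-disk cluster, as the book says.
```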

Scalability

  • “As the system grows (in data volume, traffic volume, or complexity), there should be reasonable ways of dealing with that growth.”
  • One common reason for degraded performance of an application is higher load than the system was designed for – e.g., the application is handling more users or more data than it did before.
  • Questions to consider during the design of a scalable application: “If the system grows in a particular way, what are our options for coping with the growth?” and “How can we add computing resources to handle the additional load?”
  • Consider Twitter's system design solution for scalability
    • Twitter has two main operations –
      • (1) Post a Tweet – (4.6k requests/sec on average, over 12k requests/sec at peak)
      • (2) Pull the timeline – (300k requests/sec).
    • So, most of the operations are around pulling the timeline, i.e., reading tweets. Twitter's challenge is not around handling the number of people who tweet, but around the number of people who read and pull those tweets on their timelines.
    • There are two ways to implement the solution.
      1. every time someone tweets, write it to a DB. When followers pull their timeline, look up everyone they follow and fetch those tweets from the DB.
      2. every time someone tweets, deliver it to all their followers, more like mail – keep it in each user's timeline cache. So when followers pull their timelines, the tweets come from their cache instantly.
    • Option 2 is more effective because the number of people who tweet is small compared to the number of people who pull the timeline. But this means there is more work at tweet time – a second-order effect.
      • let's say I have 10 million followers. When I tweet, the system has to update the caches of all 10 million followers, so that when they pull their timelines, the tweet is already there.
      • to avoid that, a hybrid model can be followed: if a user has more than, let's say, 5 million followers, don't fan their tweets out to every follower's cache. So when a user pulls the timeline, use both option 1 and option 2 and merge the results based on the people they follow. (A rough sketch of both approaches follows this list.)
  • Average response times – and why you should avoid “average” in general.
    • Averages are the worst: they take into account all the outliers, which skews the reported number, and an average doesn't tell you how many users actually experienced the delay.
    • Average is nothing but the arithmetic mean. (given n values, add up all the values, and divide by n.)
    • “Usually it is better to use percentiles. If you take your list of response times and sort it from fastest to slowest, then the median is the halfway point: for example, if your median response time is 200 ms, that means half your requests return in less than 200 ms, and half your requests take longer than that”
      • note: the median is the same as p50 – the 50th percentile
    • “if the 95th percentile response time is 1.5 seconds, that means 95 out of 100 requests take less than 1.5 seconds, and 5 out of 100 requests take 1.5 seconds or more” (a small sketch of computing percentiles follows this list)
    • also, latency and response time are not the same. Response time is what the client sees (processing time + network time + client render time). Latency, however, is the time the request spends waiting to be served. Latent – awaiting service.
    • more on percentile – here
  • Two imp questions to answer during performance testing
    • If I increase the load without increasing the system resources, how is the performance of the system affected? Is it usable at all?
    • How much of the resources and what all services have to be scaled when the load increases, so that the performance of the application is not degraded?
  • “a system that is designed to handle 100,000 requests per second, each 1 kB in size, looks very different from a system that is designed for 3 requests per minute, each 2 GB in size—even though the two systems have the same data throughput” (both work out to roughly 100 MB/s)
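
To make the Twitter example above concrete, here is a minimal toy sketch of fan-out on read vs fan-out on write and the hybrid of the two. It uses in-memory dicts as stand-ins for the real database and caches, ignores ordering and deletion, and the 5-million-follower threshold and all names are made up for illustration:

```python
# Toy sketch: fan-out on read vs fan-out on write, with a hybrid for "celebrities".
from collections import defaultdict

tweets = defaultdict(list)      # author -> list of their tweets (global store)
timelines = defaultdict(list)   # user -> precomputed home-timeline cache
followers = defaultdict(set)    # author -> set of followers
following = defaultdict(set)    # user -> set of authors they follow

CELEBRITY_THRESHOLD = 5_000_000  # made-up cutoff: too many followers to fan out

def post_tweet(author, text):
    tweets[author].append(text)
    # Fan-out on write: push the tweet into every follower's timeline cache,
    # unless the author has so many followers that the write cost is too high.
    if len(followers[author]) < CELEBRITY_THRESHOLD:
        for user in followers[author]:
            timelines[user].append((author, text))

def home_timeline(user):
    # Start from the precomputed cache (fan-out on write)...
    result = list(timelines[user])
    # ...and merge in celebrity tweets at read time (fan-out on read).
    for author in following[user]:
        if len(followers[author]) >= CELEBRITY_THRESHOLD:
            result.extend((author, t) for t in tweets[author])
    return result
```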

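And a small sketch of computing percentiles from a list of response times, as discussed above. This is my own illustration using the simple sort-and-index (nearest-rank) method; real monitoring systems typically use streaming approximations instead:

```python
import math

def percentile(response_times_ms, p):
    """Return the p-th percentile using the nearest-rank method."""
    ordered = sorted(response_times_ms)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]

samples = [30, 42, 45, 51, 200, 210, 220, 250, 1500, 3000]  # made-up times in ms

print("p50:", percentile(samples, 50), "ms")      # median: 200 ms
print("p95:", percentile(samples, 95), "ms")      # 3000 ms (with only 10 samples, p95 = max)
print("p99:", percentile(samples, 99), "ms")      # 3000 ms
print("avg:", sum(samples) / len(samples), "ms")  # 554.8 ms, dragged up by the outliers
```
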
Maintainability

  • I thought this section made a lot of obvious commentary. Skimmed and skipped most.
  • “Over time, many different people will work on the system and they should be able to work on it productively”
  • Has three parts to it: Operability, Simplicity, Evolvability
  • Operability
    • Monitoring, tracking, keeping software up to date, SOPs, security updates, following best practices, documentation
  • Simplicity
    • reduce accidental complexity by introducing good abstractions.
  • Evolvability
    • services should be built to be independent of each other
    • follow microservice and 12-Factor app principles to keep them truly independent

See you in the next chapter.

3 thoughts on “[DDIA Book]: Reliable, Scalable and Maintainable Applications”

  1. Love it. Almost felt like I read the chapter too :).

    In defence of avg though, sometimes it is helpful to tune out the outliers so you can get a sense of the trend. For ex with increasing load how the avg time rises up. I find the percentile too jumpy in such a scenario.
