A few short thoughts / ideas wrt of Performance centric product.
- In this world of infinite scaling of computes, pay close attention to common choke points. Like DB, storage(s) etc, which are shared by all the computes.
- Majority of the reads and writes have to happen in Bulk operations and NOT as single read/writes. Specially when there are 100’s-1000’s of reads/writes/deletes on storage(s).
- Threads. Pay close attention to which part of the entire flow is multi-threaded. Sometimes, only a small part of the flow is multi-threaded, but entire application is called multi-threaded, which is wrong.
- Don’t just look at throughput for certifying if an application is performing good / bad.
Scenario 1: Throughput:100docs/sec, Object size:100Kb
Scenario 2: Throughput:50docs/sec, Object size:200Kb
In both the cases about the amount of data transferred on network is 10,000Kb/sec (9.7Mb/sec). So pay attention to network stats and size of objects as well. - Databases. Make sure important collections have indexes. (There is so much more to DB tuning. This could be starting point)
- Introduce local cache on computes wherever possible to avoid network calls to DB’s. Be 100% sure on your logic on when to invalidate the cache. If we don’t invalidate the cache and use the stale cached data, that could lead to incorrect transactions.
- Failures are important. Pay close attention to reducing the failures. If something has to fail, then fail early on its compute cycle, so that CPU cycles are not wasted on failures.
- When you do retries due to failures in your system, make sure to have logarithmic back offs. Meaning, if a message failed, 1st retry after 1min, 2nd retry after 2mins, 3rd after 4mins etc. This give system time to recover. Also retry only for finite number of times.
- Pay very close attention on logic for when to mark a failure as Temporary vs Permanent. If a failure which should be permanent is wrongly marked as temporary, then the request keeps coming back to the compute via retries and wastes the CPU cycles.
- Timeouts and Max connection pool sizes! Make a note of all the timeout that are set across talks between different fabrics / api calls in your application. Also, see the configuration for connections pools used in the application. Both might result in spike of failures if crossed.
[Update:] - Redundancy, Isolation and Localization are the core of any reliable system.
Redundancy in the service to handle any failures.
Isolation to break transactions in to independent smaller services.
Localization so that you have all the required compute and data locally in the service with minimum/none external dependency.
Happy to converse about any in the comments.