Prometheus: Static/Dynamic scraping on EKS

This note is a mental model for how Prometheus discovers and scrapes metrics in Kubernetes.

The lens I want to keep throughout is:

  1. Where will the scrape config file sit?
    (Prometheus repo vs application repo)
  2. In which namespace will the serviceMonitor sit?
    (and how Prometheus finds it)

At a high level there are two ways to tell Prometheus about a /metrics endpoint:

  1. Static, via scrape_configs in the Prometheus config file.
  2. Dynamic, via a ServiceMonitor (a CRD from the Prometheus Operator) with label-based discovery.

Contents:

  1. Approach 1: Static scrape_config
    1. How question
    2. Where question
  2. Approach 2: Dynamic scraping with ServiceMonitor
    1. How question
    2. Where question
      1. (1) Watch ServiceMonitors in all namespaces
      2. (2) Watch only specific namespaces
      3. (3) Watch only the Prometheus namespace
    3. Workflow of dynamic, label-based discovery of a scrape endpoint

Approach 1: Static scrape_config:

This is the traditional way: you manually configure a scrape job in prometheus.yaml.
The natural questions here are:

  • How do I configure the scrape_config job?
  • Where will that config YAML file sit? (Prometheus repo or application repo?)

How question:

You define a scrape_configs job in the Prometheus config:

scrape_configs:
  - job_name: 'otel-collector'
    scrape_interval: 120s
    static_configs:
      - targets: ['otel-collector:9090']
    metrics_path: '/metrics'

Prometheus will then call GET http://otel-collector:9090/metrics on the configured interval.
An example in the otel-contrib repo is linked here.

Where question:

In this model the scrape config is centralized in the Prometheus deployment repo.

prometheus-deployment-repo/            #Platform team owns
├── prometheus.yaml                    # Main config with scrape_configs
├── prometheus-deployment.yaml
└── prometheus-configmap.yaml

otel-collector-repo/                   # App team owns
├── deployment.yaml
├── service.yaml
└── otel-collector.yaml
# NO prometheus.yaml here

This is a very manual contract between the application and Prometheus, and it can cause problems for Day-1 and Day-2 operations as the application evolves.
Every time the application name or port changes, the same change has to be made in the Prometheus repo as well. Missing it leads to scrape failures.

Approach 2: Dynamic scraping with ServiceMonitor

Rather than configuring targets manually, there is also the option of dynamic discovery on EKS, where Prometheus figures out on its own that a new endpoint has to be scraped.
This is done via a ServiceMonitor, which is a CRD from the Prometheus Operator. Details here

Once the ServiceMonitor CRD is present in the cluster, the Prometheus Operator then:

  • Watches ServiceMonitor resources in the EKS cluster.
  • Translates them into Prometheus scrape configs.
  • Updates the Prometheus server configuration automatically.
Note that, once the ServiceMonitor CRD is installed, many tools can simply set a ServiceMonitor flag to true in their Helm charts and start getting scraped by Prometheus (a sketch is shown below).
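
For example, many community Helm charts expose a toggle similar to the sketch below (the exact keys vary per chart; serviceMonitor.enabled and the labels shown here are assumptions, not a universal contract):

# values.yaml of an application's Helm chart (illustrative keys)
serviceMonitor:
  enabled: true            # render a ServiceMonitor alongside the Service
  interval: 120s           # scrape interval
  additionalLabels:
    prometheus: fabric     # must match the Prometheus serviceMonitorSelector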

Looking at ServiceMonitor through the same lens:

  • How do I configure a ServiceMonitor for my app?
  • Where (which namespace) should I push my ServiceMonitor to?

How question:

Define a kind: ServiceMonitor Kubernetes resource in the application repo. An example is shown below.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: otel-collector
  namespace: otel-demo
  labels:
    prometheus: fabric        # Must match Prometheus selector
spec:
  selector:
    matchLabels:
      app: otel-collector     # Selects Services with this label
  namespaceSelector:
    matchNames:
      - otel-demo
  endpoints:
    - port: metrics           # Port NAME (not number!)
      interval: 120s

So the How for dynamic scraping is: “Define a ServiceMonitor next to your Deployment and Service in your app repo.”
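
The ServiceMonitor above only works if a Service with the matching label and a named port exists. A minimal sketch of such a Service (names, namespace and port values taken from the example above):

apiVersion: v1
kind: Service
metadata:
  name: otel-collector
  namespace: otel-demo
  labels:
    app: otel-collector       # matched by the ServiceMonitor's spec.selector
spec:
  selector:
    app: otel-collector       # selects the collector Pods
  ports:
    - name: metrics           # the ServiceMonitor references this port NAME
      port: 9090
      targetPort: 9090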

Where question:

With static scrape_config, the “where” question was: which repo owns the config file? With ServiceMonitor, the YAML lives in the app repo, so the “where” question shifts to:

In which Kubernetes namespace should I deploy the ServiceMonitor so that Prometheus discovers it?

This is controlled by the serviceMonitorNamespaceSelector field in the Prometheus CRD.

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: fabric-prometheus
  namespace: prometheus-fabric
spec:
  serviceMonitorSelector:
    matchLabels:
      prometheus: fabric         # only watch ServiceMonitors with this label
  serviceMonitorNamespaceSelector: {}

Prometheus can be set up in 3 different ways to do this:

(1) Watch ServiceMonitors in all namespaces

In the example below, the Prometheus Operator watches every namespace for ServiceMonitors carrying the label prometheus: fabric.

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: fabric-prometheus
  namespace: prometheus-fabric
spec:
  serviceMonitorSelector:
    matchLabels:
      prometheus: fabric
  serviceMonitorNamespaceSelector:
    {} # Empty = all namespaces

(2) Watch only specific namespaces

In the example below, only ServiceMonitors in namespaces labeled monitoring: enabled are picked up (see the namespace sketch after the config).

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: fabric-prometheus
  namespace: prometheus-fabric
spec:
  serviceMonitorSelector:
    matchLabels:
      prometheus: fabric
  serviceMonitorNamespaceSelector:
    matchLabels:
      monitoring: enabled  # Only namespaces with this label
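
For option (2) to pick up an application's ServiceMonitor, the application's namespace has to carry that label. A minimal sketch (namespace name from the earlier example; the label key/value must match whatever your Prometheus spec selects):

apiVersion: v1
kind: Namespace
metadata:
  name: otel-demo
  labels:
    monitoring: enabled   # makes this namespace visible to serviceMonitorNamespaceSelector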

(3) Watch only the Prometheus namespace

In the example below, ServiceMonitors must live in the prometheus-fabric namespace itself (the centralized model). Leaving serviceMonitorNamespaceSelector unset (null) restricts discovery to the Prometheus object’s own namespace.

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: fabric-prometheus
  namespace: prometheus-fabric
spec:
  serviceMonitorSelector:
    matchLabels:
      prometheus: fabric
  # serviceMonitorNamespaceSelector is left unset (null),
  # which limits discovery to the Prometheus object's own namespace (prometheus-fabric)

Below is the workflow of dynamic, label-based discovery of a scrape endpoint:

In the workflow example below, Prometheus is deployed in the prometheus-namespace namespace and the application is deployed in otel-demo.
The workflow shows how the pieces tie together:
Prometheus –> ServiceMonitor –> Service –> Pods

┌─────────────────────────────────────────────────────────────┐
│ 1. Prometheus (namespace: prometheus-namespace)               │
│                                                             │
│    serviceMonitorSelector:                                  │
│      matchLabels:                                           │
│        prometheus: fabric  ◄────────────────────┐          │
│                                                  │          │
│    serviceMonitorNamespaceSelector: {}          │          │
│    (empty = watch all namespaces)               │          │
└──────────────────────────────────────────────────┼──────────┘
                                                   │
                    ┌──────────────────────────────┘
                    │ Operator watches all namespaces
                    ▼
┌────────────────────────────────────────────────────────────┐
│ 2. ServiceMonitor (namespace: otel-demo)                   │
│                                                            │
│    metadata:                                               │
│      labels:                                               │
│        prometheus: fabric  ◄─────────────────┐             │
│    spec:                                     │             │
│      selector:                               │             │
│        matchLabels:                          │             │
│          app: otel-collector  ◄──────────┐   │             │
└───────────────────────────────────────────┼───┼────────────┘
                                            │   │
                    ┌───────────────────────┘   │
                    │ Selects Services          │
                    ▼                           │
┌────────────────────────────────────────────────────────────┐
│ 3. Service (namespace: otel-demo)                          │
│                                                            │
│    metadata:                                               │
│      labels:                                               │
│        app: otel-collector  ◄────────────────┐             │
│    spec:                                     │             │
│      selector:                               │             │
│        app: otel-collector  ◄────────────┐   │             │
│      ports:                              │   │             │
│        - name: metrics  ◄────────────┐   │   │             │
│          port: 9090                  │   │   │             │
└──────────────────────────────────────┼───┼───┼─────────────┘
                                       │   │   │
                    ┌──────────────────┘   │   │
                    │ Port name match      │   │
                    │                      │   │
                    │  ┌───────────────────┘   │
                    │  │ Selects Pods          │
                    ▼  ▼                       │
┌────────────────────────────────────────────────────────────┐
│ 4. Pod (namespace: otel-demo)                              │
│                                                            │
│    metadata:                                               │
│      labels:                                               │
│        app: otel-collector  ◄────────────────┐             │
│    spec:                                     │             │
│      containers:                             │             │
│        - ports:                              │             │
│            - name: metrics                   │             │
│              containerPort: 9090  ◄──────────┼─────────────┤
└──────────────────────────────────────────────┼─────────────┘
                                               │
                    ┌──────────────────────────┘
                    │ Prometheus scrapes
                    ▼
              GET http://10.0.1.50:9090/metrics

KubeCon India 2025

I attended KubeCon India 2025, held in Hyderabad this year. I mainly attended talks related to Observability and scalable designs.

Now that the sessions are uploaded to YouTube, I am linking the ones I really enjoyed.

  • Observability at Scale With Monitoring as Code: Grafana, Prometheus, & Tempo – Vipin GopalaKrishnapillai & Saiabhinay Bommakanti, Amway Global – link
  • Predictable autoscaling with KEDA – link
  • Observability – tenant centric metrics – link
  • Building observability platform for Edge compute nodes – link
  • [I really liked the way the idea was presented on this] – Auto instrumentation for GPU performance – link
  • Enhancing DNS Reliability in Kubernetes with NodeLocal DNSCache – link

The entire playlist of all the talks is available here:
https://www.youtube.com/watch?v=hMn8qiLSE0A&list=PLj6h78yzYM2MEQTMX_LIOK1hrePHxLD6U

Design Philosophy: Observability

I enjoy philosophy. Stoic philosophy in particular.
Philosophy, I think, helps us revalidate our purpose. It acts as a yardstick and makes sure that we are not moving away from our first principles.

Applying the same to Software Engineering: in my opinion, every team should have a “Design Philosophy” – that one yardstick which a team can use for making better decisions.
In fact, it is already done in some form in a few cases. Some call it Guiding Principles. Some call it MVPs. I call it a “Design Philosophy”.

The core idea is – Whenever a decision has to be made, if it is passed through this “Design Philosophy”, it should produce the same result, irrespective of who is making that decision.

As Engineers, we like equations, formulas and non-ambiguous ways of thinking. A written form of these Design Philosophies does a lot of good for teams in making the right decisions at a good pace. It is unfair to expect all Engineering teams to use the same yardstick, so each team should write its own.

Below is my (opinionated) version for an Observability Engineering team.

  1. Low latency is an important feature for Observability signals. Ingested observability data should be available to users as early as possible.
  2. The Observability tooling is the torch in the dark. High reliability is a must. It cannot fail when the Platform / Application fails.
    • which means the Observability stack CANNOT fail when applications fail
    • which means, ideally, the Observability stack shouldn’t run entirely on the same platform as the applications it observes
    • which means Observability vendors (buy decisions) are not a bad choice. The choice of a vendor should be cost-effective for the Org.
    • for the O11y solutions that we decide to build in-house, isolation is key.
  3. Our O11y stack should support availability, reliability and performance “cost-effectively” at scale.
  4. All the tools that we build and maintain should be vendor agnostic (sdks, collectors, refinery etc).
  5. The rate of decay of data value is fast in O11y. People care much more about the last 1 hour / 1 day of O11y data than about the last 1 month.
  6. When we opt into optimising cost in observability, it usually results in having more than one tool. While we can have different tools, we shouldn’t have multiple tools which do the same thing. Example:
    1. Metrics → Prometheus, Traces → Jaeger (Fine)
    2. Logs → ELK, Logs → Splunk (NOT-Fine)
  7. Tools change. The tools that we have today for a specific function, might change to something else in a year or two. Observability team should strive to make the change least disruptive.
  8. There is a clear viewpoint on “what kind of observability signal has to go where”. (Details on this here) Example:
    1. count –> metrics
    2. time –> trace
    3. high cardinality –> log

These are the elements that I use when making an Observability decision. They might vary for a different team in a different situation. But the point I am really trying to make is: have a design philosophy that makes decision-making easier.

When to Emit What O11y Signal?

The intention of this page is to put together Observability signal guidelines which provide the required visibility into systems without hurting the cost aspect of the solution.

Three basic observability signals that any application emits are:

  • Metrics,
  • Traces and
  • Logs

The general question is – When to emit what signal?

The answer lies in the intent behind the signal being emitted. What do you intend to measure with the Observability signal that you are emitting?

Below is a rule of thumb which can help answer this.


Rule of thumb:

Metrics:

If you want to measure anything as a count, metrics are the best way to do it. For any question that starts with “How many ….”, metrics are a good choice.

  • some example measures:
    • number of documents processed
    • throughput of an application
    • number of errors
    • Kafka lag for a topic

Note: Please be careful not to include high-cardinality tags on metrics.

Traces:

If you want to measure anything as an element of time, it should be a trace signal.

  • some examples:
    • end to end time of a document through an app (trace)
    • time taken by a part of a transaction (span)
    • anything that needs high cardinality tags

Note: Traces are sampled, but sampling is not a bad thing. With time as the unit of measure in traces/spans, a trace will still show when something is slow, though it might miss the peak (max) values by a small margin.

The graph below shows that sampling will not miss the slowness seen in latencies.

Logs:

If you want to emit signals of high cardinality and don’t want them sampled, logs are your friend. High cardinality here means attributes like documentId, gcid etc., where we are measuring things at the smallest entity.

  • some examples:
    • time taken for processing per request-id
    • tracking the flow path of a request with attributes like request-id, attachment types etc.

Logs have a few advantages as observability signals:

  • with a custom SDK (or the OTel SDK), you can emit logs with the least boilerplate code.
  • with logs being structured via an SDK, there is scope for building post-processors on top of them
  • AI capabilities are planned on top of logs, if they are emitted via an SDK.

Emitting logs in debug mode for a long duration of time is not what high cardinality means, and it should be avoided.


Below is a summary table on when to emit what Observability signal:

Signal   When to use?                                           Retention
Metric   On measuring a count signal                            Long (few months)
Trace    On measuring a time signal                             Short (few weeks)
Log      On measuring a high-cardinality, non-sampled signal    Super short (few days)

If you notice closely, as the attributes on an O11y signal increase (the tags/metadata associated with it), the signal becomes more useful for understanding the state of the system. But at the same time, it increases the cost of that signal.
So it is a natural effect that the retention of an O11y signal decreases as the cardinality of its metadata increases.

This has worked magically well: it doesn’t compromise on the context of an O11y signal (attributes/tags etc.), and at the same time it takes care of the cost aspect.

Enhancing Observability with OTel Custom Processors

Observability is crucial for modern distributed systems, enabling engineers to monitor, debug, and optimize their applications effectively. OpenTelemetry (Otel) has emerged as a comprehensive, vendor-neutral observability framework for collecting, processing, and exporting telemetry data such as traces, metrics, and logs.

This blog post will explore how custom processors in OpenTelemetry can significantly enhance your observability strategy, making it highly customizable and powerful.

Here is the repo where I have implemented a very simple Otel custom processor:
https://github.com/AkshayD110/otel-custom-processor/tree/master

Quick Introduction to OpenTelemetry (Otel)

OpenTelemetry simplifies observability by providing a unified approach to collect, manage, and export telemetry data. By standardizing telemetry practices, it bridges the gap between applications and observability tools, making it easier to understand complex systems.

Core OpenTelemetry Components

OpenTelemetry mainly comprises:

  • Exporters: Send processed telemetry data to monitoring and analysis systems.
  • Collectors: Responsible for receiving, processing, and exporting telemetry.
  • Processors: Offer the ability to manipulate, filter, and enrich telemetry data between receiving and exporting.
  • SDKs: Libraries to instrument applications and produce telemetry.

Refer to the official OpenTelemetry documentation for more details.

Building a Custom Processor with OpenTelemetry

Custom processors are powerful because they allow you to tailor telemetry data processing exactly to your needs. The simplicity of creating custom processors is demonstrated in this custom processor GitHub repository.

This repository demonstrates building a simple metrics processor that implements the Otel processor interface. Specifically, the provided example logs incoming metrics to the console, illustrating how straightforward it is to start building custom logic.

Here’s the essential snippet from the repo:

// ConsumeMetrics is invoked for every batch of metrics flowing through the pipeline.
func (cp *CustomProcessor) ConsumeMetrics(ctx context.Context, md pdata.Metrics) error {
	// Custom logic goes here (the repo example logs the incoming metrics).
	// Finally, forward the (possibly modified) metrics to the next consumer in the pipeline.
	return cp.next.ConsumeMetrics(ctx, md)
}

You can review the detailed implementation here.

This example serves as a foundational step, but you can easily enhance it with more complex functionality, which we’ll discuss shortly.

Integrating Your Custom Processor into OpenTelemetry Collector

Integrating your custom processor involves a few straightforward steps:

  1. Clone the OpenTelemetry Collector Contrib repository.
  2. Update the go.mod file to reference your custom processor package.
  3. Register your processor within the collector configuration.
  4. Rebuild the collector binary (e.g., using make build).
  5. Create a Docker image that includes your custom collector.

Note that you have to build the custom processor together with the other OTel components into a single collector binary; it cannot be built and deployed independently. See the pipeline sketch below for how it is wired in.
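
For reference, a minimal sketch of a collector config that wires such a processor into a metrics pipeline (the key customprocessor is whatever name your processor factory registers; the receiver and exporter choices are purely illustrative):

# collector-config.yaml (illustrative)
receivers:
  otlp:
    protocols:
      grpc:

processors:
  customprocessor:        # assumed name registered by the custom processor factory
  batch:

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [customprocessor, batch]
      exporters: [prometheus]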

Practical Uses of Custom OpenTelemetry Processors

Beyond the simple logging of metrics shown above, custom processors unlock numerous powerful use cases. Here are some practical examples:

1. Metric Filtering

Filter telemetry data selectively based on criteria like metric names, threshold values, or specific attributes, helping reduce noise and operational costs. You get to control what goes to the Observability backend.

2. Metric Transformation

Transform metrics to standardize data units or restructure attributes, making your monitoring data consistent and meaningful.

3. Aggregation

Aggregate metrics across various dimensions or intervals, such as calculating averages or rates, to generate insightful summaries.

4. Enrichment

Augment metrics with additional metadata or context, aiding quicker diagnosis and richer analysis, for example by adding group names and tags.

5. Alerting

Embed basic alerting logic directly into your processor, enabling rapid response when thresholds are breached.

6. Routing

Route specific metrics to distinct processing pipelines or different monitoring backends based on defined attributes, enhancing management and optimization.

7. Caching

Cache telemetry data temporarily to enable sophisticated analytical operations like trend analysis or anomaly detection. This can be further extended to build a transformation layer.


Conclusion:

OpenTelemetry custom processors offer exceptional flexibility, enabling personalized and efficient telemetry management. By incorporating custom logic tailored to your specific needs, you unlock deeper insights and enhance your overall observability.

Explore the custom processor repository today and start customizing your observability strategy!

Memory management : Java containers on K8s

This page documents a few aspects of memory management on Java containers on K8s clusters.

For Java containers, memory management on K8s depends on several factors:

  • Xmx and Xms limits managed by java
  • Request/limit values for the container
  • HPA policies used for scaling the number of pods

Misconfiguration or misunderstanding of any of these parameters leads to OOMs of Java containers on K8s clusters.

Memory management on java containers:

  • -XX:+UseContainerSupport is enabled by default from Java 10+.
  • -XX:MaxRAMPercentage is the JVM parameter that specifies the percentage of the container's memory limit that can be used by heap space. The default value is 25%.
  • Example: if -XX:MaxRAMPercentage=75 and the container memory limit is 3GB, then:
    • -Xmx = 75% of 3GB = 2.25GB
  • Important point to note: MaxRAMPercentage is calculated on limits and not requests.

K8s : requests/limits:

  • as shown above, the JVM heap sizing for the container is based on the values set in the limits configuration
  • However, for HPA to kick in, Kubernetes uses requests.
  • Example (see the HPA sketch after this list):
    • If you configure HPA for memory utilization at 70%, it calculates usage as:
    • Memory Usage % = (Current Usage / Requests) * 100
    • (1.8GB / 2GB) * 100 = 90% – which results in HPA kicking in
    • i.e., scaling happens based on usage relative to the requests configuration
  • If requests.memory is set low (2GB) and limits.memory is high (3GB), HPA may scale aggressively because it calculates usage based on requests, not limits.
  • The only advantage of setting limit > request is: if non-heap memory usage grows, it will not crash the JVM immediately. That is a less likely case compared to running out of heap space.
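
A minimal HPA sketch matching the example above (the Deployment name java-app and the replica counts are assumed for illustration):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: java-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: java-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70   # utilization is computed against requests, not limits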

Ideally, to make things simpler, set “requests = limits” for memory on the Java container, based on the historic usage of the application. This simplifies the Xmx, requests, limits and HPA math. A sketch of such a container spec is below.
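
A minimal sketch of a Deployment following this recommendation (the image name and sizes are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: java-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: java-app
  template:
    metadata:
      labels:
        app: java-app
    spec:
      containers:
        - name: java-app
          image: example/java-app:latest   # illustrative image
          env:
            - name: JAVA_TOOL_OPTIONS
              value: "-XX:MaxRAMPercentage=75"   # heap = 75% of the 3Gi limit = ~2.25Gi
          resources:
            requests:
              memory: "3Gi"    # requests = limits keeps the HPA and -Xmx math simple
            limits:
              memory: "3Gi"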

For scaling apps, there is always HPA, which can increase the number of pods based on usage.


Conclusion

  • for Java containers on K8s, know the memory needs of your app and set “requests = limits”
  • use HPA for scaling and do not depend on “limits > requests” for memory headroom
  • containers: run them small and run them many (via rule-based scaling)

Observability – Metrics Madness

There isn’t a single book or article on observability (O11y) without a mention of MELT (Metrics, Events, Logs, Traces).
While these four are the building blocks of telemetry data in Observability, the four haven’t evolved at the same rate.

In this write-up, I delve into metrics in observability (mainly custom metrics) and argue how the overuse of metrics is turning into madness and directly impacting the cost of observability.


State of Metrics in Observability

Metrics are the core entity of O11y in any application or platform component.
For monitoring software, metric-based dashboards, SLOs, and alerts are far more common and are considered the current standard.
Most of the popular O11y vendors/solutions available in the market, like Prometheus, Datadog, Chronosphere, etc., are predominantly metric-heavy solutions.
Thanks to OTel, a standard is falling into place for metrics, and we are no longer tied to a single observability vendor for life.

Closer look at Metrics

When we speak about metrics, they generally fall into two categories:

  • Host metrics
  • Custom metrics

Host Metrics

Host Metrics are the system metrics that reflect the state of the system. These metrics help Engineers understand the Utilization and Saturation of the system. For example:

  • CPU : % used – Utilization, Load average – Saturation
  • Memory : Heap usage – Utilization, Swap usage – Saturation
  • Disk : Used space – Utilization, IOPS usage – Saturation
  • Network : Bandwidth used

Note that in all the above cases the Application Engineering team doesn’t have to generate any of these metrics. They are mostly system-based (kube-state-metrics) or agent-based, provided by a vendor (datadog-agent).

Metrics are traditionally stored in a Time Series Database (TSDB), i.e., data points collected and stored against time.
Any aggregation that has to be performed on metrics is precomputed: aggregation happens at write time, not query time. Meaning, if you are interested in the 95th percentile of your metric, the aggregation needs to happen while writing the data to the TSDB. These aggregations are generally configurable. Here is the case of Datadog, where you can define the percentile aggregation that you want for all your metrics.

NOTE: In the context of Datadog, any new aggregation added is treated as a custom metric and you will be charged for it. More on that later.

The important point here is that all aggregations happen at the write level, and not the read level for metrics in a TSDB.

Custom metrics

This is the section on which I want to elaborate.
These are the metrics which are emitted from within the application code. Custom metrics in an application always go through a journey.
No metrics –> Not enough metrics –> Who added these metrics? –> Remove these Metrics

The above journey is a mess mainly because of:

  • lack of proper review when adding metrics
  • lack of understanding of the cost impact of adding metrics (they can be really expensive; more below)
  • lack of value-proposition thinking when adding a metric

Custom metrics are typically trying to measure either :

  • the count of an event (example: items processed, failures occurred) or
  • unit of time taken for a transaction (example: latency)

In my opinion, the best kind of custom metrics are those which are customer-experience centric. Any metric that is added just to measure a new feature should always be temporary in nature. I also have a strong opinion on the use of custom metrics for measuring latency: I believe anything that we want to measure as a unit of time (latency, response time) should be emitted as a trace signal rather than a metric. This leads us into the Metrics vs Traces debate.

Metric vs Traces

Metric vs Trace for measuring latencies.
Let’s consider the use case of taking a long flight which has two stopovers.

If measuring the total travel time is the use case, it is so much easier to emit a trace for the end-to-end time. You can then also break the time down into TravelA, TravelB, TravelC and terminal wait times using spans within the trace.
Now imagine emitting metrics for this. You will end up emitting 5 independent metrics, which you then have to stitch together in some dashboard. It would be an absolute pain.

Some people would ask, “But what about sampling of traces?”
Yes, traces are sampled, but so what? With a latency signal, you generally want to know whether a transaction was slow, or whether it took more than the usual amount of time during some part of the day. Even with sampling on the traces, you will be able to see the slowness for a given period of time. See the case below, which has two plots with 50% sampling: you can still see the high latency on the sampled requests.

Also, traces are mostly less expensive, depending on the tool you use. For the above reasons, I am an advocate for traces as the signal for any latency/response-time measurement.

Cost of Custom metrics

Going back to custom metrics: they can get really expensive really quickly if a close watch is not kept on them. The billing for custom metrics is not very straightforward to understand.
For example, the Datadog custom metrics billing page here could easily leave you wanting more coffee.

At the crux of it, custom metrics billing depends on :

  • number of unique metrics that are emitted
  • number of unique tags that are associated with metrics

Since metrics are (generally) stored in a Time Series Database (TSDB), custom metrics are always plotted as a function of time. When stored in a TSDB, there is a unique entry per epoch timestamp and per unique combination of attributes (tags) emitted on the metric (as seen in the Prometheus dump file below).

To explain in simpler terms, consider the example below of a request.latency metric which is emitted from 2 hosts, measured from 3 endpoints, and collecting two unique status codes.

source: datadog page

This single request.latency custom metric leads to 4 unique metrics (one per unique tag combination):

  • host:A, endpoint:X, status:200
  • host:B, endpoint:X, status:200
  • host:B, endpoint:X, status:400
  • host:B, endpoint:Y, status:200

Now let’s say you have 2000 hosts, you want latency from 4 endpoints and you include 5 different status codes. That would be: 2000 x 4 x 5 = 40,000 custom metrics from the single request.latency metric above.

On top of it, if you emit a Histogram metric instead of a Count metric, that would include “max, median, avg, 95th percentile, count” for each metric. That means 5 times more metrics being emitted: 40,000 x 5 = 200,000 custom metrics from the single request.latency metric for the histogram type.
More on metric types here

This combination of metrics and tags leading to unique entries in a TSDB is called high cardinality. It leads to an exponential increase in the cost of observability.

In the case of Datadog, 100-200 custom metrics are included per host, above which they charge between $1-5 per 100 custom metrics based on the plan. Let’s assume a standard, matured product has 100 custom metrics (in reality they are always in the 1000s). Rough math extrapolating the calculation above leads to 20 million custom metrics. With the lowest pricing model, that is a $20,000/month bill.

The point I am trying to make here is: if custom metrics are added without giving them enough thought, or if an incorrect metric type (link) is used, it can get really expensive. If you use the wrong data model, you will suffer.


So what do we do with Metrics?

Custom metrics are decades-old, well-vetted signals in observability. I am not trying to make a case for moving away from them altogether. However, if used incorrectly or overused for every case, they can really come back and hurt us.
A few things that I recommend as takeaways (a relabelling sketch follows the list):

  • when measuring latency/response-time as a signal, always lean towards Traces. Don’t worry too much about sampling; it doesn’t matter, as shown above.
  • use the right form of metrics. If you use a Histogram instead of a Count without needing it, you are emitting 5 times more metrics.
  • use tags very cautiously on metrics. Avoid high-cardinality tags like epoch times, request IDs, keys and UUIDs on metrics.
  • be very mindful about the retention of these metrics. If you want larger retention, you will end up building a data-management team instead of an Observability team.
  • periodically review the metrics that are present in your system. If a metric is not part of any dashboard, alert or monitor, you probably don’t need it. Remove it.
  • monitor your custom metric usage on your observability solutions. Have alerts and anomaly detection for this. Watch it very closely, like “there is no tomorrow”. You will be surprised by how bad it gets in no time if you lose sight of this.
  • many times you might want to ingest tags on metrics but not always index them (context is what gives data meaning and power). You might want to index them only when absolutely needed. This means you cannot group a metric by “ingested” high-cardinality tags, but it will reduce the cost multiple-fold.
  • poll less often or increase the window size for “not so critical” services
  • if you have the resources and bandwidth, build a transformation layer before storing the metric. Use the data in the transformation layer for all alerting and plotting. Aggregate and store less data for long-term retention. But this is a larger effort.
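
As one concrete example of the “avoid high-cardinality tags” and “remove unused metrics” points above, here is a sketch using Prometheus metric_relabel_configs (the job, label and metric names are illustrative; other backends have equivalent controls):

scrape_configs:
  - job_name: 'my-app'
    static_configs:
      - targets: ['my-app:9090']
    metric_relabel_configs:
      # Drop a high-cardinality label before ingestion
      - action: labeldrop
        regex: request_id
      # Drop a metric family that no dashboard or alert uses
      - source_labels: [__name__]
        regex: 'myapp_debug_.*'
        action: drop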

Most of the suggestions above are from my personal experience where I have implemented these changes. Happy to hear your thoughts in the comments below.