Metric Types – When to use what?

This writeup covers the different metric types an application can emit as telemetry: when to use each type, and the use cases each one serves.

Metrics are one of the oldest forms of telemetry. Many APM solutions base their billing model on the number of metrics you send them. Yet it is often unclear to app devs what kinds of metrics to emit from applications. In fact, the metric types themselves can be confusing at times.

Below, we go into the details of each metric type: when to use it, and how it affects billing in your observability solution. Note that the metric types described below are generic in nature. “Distribution” is a special metric type supported by certain tools like Datadog, but it is not a native metric type in OpenTelemetry.

Types of metrics:

Generally, there are four types of metrics that most APM tools support.

  • Count
  • Gauge
  • Histogram
  • Distribution

Count:

A counter metric tracks the cumulative number of occurrences of an event. It is a form of telemetry that only goes up (or resets to zero on restart).
It is used to answer questions like:

  • How many errors have occurred in my system today?
  • How many messages have we archived today?
    In general, use this when you want to count discrete events: HTTP requests, exceptions thrown, messages consumed from Kafka, retries attempted.
    The value you pass from application code is “how many just happened”, not a running total.
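The “delta, not running total” semantics above can be sketched as follows. `CounterBuffer` is a hypothetical stand-in for what a StatsD-style agent does with counter values within a flush window, not a real library API:

```python
# A minimal sketch of counter semantics: application code reports
# "how many just happened" (a delta), and the aggregation layer sums
# the deltas within each flush window, then starts fresh.

class CounterBuffer:
    def __init__(self):
        self.total = 0

    def increment(self, delta=1):
        # The caller passes the delta, never a running total.
        self.total += delta

    def flush(self):
        # At each flush interval, the window's accumulated count is
        # emitted and the buffer resets for the next window.
        value, self.total = self.total, 0
        return value

errors = CounterBuffer()
errors.increment()        # one error just happened
errors.increment(3)       # a batch of three errors
print(errors.flush())     # -> 4 (the window's total)
print(errors.flush())     # -> 0 (fresh window)
```

With a real client such as DogStatsD, the equivalent application-side call is an increment with a delta; the summing happens out of sight in the agent.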

Gauge:

A gauge metric is a point-in-time snapshot of a value that can go up or down. It helps answer “What is X right now?”; only the last value reported is of interest.
It is used to answer questions like:

  • What is the current Kafka consumer lag?
  • How many active DB connections are in use?
    In general, gauges measure current states: queue depth, system usage (JVM heap used), thread pool size, etc. Note that you always report the measurement itself, never a delta.
    It is important to note that if the value changes multiple times within the flush interval (generally 10 seconds), only the last value survives. A gauge is always flushed as a single value, never as an array of values.
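The “last value wins” behavior above can be sketched like this. `GaugeBuffer` is a hypothetical illustration of the window semantics, not a real client API:

```python
# A minimal sketch of gauge semantics: the application reports the
# current value (not a delta), and if several reports land within one
# flush window, only the last one survives.

class GaugeBuffer:
    def __init__(self):
        self.last = None

    def set(self, value):
        # Each report overwrites the previous one within the window.
        self.last = value

    def flush(self):
        # One scalar per window; earlier reports in the window are lost.
        return self.last

heap_used = GaugeBuffer()
heap_used.set(512)   # MB, reported at t=1s
heap_used.set(640)   # t=4s
heap_used.set(480)   # t=9s
print(heap_used.flush())  # -> 480, the only value the backend sees
```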

Histogram:

A histogram captures the statistical distribution of values over a time interval.
Histograms need a little explanation of how they work. Let's say the flush interval is 10 seconds; this is the interval at which telemetry is flushed to the Datadog agent. All raw values that arrive within that 10-second window go into an in-memory buffer. At every flush interval, the Datadog agent computes aggregations over these values (avg, median, percentiles, max, count) locally on the host before sending them to Datadog.
This is used to answer questions like:

  • What is the p95 latency of my requests?
  • What is the payload size distribution of incoming messages?
    In general, histograms measure the shape of the distribution of the telemetry values.

A concrete example probably makes histograms easier to understand.
Consider a metric put-object-time-taken, emitted as a histogram from the code. Let's say the set of values below is emitted for the metric within one 10-second flush window.
Values received: [45, 120, 30, 88, 200, 15, 67, 150, 42, 95]
The Agent sorts them, computes aggregations, and sends 7 separate metrics to Datadog’s backend:

| Sub-metric sent to Datadog | Computation | Value |
| --- | --- | --- |
| put-object-time-taken.avg | mean of all 10 values | 85.2 |
| put-object-time-taken.count | how many values arrived | 10 |
| put-object-time-taken.median | 50th percentile (middle value) | 77.5 |
| put-object-time-taken.95percentile | 95th percentile | 200 |
| put-object-time-taken.max | largest value | 200 |
| put-object-time-taken.min | smallest value | 15 |
| put-object-time-taken.sum | total of all values | 852 |

After sending these 7 numbers, the Agent throws away the raw values and starts fresh for the next 10-second window.

The important thing to note here is that the Datadog server never receives the raw values seen in the array above; it only gets the precomputed aggregations. Also note that this behavior is very Datadog specific: it is the Datadog agent that buffers the values in memory and does the aggregation.
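The agent-side aggregation described above can be sketched as a short function that reproduces the seven sub-metrics for the example window. `aggregate_histogram` is a hypothetical helper; the percentile here uses a simple nearest-rank method, and the real agent's implementation may differ in detail:

```python
# A sketch of per-window histogram aggregation: sort the buffered raw
# values, compute the seven sub-metrics, and (in a real agent) discard
# the raw values afterwards.
import math

def aggregate_histogram(name, values):
    s = sorted(values)
    n = len(s)

    def pct(q):
        # Nearest-rank percentile: the smallest value with at least
        # q*n values at or below it.
        return s[max(0, math.ceil(q * n) - 1)]

    median = (s[n // 2 - 1] + s[n // 2]) / 2 if n % 2 == 0 else s[n // 2]
    return {
        f"{name}.avg": sum(s) / n,
        f"{name}.count": n,
        f"{name}.median": median,
        f"{name}.95percentile": pct(0.95),
        f"{name}.max": s[-1],
        f"{name}.min": s[0],
        f"{name}.sum": sum(s),
    }

window = [45, 120, 30, 88, 200, 15, 67, 150, 42, 95]
print(aggregate_histogram("put-object-time-taken", window))
# avg=85.2, count=10, median=77.5, 95percentile=200, max=200, min=15, sum=852
```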

Distribution:

Distribution metrics are Datadog specific; OpenTelemetry does not distinguish between Histogram and Distribution.

Distribution metrics are very similar to histograms and provide a similar set of aggregations over the values. The key difference from histograms is what gets sent to Datadog.
With a Histogram, the aggregations are precomputed locally at the datadog-agent before anything is sent to the datadog-server; the raw values never leave the host. With a Distribution metric, all the raw values are sent to the datadog-server.
This matters when the same metric is emitted from more than one host. With a Histogram, since only pre-aggregated per-host values reach the datadog-server, accuracy is lower. With a Distribution, since all the raw values reach the datadog-server, accuracy is high.

Consider an example where an application emits a metric put-object-time-taken, and runs on 3 different hosts.

  • Host A values: [45, 120, 30, 88, 200, 15, 67, 150, 42, 95]
  • Host B values: [55, 80, 210, 33, 72, 140, 25, 99, 180, 60]
  • Host C values: [38, 105, 22, 190, 48, 130, 70, 85, 155, 44]
    With Distribution metrics, all 30 raw values from the 3 hosts are sent to the datadog-server, which sorts them:
    [15, 22, 25, 30, 33, 38, 42, 44, 45, 48, 55, 60, 67, 70, 72, 80, 85, 88, 95, 99, 105, 120, 130, 140, 150, 155, 180, 190, 200, 210]
    All the aggregations are computed over this full set of raw values, so Distribution metrics are mathematically correct: there is no averaging of averages, as there is with Histogram metrics.

It is important to note that, since all the raw values are sent to the datadog-server, these are the most expensive kind of metrics.
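The server-side merge described above can be sketched as follows. (In practice, Datadog's backend reportedly uses a sketch data structure rather than storing every raw value verbatim, but the effect is the same: aggregations are computed globally, not per host.)

```python
# A sketch of server-side Distribution aggregation: raw values from
# every host are merged before anything is computed, so min, max, sum,
# and percentiles are global rather than per-host.
host_a = [45, 120, 30, 88, 200, 15, 67, 150, 42, 95]
host_b = [55, 80, 210, 33, 72, 140, 25, 99, 180, 60]
host_c = [38, 105, 22, 190, 48, 130, 70, 85, 155, 44]

merged = sorted(host_a + host_b + host_c)
print(len(merged), min(merged), max(merged), sum(merged))
# -> 30 15 210 2693
```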

Histogram vs Distribution:

Since Histogram and Distribution metrics are so similar, it is important to understand the difference between them. To call it out again: Distribution metrics are Datadog specific and are not a native metric type in OpenTelemetry.

Consider the case below. An application is running on two hosts, A and B. Within the 10-second flush window, each host emits the array of values below.

Host A (handles small payloads): [10, 12, 11, 13, 14] → p95 = 14
Host B (handles large payloads): [500, 600, 700, 800, 900] → p95 = 900

  • Histogram query: avg of per-host p95s = (14 + 900) / 2 = 457
  • Distribution: true p95 over all 10 raw values = 900 (using nearest-rank; the exact number depends on the percentile method, but it always lands near the large-payload host's values)
    The “real” answer is 900, not 457.

Notice how, with a Histogram, the p95 aggregation happens within the agent and only one value per host is sent to the datadog-server, whereas with a Distribution all the raw values end up on the datadog-server, making it mathematically more accurate.
The more skewed your traffic is across hosts, the more histogram’s per-host aggregation misleads you.
Don’t forget that Distribution metrics are the most expensive form of metrics; that is the price we pay for accuracy.
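The two aggregation paths in the example above can be contrasted in a few lines. `p95` is a hypothetical helper using a simple nearest-rank percentile; real agents and backends may use slightly different interpolation:

```python
# A sketch contrasting Histogram-style (per-host, then averaged) and
# Distribution-style (global) p95 computation on skewed traffic.
import math

def p95(values):
    s = sorted(values)
    return s[max(0, math.ceil(0.95 * len(s)) - 1)]

host_a = [10, 12, 11, 13, 14]       # small payloads
host_b = [500, 600, 700, 800, 900]  # large payloads

# Histogram path: p95 computed per host, then the query averages them.
avg_of_p95s = (p95(host_a) + p95(host_b)) / 2
print(avg_of_p95s)           # -> 457.0, describes neither host's traffic

# Distribution path: one global p95 over all raw values.
print(p95(host_a + host_b))  # -> 900
```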

A decision tree: When to use what metric-type:

  • What are we measuring?
    • A thing that “happened” (a discrete event)?
      • Just want to tally occurrences? – COUNT
        • errors, successes, retries, cache hits, etc.
      • Want to understand the spread of values across occurrences?
        • Per-host percentiles are fine? – HISTOGRAM
        • Need truly global percentiles? – DISTRIBUTION
          • latency, payload size, batch size, duration, etc.
    • A thing that “exists” right now?
      • GAUGE
        • Kafka lag, active threads, pool size, etc.

Cost of metric-types:

| Metric Type | Multiplier | 1 tag combo | 10 tag combos |
| --- | --- | --- | --- |
| Count | x1 | 1 | 10 |
| Gauge | x1 | 1 | 10 |
| Histogram | x7 | 7 | 70 |
| Distribution (base) | x5 | 5 | 50 |
| Distribution (+all percentiles) | x10 | 10 | 100 |

Histogram x7 aggregations: .avg, .count, .median, .95percentile, .max, .min, .sum
Distribution x5 base: .avg, .count, .max, .min, .sum
Distribution x10 (+percentiles): base 5 + .p50, .p75, .p90, .p95, .p99
Formula: custom metrics = tag combos × multiplier
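The formula above can be sketched as a tiny calculator. `custom_metrics` is a hypothetical helper; the multipliers mirror the table, while actual billing rules are Datadog's:

```python
# A sketch of the cost formula: custom metrics = tag combos x multiplier.
MULTIPLIER = {
    "count": 1,
    "gauge": 1,
    "histogram": 7,             # avg, count, median, 95percentile, max, min, sum
    "distribution": 5,          # avg, count, max, min, sum
    "distribution+percentiles": 10,
}

def custom_metrics(metric_type, tag_combos):
    return tag_combos * MULTIPLIER[metric_type]

print(custom_metrics("histogram", 10))                 # -> 70
print(custom_metrics("distribution+percentiles", 10))  # -> 100
```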

Note that:

  • every time you emit 1 metric with 1 tag from your code:
    • if it’s a Histogram, it leads to 7 stored metrics
    • if it’s a Distribution, it leads to 5 stored metrics (10 with all percentiles enabled)
  • So it is important to choose the right kind of metric, based on the thing you are trying to measure.

Takeaways:

  • Know what it is you are trying to measure with the metric you emit from code, and pick the right form of metric.
  • Try to avoid Distribution metrics. While they are more “mathematically accurate”, most use cases are covered by Histograms. You need the raw values of a Distribution metric only when high accuracy matters most.
  • Count metrics can also act as a rate, which makes them the best fit for calculating and plotting metrics like application throughput. The metric is still emitted as a count, but tools like Datadog can always plot a rate graph on top of it.
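The count-to-rate conversion in the last bullet can be sketched in a couple of lines. The 10-second flush interval and `rate_per_second` helper are assumptions for illustration:

```python
# A sketch of deriving a rate from a count: divide the per-window count
# by the flush interval to get throughput in events per second.
FLUSH_INTERVAL_SECS = 10  # assumed flush interval

def rate_per_second(window_count, interval=FLUSH_INTERVAL_SECS):
    return window_count / interval

# 250 requests counted in one 10-second window -> 25 req/s
print(rate_per_second(250))  # -> 25.0
```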
