I have a question about calculating response times with Prometheus summary metrics.
I created a summary metric that is labelled not only with the service name but also with the complete path and the HTTP method.
Now I am trying to calculate the average response time for the complete service.
I read the article about "rate then sum", and either I do not understand how the calculation is done or the calculation is, IMHO, not correct.
As far as I have read, this should be the correct way to calculate the response time per second:
    sum by(service_id) (
        rate(request_duration_sum{status_code=~"2.*"}[5m])
        /
        rate(request_duration_count{status_code=~"2.*"}[5m])
    )
As I understand it, this computes the "duration per second" (rate sum / rate count) value for each subset and then sums those values per service_id.
This looks completely wrong to me, but perhaps it does not work the way I think it does.
Another way to get a similar-looking result is this:
    sum without (path,host) (
        rate(request_duration_sum{status_code=~"2.*"}[5m])
        /
        rate(request_duration_count{status_code=~"2.*"}[5m])
    )
But what is the difference?
What is really happening here?
And why, honestly, do I only get measurable values if I use "max" instead of "sum"?
If I ignored everything I have read, I would try it the following way:
    rate(sum by(service_id) request_duration_sum{status_code=~"2.*"}[5m])
    /
    rate(sum by(service_id) request_duration_count{status_code=~"2.*"}[5m])
But this will not work at all... (instant vector vs range vector and so on...).
All of these examples are aggregating incorrectly, as you're averaging an average. You want:
    sum without (path,host) (
        rate(request_duration_sum{status_code=~"2.*"}[5m])
    )
    /
    sum without (path,host) (
        rate(request_duration_count{status_code=~"2.*"}[5m])
    )
This will return the average latency per status_code plus any other remaining labels.
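To see why the order of division and aggregation matters, here is a small arithmetic sketch in Python. The two paths and their per-second rates are made-up numbers for illustration, not values from the question; they stand in for what rate(request_duration_sum[5m]) and rate(request_duration_count[5m]) might return for two series of one service.

```python
# Hypothetical 5m rates for two paths of one service (assumed numbers).
# Each entry: (rate of request_duration_sum, rate of request_duration_count)
series = {
    "/fast": (0.5, 10.0),  # 10 req/s averaging 0.05 s each
    "/slow": (2.0, 1.0),   # 1 req/s averaging 2.0 s each
}

# sum(rate(sum) / rate(count)): divides per series, then adds the averages.
sum_of_averages = sum(s / c for s, c in series.values())

# avg(rate(sum) / rate(count)): averages the averages, ignoring request volume.
mean_of_averages = sum_of_averages / len(series)

# sum(rate(sum)) / sum(rate(count)): aggregates first, then divides once.
total_duration_rate = sum(s for s, _ in series.values())
total_request_rate = sum(c for _, c in series.values())
true_average = total_duration_rate / total_request_rate

print(f"sum of averages:  {sum_of_averages:.3f} s")
print(f"mean of averages: {mean_of_averages:.3f} s")
print(f"true average:     {true_average:.3f} s")
```

With these assumed inputs, only the last figure weights each path by how many requests it actually served, which is what aggregating the numerator and denominator separately achieves.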
The by modifier groups aggregate function results by the labels enumerated inside by(...).
The without modifier groups aggregate function results by all the labels except those enumerated inside without(...).
For example, suppose the process_resident_memory_bytes metric exists with job, instance and datacenter labels:
process_resident_memory_bytes{job="job1",instance="host1",datacenter="dc1"} N1
process_resident_memory_bytes{job="job1",instance="host2",datacenter="dc1"} N2
process_resident_memory_bytes{job="job1",instance="host1",datacenter="dc2"} N3
process_resident_memory_bytes{job="job2",instance="host1",datacenter="dc1"} N4
Then sum(process_resident_memory_bytes) by (datacenter) would return the total memory usage per datacenter, while sum(process_resident_memory_bytes) without (instance) would return the total memory usage per job and datacenter.
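The grouping behaviour can be mimicked in a few lines of Python. This is only a sketch of the semantics, not how Prometheus implements it; the helper names sum_by and sum_without are hypothetical, and the values 100 to 400 are assumed stand-ins for N1 to N4.

```python
from collections import defaultdict

# The four series from the example, with assumed values in place of N1..N4.
samples = [
    ({"job": "job1", "instance": "host1", "datacenter": "dc1"}, 100),
    ({"job": "job1", "instance": "host2", "datacenter": "dc1"}, 200),
    ({"job": "job1", "instance": "host1", "datacenter": "dc2"}, 300),
    ({"job": "job2", "instance": "host1", "datacenter": "dc1"}, 400),
]

def sum_by(samples, keep):
    """Mimic sum(...) by (keep): group on exactly the listed labels."""
    out = defaultdict(int)
    for labels, value in samples:
        key = tuple(sorted((k, v) for k, v in labels.items() if k in keep))
        out[key] += value
    return dict(out)

def sum_without(samples, drop):
    """Mimic sum(...) without (drop): group on every label except the listed ones."""
    out = defaultdict(int)
    for labels, value in samples:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        out[key] += value
    return dict(out)

by_dc = sum_by(samples, {"datacenter"})
without_instance = sum_without(samples, {"instance"})

for key, value in by_dc.items():
    print("by (datacenter):", dict(key), value)
for key, value in without_instance.items():
    print("without (instance):", dict(key), value)
```

Here by (datacenter) collapses the four series into two groups, while without (instance) keeps both job and datacenter and therefore yields three groups.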
Using Prometheus metrics in Grafana, the without keyword did not work for me (at least not as I expected it to). I got satisfying results with by:
    sum by (status_code) (
        rate(request_duration_sum{status_code=~"2.*"}[5m])
    )
    /
    sum by (status_code) (
        rate(request_duration_count{status_code=~"2.*"}[5m])
    )