I have a question about calculating response times with Prometheus summary metrics.

I created a summary metric that contains not only the service name but also the complete path and the HTTP method.

Now I am trying to calculate the average response time for the complete service. I read the article about "rate then sum", and either I do not understand how the calculation is done or the calculation is, IMHO, not correct.

As far as I have read, this should be the correct way to calculate the response time per second:

sum by(service_id) (
    rate(request_duration_sum{status_code=~"2.*"}[5m])
    /
    rate(request_duration_count{status_code=~"2.*"}[5m])
)

My understanding is that this computes the "duration per second" value (rate sum / rate count) for each subset and then sums those values per service_id.

This looks absolutely wrong to me, but I suspect it does not work the way I understand it.

Another way to get a similar-looking result is this:

sum without (path,host) (
    rate(request_duration_sum{status_code=~"2.*"}[5m])
    /
    rate(request_duration_count{status_code=~"2.*"}[5m])
)

  • But what is the difference?
  • What is really happening here?
  • And why, honestly, do I only get measurable values if I use "max" instead of "sum"?

If I ignored everything I have read, I would try it the following way:

    rate(sum by(service_id) request_duration_sum{status_code=~"2.*"}[5m])
    /
    rate(sum by(service_id) request_duration_count{status_code=~"2.*"}[5m])
    

But this will not work at all (instant vector vs. range vector, and so on).

    All of these examples are aggregating incorrectly, as you're averaging an average. You want:

      sum without (path,host) (
        rate(request_duration_sum{status_code=~"2.*"}[5m])
      )
    /
      sum without (path,host) (
        rate(request_duration_count{status_code=~"2.*"}[5m])
      )
    

    This returns the average latency per status_code, plus any other remaining labels.
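
To make the "averaging an average" problem concrete, here is a minimal Python sketch that compares "divide, then aggregate" with "aggregate, then divide". The two paths and all per-second rates are made-up numbers for illustration only; they do not come from the question.

```python
# Hypothetical per-second rates for two paths of one service (illustrative numbers only).
# Each entry: (rate of request_duration_sum, rate of request_duration_count)
series = {
    "/fast": (2.0, 100.0),  # 100 req/s at 0.02 s each
    "/slow": (3.0, 2.0),    # 2 req/s at 1.5 s each
}

# "Divide, then aggregate": compute an average per path, then combine the averages.
per_path_avg = {path: s / c for path, (s, c) in series.items()}
sum_of_ratios = sum(per_path_avg.values())            # what sum(...) over ratios yields
mean_of_ratios = sum_of_ratios / len(per_path_avg)    # an unweighted average of averages
max_of_ratios = max(per_path_avg.values())            # what max(...) over ratios yields

# "Aggregate, then divide": total time spent divided by total number of requests.
total_sum = sum(s for s, _ in series.values())
total_count = sum(c for _, c in series.values())
ratio_of_sums = total_sum / total_count

print(f"sum of ratios:  {sum_of_ratios:.4f}")
print(f"mean of ratios: {mean_of_ratios:.4f}")
print(f"max of ratios:  {max_of_ratios:.4f}")
print(f"ratio of sums:  {ratio_of_sums:.4f}")
```

In this sketch the sum of ratios grows with every additional path, and the max simply reports the slowest path, which may be why it looks more plausible than the sum. Only the ratio of sums weights each path by its request volume.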

    I think this is right - because you wrote it. But I would like to understand what is really done by the given queries. What lecture do I have to study? Online-Courses, Bible...? ;-) – eventhorizon Jun 27, 2018 at 17:46
  • The by modifier groups aggregate function results by labels enumerated inside by(...).
  • The without modifier groups aggregate function results by all the labels except those enumerated inside without(...).

    For example, suppose a process_resident_memory_bytes metric exists with job, instance and datacenter labels:

    process_resident_memory_bytes{job="job1",instance="host1",datacenter="dc1"} N1
    process_resident_memory_bytes{job="job1",instance="host2",datacenter="dc1"} N2
    process_resident_memory_bytes{job="job1",instance="host1",datacenter="dc2"} N3
    process_resident_memory_bytes{job="job2",instance="host1",datacenter="dc1"} N4
    

    Then sum(process_resident_memory_bytes) by (datacenter) would return the total memory usage per datacenter, while sum(process_resident_memory_bytes) without (instance) would return the total memory usage per job and datacenter.
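
The grouping behaviour can be imitated in a few lines of Python. This is only a simplified sketch: the helpers sum_by and sum_without are hypothetical names, and the values 10, 20, 30 and 40 are placeholders standing in for N1 to N4.

```python
from collections import defaultdict

# The four series from the example, with placeholder values standing in for N1..N4.
series = [
    ({"job": "job1", "instance": "host1", "datacenter": "dc1"}, 10),  # N1
    ({"job": "job1", "instance": "host2", "datacenter": "dc1"}, 20),  # N2
    ({"job": "job1", "instance": "host1", "datacenter": "dc2"}, 30),  # N3
    ({"job": "job2", "instance": "host1", "datacenter": "dc1"}, 40),  # N4
]

def sum_by(series, keep):
    """Imitate sum(...) by (keep): group on exactly the listed labels."""
    groups = defaultdict(int)
    for labels, value in series:
        key = tuple((name, labels[name]) for name in sorted(keep))
        groups[key] += value
    return dict(groups)

def sum_without(series, drop):
    """Imitate sum(...) without (drop): group on every label except the listed ones."""
    groups = defaultdict(int)
    for labels, value in series:
        key = tuple((name, v) for name, v in sorted(labels.items()) if name not in drop)
        groups[key] += value
    return dict(groups)

by_datacenter = sum_by(series, {"datacenter"})
without_instance = sum_without(series, {"instance"})

print(by_datacenter)
print(without_instance)
```

With these placeholder values, by (datacenter) produces two groups (dc1 and dc2), while without (instance) produces three, one per job and datacenter pair, because without keeps every label that is not listed.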

    When using Prometheus metrics in Grafana, the without keyword did not work for me (at least not as I expected it to). I got satisfactory results with by:

      sum by (status_code)(
        rate(request_duration_sum{status_code=~"2.*"}[5m])
      )
    /
      sum by (status_code)(
        rate(request_duration_count{status_code=~"2.*"}[5m])
      )
            
