prometheus - Difference between PromQL "by" and "without" unclear

I have a question about calculating response times with Prometheus summary metrics.

I created a summary metric that contains not only the service name but also the complete path and the HTTP method.

Now I am trying to calculate the average response time for the complete service. I read the article about "rate then sum", and either I do not understand how the calculation is done or the calculation is IMHO not correct.

As far as I read this should be the correct way to calculate the response time per second:

sum by(service_id) (
    rate(request_duration_sum{status_code=~"2.*"}[5m])
    /
    rate(request_duration_count{status_code=~"2.*"}[5m])
)
As I understand it, this computes the "duration per second" value (rate of sum / rate of count) for each subset and then sums those values per service_id.
This looks absolutely wrong to me, but perhaps it does not work the way I think it does.
Another way to get a similar-looking result is this:
sum without (path,host) (
    rate(request_duration_sum{status_code=~"2.*"}[5m])
    /
    rate(request_duration_count{status_code=~"2.*"}[5m])
)
But what is the difference?
What is really happening here?
And why do I honestly only get measurable values if I use "max" instead of "sum"?
If I ignored everything I have read, I would try it the following way:
rate(sum by(service_id) request_duration_sum{status_code=~"2.*"}[5m])
/
rate(sum by(service_id) request_duration_count{status_code=~"2.*"}[5m])
But this does not work at all (instant vector vs. range vector and so on).
All of these examples are aggregating incorrectly, as you're averaging an average. You want:
  sum without (path,host) (
    rate(request_duration_sum{status_code=~"2.*"}[5m])
  )
/
  sum without (path,host) (
    rate(request_duration_count{status_code=~"2.*"}[5m])
  )
Which will return the average latency per status_code plus any other remaining labels.
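To see why the order of aggregation matters, here is a minimal Python sketch of the arithmetic. It is not PromQL; the paths and rate values are invented for illustration and stand in for what rate(request_duration_sum[5m]) and rate(request_duration_count[5m]) might return for two paths of one service.

```python
# Hypothetical per-path rates for one service (values invented for illustration).
# Each entry: (rate of request_duration_sum, rate of request_duration_count)
paths = {
    "/fast": (1.0, 100.0),  # 100 requests/s, 0.01 s average latency
    "/slow": (2.0, 1.0),    # 1 request/s, 2.0 s average latency
}

# Like sum by (service_id) (rate(sum) / rate(count)):
# divide per path first, then add the per-path averages together.
sum_of_ratios = sum(s / c for s, c in paths.values())

# Like sum(rate(sum)) / sum(rate(count)):
# add up total time and total requests first, then divide once.
total_duration = sum(s for s, _ in paths.values())
total_requests = sum(c for _, c in paths.values())
ratio_of_sums = total_duration / total_requests

print("sum of per-path averages:", sum_of_ratios)
print("overall average latency: ", ratio_of_sums)
```

With these made-up numbers the first value is about 2.01 s and the second about 0.03 s. Only the second is the average latency over all requests, because the busy fast path is weighted by its request count.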
I think this is right, because you wrote it. But I would like to understand what the given queries really do. What do I have to study? Online courses, the Bible...? ;-)
– eventhorizon
Jun 27, 2018 at 17:46
The by modifier groups aggregate function results by the labels enumerated inside by(...).
The without modifier groups aggregate function results by all the labels except those enumerated inside without(...).
For example, suppose process_resident_memory_bytes metric exists with job, instance and datacenter labels:
process_resident_memory_bytes{job="job1",instance="host1",datacenter="dc1"} N1
process_resident_memory_bytes{job="job1",instance="host2",datacenter="dc1"} N2
process_resident_memory_bytes{job="job1",instance="host1",datacenter="dc2"} N3
process_resident_memory_bytes{job="job2",instance="host1",datacenter="dc1"} N4
Then sum(process_resident_memory_bytes) by (datacenter) would return the total memory usage per datacenter, while sum(process_resident_memory_bytes) without (instance) would return the total memory usage per job and per datacenter.
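The grouping rule can be sketched in a few lines of Python. This only mimics the label grouping described above and is not how Prometheus implements it; the helper names sum_by and sum_without are hypothetical, and the numeric values are placeholders for N1..N4.

```python
from collections import defaultdict

# The four example series; the values stand in for N1..N4 (invented).
series = [
    ({"job": "job1", "instance": "host1", "datacenter": "dc1"}, 10.0),
    ({"job": "job1", "instance": "host2", "datacenter": "dc1"}, 20.0),
    ({"job": "job1", "instance": "host1", "datacenter": "dc2"}, 30.0),
    ({"job": "job2", "instance": "host1", "datacenter": "dc1"}, 40.0),
]

def sum_by(series, labels):
    """Group by only the listed labels, then sum each group."""
    groups = defaultdict(float)
    for lbls, value in series:
        key = tuple(sorted((k, v) for k, v in lbls.items() if k in labels))
        groups[key] += value
    return dict(groups)

def sum_without(series, labels):
    """Group by every label except the listed ones, then sum each group."""
    groups = defaultdict(float)
    for lbls, value in series:
        key = tuple(sorted((k, v) for k, v in lbls.items() if k not in labels))
        groups[key] += value
    return dict(groups)

by_dc = sum_by(series, {"datacenter"})
without_instance = sum_without(series, {"instance"})

for key, value in by_dc.items():
    print("by (datacenter):", dict(key), value)
for key, value in without_instance.items():
    print("without (instance):", dict(key), value)
```

Here by (datacenter) yields two groups (dc1 and dc2), while without (instance) yields three, one per remaining job and datacenter combination. If a new label is later added to the metric, by would still ignore it, whereas without would keep it in the result.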
Using Prometheus metrics in Grafana, the without keyword did not work for me (at least as I expected it to). I got satisfying results with by:
  sum by (status_code) (
    rate(request_duration_sum{status_code=~"2.*"}[5m])
  )
/
  sum by (status_code) (
    rate(request_duration_count{status_code=~"2.*"}[5m])
  )