Statistics, cardinality, selectivity
SQL is a declarative language: the user states what result they want, without specifying how the computer should proceed to obtain it.
It is the DBMS that must work out “how” to perform the operation, making sure to:
Return the correct result
Do so, ideally, as quickly as possible
“As quickly as possible” means:
Minimize disk access
Favor sequential reads (especially important on mechanical disks)
Reduce the number of CPU operations
Reduce the memory footprint
To do this, a DBMS has an optimizer whose role is to find the best execution plan.
PostgreSQL has a cost-based optimizer. Without going into detail, each elementary operation has a unit cost (reading a sequential block, CPU processing of a row, …). PostgreSQL computes the cost of several possible execution plans (exhaustively, when the query is simple enough) and chooses the least expensive one.
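As an illustration, EXPLAIN displays the plan the optimizer selected together with its cost estimates. The people table here anticipates the example used later in this section, and the exact figures are made up; they depend entirely on your data and cost settings:

    EXPLAIN SELECT * FROM people WHERE person = 'child';

                            QUERY PLAN
    -------------------------------------------------------
     Seq Scan on people  (cost=0.00..6.75 rows=100 width=36)
       Filter: (person = 'child'::text)

The first cost figure estimates the work needed before the first row can be returned, the second the total cost of the node, and rows the number of rows the node is expected to emit.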
How can PostgreSQL estimate the cost of a plan? By estimating the cost of each node of the plan based on statistics. PostgreSQL analyzes tables to collect a statistical sample (an operation normally performed by the autovacuum daemon).
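The sample can also be refreshed by hand with ANALYZE, and the collected statistics can be inspected through the pg_stats view (again using the people table from the upcoming example; the columns queried are standard pg_stats columns):

    ANALYZE people;

    SELECT attname, null_frac, n_distinct,
           most_common_vals, most_common_freqs
    FROM pg_stats
    WHERE tablename = 'people' AND attname = 'person';

For each frequent value, most_common_freqs records its observed frequency, which is exactly the selectivity the planner will use for an equality predicate on that value.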
A few words of vocabulary:
Cardinality: in set theory, the number of elements in a set. In databases, it is the number of rows in a table, or the number of rows remaining after a predicate is applied.
Selectivity: the fraction of records retained after a predicate is applied. For example, take a table of people in which about one third are children. The selectivity of the predicate person = 'child' is then 0.33.
If this table contains 300 people (the cardinality of the “people” set), we can estimate the number of children, since we know that the selectivity of the predicate person = 'child' is 0.33:
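estimated cardinality = total cardinality × selectivity = 300 × 0.33 ≈ 100 children

Here is a minimal, self-contained sketch to verify this against the planner. The schema and data are invented for the demonstration, and the exact cost and width figures will vary:

    CREATE TABLE people (name text, person text);

    INSERT INTO people (name, person)
    SELECT 'someone',
           CASE WHEN i % 3 = 0 THEN 'child' ELSE 'adult' END
    FROM generate_series(1, 300) AS i;

    ANALYZE people;

    EXPLAIN SELECT * FROM people WHERE person = 'child';
    --  Seq Scan on people  (cost=0.00..5.75 rows=100 width=36)
    --    Filter: (person = 'child'::text)

The planner took the table's cardinality (300, stored in pg_class.reltuples) and multiplied it by the selectivity of the predicate (about 0.33, derived from pg_stats), arriving at rows=100.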