Multivariate Statistics Examples
Multivariate Statistics Examples
Functional Dependencies
Multivariate correlation can be demonstrated with a very simple data set — a table with two columns, both containing the same values:
CREATE TABLE t (a INT, b INT);
INSERT INTO t SELECT i % 100, i % 100 FROM generate_series(1, 10000) s(i);
ANALYZE t;
t using the number of pages and rows obtained from pg_class:
SELECT relpages, reltuples FROM pg_class WHERE relname = 't';
relpages | reltuples
----------+-----------
45 | 10000
The following example shows the result of estimating a WHERE condition on the a column:
EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a = 1;
QUERY PLAN
-------------------------------------------------------------------------------
Seq Scan on t (cost=0.00..170.00 rows=100 width=8) (actual rows=100.00 loops=1)
Filter: (a = 1)
Rows Removed by Filter: 9900
WHERE condition to use the b column, an identical plan is generated. But observe what happens if we apply the same condition on both columns, combining them with AND:
EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
QUERY PLAN
-----------------------------------------------------------------------------
Seq Scan on t (cost=0.00..195.00 rows=1 width=8) (actual rows=100.00 loops=1)
Filter: ((a = 1) AND (b = 1))
Rows Removed by Filter: 9900
This problem can be fixed by creating a statistics object that directs ANALYZE to calculate functional-dependency multivariate statistics on the two columns:
CREATE STATISTICS stts (dependencies) ON a, b FROM t;
ANALYZE t;
EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
QUERY PLAN
-------------------------------------------------------------------------------
Seq Scan on t (cost=0.00..195.00 rows=100 width=8) (actual rows=100.00 loops=1)
Filter: ((a = 1) AND (b = 1))
Rows Removed by Filter: 9900
Multivariate N-Distinct Counts
A similar problem occurs with estimation of the cardinality of sets of multiple columns, such as the number of groups that would be generated by a GROUP BY clause. When GROUP BY lists a single column, the n-distinct estimate (which is visible as the estimated number of rows returned by the HashAggregate node) is very accurate:
EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT COUNT(*) FROM t GROUP BY a;
QUERY PLAN
-----------------------------------------------------------------------------------------
HashAggregate (cost=195.00..196.00 rows=100 width=12) (actual rows=100.00 loops=1)
Group Key: a
-> Seq Scan on t (cost=0.00..145.00 rows=10000 width=4) (actual rows=10000.00 loops=1)
GROUP BY, as in the following example, is off by an order of magnitude:
EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT COUNT(*) FROM t GROUP BY a, b;
QUERY PLAN
--------------------------------------------------------------------------------------------
HashAggregate (cost=220.00..230.00 rows=1000 width=16) (actual rows=100.00 loops=1)
Group Key: a, b
-> Seq Scan on t (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000.00 loops=1)
DROP STATISTICS stts;
CREATE STATISTICS stts (dependencies, ndistinct) ON a, b FROM t;
ANALYZE t;
EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT COUNT(*) FROM t GROUP BY a, b;
QUERY PLAN
--------------------------------------------------------------------------------------------
HashAggregate (cost=220.00..221.00 rows=100 width=16) (actual rows=100.00 loops=1)
Group Key: a, b
-> Seq Scan on t (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000.00 loops=1)
MCV Lists
As explained in Functional Dependencies, functional dependencies are very cheap and efficient type of statistics, but their main limitation is their global nature (only tracking dependencies at the column level, not between individual column values).
This section introduces multivariate variant of MCV (most-common values) lists, a straightforward extension of the per-column statistics described in Row Estimation Examples. These statistics address the limitation by storing individual values, but it is naturally more expensive, both in terms of building the statistics in ANALYZE, storage and planning time.
Let's look at the query from Functional Dependencies again, but this time with a MCV list created on the same set of columns (be sure to drop the functional dependencies, to make sure the planner uses the newly created statistics).
DROP STATISTICS stts;
CREATE STATISTICS stts2 (mcv) ON a, b FROM t;
ANALYZE t;
EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
QUERY PLAN
-------------------------------------------------------------------------------
Seq Scan on t (cost=0.00..195.00 rows=100 width=8) (actual rows=100.00 loops=1)
Filter: ((a = 1) AND (b = 1))
Rows Removed by Filter: 9900
Inspecting the MCV list is possible using pg_mcv_list_items set-returning function.
SELECT m.* FROM pg_statistic_ext JOIN pg_statistic_ext_data ON (oid = stxoid),
pg_mcv_list_items(stxdmcv) m WHERE stxname = 'stts2';
index | values | nulls | frequency | base_frequency
-------+----------+-------+-----------+----------------
0 | {0, 0} | {f,f} | 0.01 | 0.0001
1 | {1, 1} | {f,f} | 0.01 | 0.0001
...
49 | {49, 49} | {f,f} | 0.01 | 0.0001
50 | {50, 50} | {f,f} | 0.01 | 0.0001
...
97 | {97, 97} | {f,f} | 0.01 | 0.0001
98 | {98, 98} | {f,f} | 0.01 | 0.0001
99 | {99, 99} | {f,f} | 0.01 | 0.0001
(100 rows)
nulls column.
When estimating the selectivity, the planner applies all the conditions on items in the MCV list, and then sums the frequencies of the matching ones. See mcv_clauselist_selectivity in src/backend/statistics/mcv.c for details.
Compared to functional dependencies, MCV lists have two major advantages. Firstly, the list stores actual values, making it possible to decide which combinations are compatible.
EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a = 1 AND b = 10;
QUERY PLAN
---------------------------------------------------------------------------
Seq Scan on t (cost=0.00..195.00 rows=1 width=8) (actual rows=0.00 loops=1)
Filter: ((a = 1) AND (b = 10))
Rows Removed by Filter: 10000
EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT * FROM t WHERE a <= 49 AND b > 49;
QUERY PLAN
---------------------------------------------------------------------------
Seq Scan on t (cost=0.00..195.00 rows=1 width=8) (actual rows=0.00 loops=1)
Filter: ((a <= 49) AND (b > 49))
Rows Removed by Filter: 10000