发现问题
大部分开发人员习惯使用order by offset limit进行分页,使用该方法可能会导致扫描的数据放大,因为offset的行会被扫描。表现就是一般offset的行比较小的情况也,也就是翻页,是很快的,但是一旦offset的值很大,翻页的数量很大,那么一定会变慢。
看个例子:
#建表,插入数据,并分析收取统计信息 CREATE UNLOGGED TABLE data ( id bigint GENERATED ALWAYS AS IDENTITY, value double precision NOT NULL, created timestamp with time zone NOT NULL ); SELECT setseed(0.2740184); INSERT INTO data (value, created) SELECT random() * 1000, d FROM generate_series( TIMESTAMP '2022-01-01 00:00:00 UTC', TIMESTAMP '2022-12-31 00:00:00 UTC', INTERVAL '1 second' ) AS d(d); ALTER TABLE data ADD PRIMARY KEY (id); VACUUM (ANALYZE) data;
下面我们翻页获取数据看看性能如何:
#如下SQL,创建一个符合索引最合适 SELECT value, created FROM data WHERE value BETWEEN 0 AND 10 ORDER BY created; #创建索引 CREATE INDEX data_created_value_idx ON data (created, value); #可以看到是Index Only Scan,结果很快 postgres=# explain analyze SELECT value, created FROM data WHERE value BETWEEN 0 AND 10 ORDER BY created LIMIT 50; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=0.56..126.51 rows=50 width=16) (actual time=0.022..0.195 rows=50 loops=1) -> Index Only Scan using data_created_value_idx on data (cost=0.56..801382.02 rows=318146 width=16) (actual time=0.021..0.190 rows=50 loops=1) Index Cond: ((value >= '0'::double precision) AND (value <= '10'::double precision)) Heap Fetches: 0 Planning Time: 0.084 ms Execution Time: 0.210 ms #但是当我们OFFSET 200000的时候,执行计划就变了,代价比较高了,这就是我们前面所说的offset都要扫描 postgres=# explain analyze SELECT value, created FROM data WHERE value BETWEEN 0 AND 10 ORDER BY created OFFSET 200000 LIMIT 50; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=434757.47..434763.31 rows=50 width=16) (actual time=2697.793..2704.289 rows=50 loops=1) -> Gather Merge (cost=411422.51..442355.57 rows=265122 width=16) (actual time=2627.028..2695.579 rows=200050 loops=1) Workers Planned: 2 Workers Launched: 2 -> Sort (cost=410422.49..410753.89 rows=132561 width=16) (actual time=2607.976..2613.950 rows=67369 loops=3) Sort Key: created Sort Method: external merge Disk: 2760kB Worker 0: Sort Method: external merge Disk: 2640kB Worker 1: Sort Method: external merge Disk: 2640kB -> Parallel Seq Scan on data (cost=0.00..396876.00 rows=132561 width=16) (actual time=0.042..2551.663 rows=104958 loops=3) Filter: ((value >= '0'::double precision) AND (value <= '10'::double precision)) Rows Removed by Filter: 10378242 Planning Time: 0.102 ms Execution Time: 2704.851 ms (14 rows)
如上我们看到的,翻页越多,性能越差,唯一的好处,就是书写简单。
优化手段1:使用游标
由于普通游标只能在单个事务的上下文中工作。因此,普通游标对于分页的作用有限,因为在事务打开时进行用户交互是一个非常差的体验:长事务不仅会使表锁保持很长时间(这可能阻塞DDL或TRUNCATE语句),而且还会阻塞autovacuum的进程,从而导致表膨胀。
WITH HOLD | WITHOUT HOLD 默认值为WITHOUT HOLD, 使用WITH HOLD将CURSOR的使用范围扩大到SESSION级别,WITHOUT HOLD是TRANSACTION级别,另外,WITH HOLD将消耗更多的资源(内存或临时文件)来保持数据。
游标说明可以参考:https://www.postgresql.org/docs/13/sql-declare.html
#创建一个游标,取出满足的条件的结果集 begin; DECLARE c SCROLL CURSOR WITH HOLD FOR SELECT value, created FROM data WHERE value BETWEEN 0 AND 10 ORDER BY created; COMMIT; #移动游标并取出50行,和OFFSET 200000 LIMIT 50效果一样 MOVE ABSOLUTE 200000 IN c; FETCH 50 FROM c; #使用完游标后,必须记得关闭游标 close c;
使用游标的优点和缺点:
优点:
1.游标适用于所有分页查询,不管是第一页还是最后一页,效率一样
2.游标的结果集是稳定的
缺点:
1.当完成操作时,一定不要忘记关闭游标,否则结果集将保存在服务器上,直到数据库会话结束
2.如果游标长时间打开,数据将变的陈旧,无法获取动态的最新数据
3.游标长时间打开,相当于一个长事物,长事物的负面影响,相信大家有一定共识
优化手段2:使用位点
位点的原理很简单,就是记录上次查询出来的结果作为一个位点,查询的时候基于这个点的条件去查询。这样也就可以去掉offset了。注意要有一个pk,没有的话,需要加一个类似字段,这样位点才不会重复。
例子如下:
#通过查询记住下一页的起始位点,例子中id作为pk,标识唯一 SELECT id, value, created FROM data WHERE value BETWEEN 0 AND 10 ORDER BY created, id LIMIT 50; id | value | created ------+---------------------+------------------------ .........略......... 4568 | 7.771510504657186 | 2022-01-01 01:16:07+08 4586 | 1.2500308700502671 | 2022-01-01 01:16:25+08 4607 | 3.3096537558421346 | 2022-01-01 01:16:46+08 #我们必须记住从页面的最后一行创建的id的值。然后我们可以取下一页 SELECT id, value, created FROM data WHERE value BETWEEN 0 AND 10 AND (created, id) > ('2022-01-01 01:16:46+08', 4607) ORDER BY created, id LIMIT 50; postgres=# explain SELECT id, value, created FROM data WHERE value BETWEEN 0 AND 10 AND (created, id) > ('2022-01-01 01:16:46+08', 4607) ORDER BY created, id LIMIT 50; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------------- ---------- Limit (cost=4.32..194.08 rows=50 width=24) -> Incremental Sort (cost=4.32..1207236.72 rows=318103 width=24) Sort Key: created, id Presorted Key: created -> Index Scan using data_created_value_idx on data (cost=0.56..1192922.08 rows=318103 width=24) Index Cond: ((created >= '2022-01-01 01:16:46+08'::timestamp with time zone) AND (value >= '0'::double precision) AND (value <= '10'::double pr ecision)) Filter: (ROW(created, id) > ROW('2022-01-01 01:16:46+08'::timestamp with time zone, 4607)) (7 rows) #加一个更适合的索引,执行计划会更好一些 CREATE INDEX data_keyset_idx ON data (created, id, value); postgres=# explain SELECT id, value, created FROM data WHERE value BETWEEN 0 AND 10 AND (created, id) > ('2022-01-01 01:16:46+08', 4607) ORDER BY created, id LIMIT 50; -------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------------- Limit (cost=0.56..160.58 rows=50 width=24) -> Index Only Scan using data_keyset_idx on data (cost=0.56..1018064.43 rows=318103 width=24) Index Cond: ((ROW(created, id) > ROW('2022-01-01 01:16:46+08'::timestamp with time zone, 4607)) AND (value >= '0'::double precision) AND (value <= '1 0'::double precision)) (3 rows)
位点优化的有点和缺点:
优点:
1.每次查询只获取我们需要的数据,不需要扫描不额外的数据,减少了相关资源代价
2.每个查询将展示最新并发数据修改的当前数据
缺点:
1.需要一个专门为查询而设计的特殊索引
2.只有事先能获取到确切的位点,查询时才有用
参考:
https://www.cybertec-postgresql.com/en/pagination-problem-total-result-count/#total-count