Improving ranking time for searches over multiple JSONB fields in PostgreSQL
My search times are actually quite fast now, but as soon as I start ranking the results to find the best ones, I hit a wall. The more hits I get, the slower it becomes: a search takes about 2 ms for uncommon terms and 900 ms or more for common ones. In this example I have collected all the structures that occur in my data (simple value, array, nested array).
Then I build a function that concatenates the name values of the nested array field "author":
CREATE OR REPLACE FUNCTION author_function(
  IN  data        JSONB,
  OUT resultNames TSVECTOR
)
RETURNS TSVECTOR AS $$
DECLARE
  authorRecords   RECORD;
  combinedAuthors JSONB [];
  singleAuthor    JSONB;
BEGIN
  -- The OUT parameter is NULL initially, and NULL || tsvector yields NULL,
  -- so start from an empty tsvector.
  resultNames := to_tsvector('english', '');
  -- Collect every element of the nested "author" array (the key used in the test data).
  FOR authorRecords IN (SELECT value
                        FROM jsonb_array_elements(data #> '{author}'))
  LOOP
    combinedAuthors := combinedAuthors || authorRecords.value;
  END LOOP;
  -- Append the tsvector of each author's name.
  FOREACH singleAuthor IN ARRAY coalesce(combinedAuthors, '{}')
  LOOP
    resultNames := resultNames ||
                   coalesce(to_tsvector('english', singleAuthor ->> 'name'), to_tsvector('english', ''));
  END LOOP;
END; $$
LANGUAGE plpgsql
IMMUTABLE;
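For comparison, the same extraction can be written without loops. The sketch below is not from the original post: the function name `author_names_tsvector` is hypothetical, and it assumes the names live under the `author` key as in the test data. It joins all names into one string and converts that string once.

```sql
-- Hypothetical set-based variant; the name author_names_tsvector is an assumption.
CREATE OR REPLACE FUNCTION author_names_tsvector(data JSONB)
RETURNS TSVECTOR AS $$
  -- string_agg returns NULL when there are no authors, hence the coalesce.
  SELECT to_tsvector('english', coalesce(string_agg(elem ->> 'name', ' '), ''))
  FROM jsonb_array_elements(
         -- Fall back to an empty array when the key is missing or not an array.
         CASE WHEN jsonb_typeof(data -> 'author') = 'array'
              THEN data -> 'author'
              ELSE '[]'::jsonb
         END) AS elem;
$$
LANGUAGE sql
IMMUTABLE;

-- Usage with one of the documents from the test data:
SELECT author_names_tsvector(
  '{"author": [{"id": 0, "name": "Cats"}, {"id": 1, "name": "Dogs"}]}'::jsonb);
```

Whether this is faster than the PL/pgSQL loop on real data would have to be measured; it mainly removes the intermediate array.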
I also need a function on which I can build an index over several concatenated fields:
CREATE OR REPLACE FUNCTION multi_field_function(
  IN data JSONB
)
RETURNS TSVECTOR AS $$
BEGIN
  RETURN
    coalesce(to_tsvector('english', data ->> 'title'),
             to_tsvector('english', '')) ||
    coalesce(to_tsvector('english', data ->> 'subtitles'),
             to_tsvector('english', '')) ||
    coalesce(author_function(data),
             to_tsvector('english', ''));
END; $$
LANGUAGE plpgsql
IMMUTABLE;
Now I need to create the indexes:
CREATE INDEX book_title_idx
  ON book USING GIN (to_tsvector('english', book.data ->> 'title'));
CREATE INDEX book_subtitle_idx
  ON book USING GIN (to_tsvector('english', book.data ->> 'subtitles'));
CREATE INDEX book_author_idx
  ON book USING GIN (author_function(book.data));
CREATE INDEX book_multi_field_idx
  ON book USING GIN (multi_field_function(book.data));
Finally, I add some test data:
INSERT INTO book (data)
VALUES (CAST('{"title": "Cats",' ||
'"subtitles": ["Cats", "Dogs"],' ||
'"author": [{"id": 0, "name": "Cats"}, ' ||
' {"id": 1, "name": "Dogs"}]}' AS JSONB));
INSERT INTO book (data)
VALUES (CAST('{"title": "ats",' ||
'"subtitles": ["Cats", "ogs"],' ||
'"author": [{"id": 2, "name": "ats"}, ' ||
' {"id": 3, "name": "ogs"}]}' AS JSONB));
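The statements above rely on a `book` table that this post does not show. The sketch below assumes a minimal definition (the column types are assumptions, not taken from the post) and then shows what `->>` extracts for an array field, since `multi_field_function` passes `data ->> 'subtitles'` straight to `to_tsvector`.

```sql
-- Assumed schema; the original post does not include the table definition.
CREATE TABLE IF NOT EXISTS book (
  id   BIGSERIAL PRIMARY KEY,
  data JSONB NOT NULL
);

-- ->> returns the JSON array as text; to_tsvector then tokenizes that text
-- and drops the brackets, quotes, and commas.
SELECT
  d ->> 'title'                             AS title_text,
  d ->> 'subtitles'                         AS subtitles_text,
  to_tsvector('english', d ->> 'subtitles') AS subtitles_vector
FROM (SELECT '{"title": "Cats", "subtitles": ["Cats", "Dogs"]}'::jsonb AS d) AS sample;
```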
When I query using my multi_field_function, I get the results I want:
EXPLAIN ANALYZE
SELECT *
FROM (
  SELECT
    id,
    data,
    ts_rank(query, 'cat:*') AS score
  FROM
    book,
    multi_field_function(data) query
  WHERE multi_field_function(data) @@ to_tsquery('cat:*')
  ORDER BY score DESC) a
WHERE score > 0
ORDER BY score DESC;
On my real data this results in the following query plan. You can see that only the last step, the ranking, is really slow:
Sort  (cost=7921.72..7927.87 rows=2460 width=143) (actual time=949.644..952.263 rows=16926 loops=1)
  Sort Key: (ts_rank(query.query, '''cat'':*'::tsquery)) DESC
  Sort Method: external merge  Disk: 4376kB
  ->  Nested Loop  (cost=47.31..7783.17 rows=2460 width=143) (actual time=3.750..933.719 rows=16926 loops=1)
        ->  Bitmap Heap Scan on book  (cost=47.06..7690.67 rows=2460 width=1305) (actual time=3.582..11.904 rows=16926 loops=1)
              Recheck Cond: (multi_field_function(data) @@ to_tsquery('cat:*'::text))
              Heap Blocks: exact=3695
              ->  Bitmap Index Scan on book_multi_field_idx  (cost=0.00..46.45 rows=2460 width=0) (actual time=3.128..3.128 rows=16926 loops=1)
                    Index Cond: (multi_field_function(data) @@ to_tsquery('cat:*'::text))
        ->  Function Scan on multi_field_function query  (cost=0.25..0.27 rows=1 width=32) (actual time=0.049..0.049 rows=1 loops=16926)
              Filter: (ts_rank(query, '''cat'':*'::tsquery) > '0'::double precision)
Planning time: 0.163 ms
Execution time: 953.624 ms
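A rough reading of this plan: the Bitmap Heap Scan is done after about 12 ms, while the Function Scan node (which recomputes multi_field_function(data) and applies the ts_rank filter) runs once per matching row. Multiplying its per-loop time by its loop count estimates where the time goes. The per-loop time is an average that EXPLAIN rounds, so the figure is only approximate.

```sql
-- Numbers copied from the plan above.
SELECT
  16926 * 0.049    AS function_scan_total_ms,         -- loops * per-loop actual time
  933.719 - 11.904 AS nested_loop_minus_heap_scan_ms; -- time spent outside the heap scan
```

Both figures come out above 800 ms, which suggests that most of the runtime is the per-row function evaluation rather than the index lookup or the final sort.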
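Is there any way to keep my JSON structure and still get good, fast search results across multiple fields?
Edit:
I had to modify Vao Tsun's query, because it could not resolve "query" from inside the subquery: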
EXPLAIN ANALYZE
SELECT
  *,
  ts_rank(query, 'cat:*') AS score
FROM (
  SELECT
    id,
    data
  FROM
    book
  WHERE multi_field_function(data) @@ to_tsquery('cat:*')
) a,
multi_field_function(a.data) query
ORDER BY score DESC;
Unfortunately, the performance did not change much:
Sort  (cost=7880.82..7886.97 rows=2460 width=1343) (actual time=863.542..875.035 rows=16840 loops=1)
  Sort Key: (ts_rank(query.query, '''cat'':*'::tsquery)) DESC
  Sort Method: external merge  Disk: 25280kB
  ->  Nested Loop  (cost=43.31..7742.27 rows=2460 width=1343) (actual time=3.570..821.861 rows=16840 loops=1)
        ->  Bitmap Heap Scan on book  (cost=43.06..7686.67 rows=2460 width=1307) (actual time=3.362..12.085 rows=16840 loops=1)
              Recheck Cond: (multi_field_function(data) @@ to_tsquery('cat:*'::text))
              Heap Blocks: exact=1
              ->  Bitmap Index Scan on book_multi_field_idx  (cost=0.00..42.45 rows=2460 width=0) (actual time=2.934..2.934 rows=16840 loops=1)
                    Index Cond: (multi_field_function(data) @@ to_tsquery('cat:*'::text))
        ->  Function Scan on multi_field_function query  (cost=0.25..0.26 rows=1 width=32) (actual time=0.047..0.047 rows=1 loops=16840)
Planning time: 0.090 ms
Execution time: 879.736 ms
Comments:
- `Sort Method: external merge Disk: 4376kB`: what is your work_mem? Try `SET work_mem TO '16MB';` and then run EXPLAIN again.
- It was 1MB. Now it uses 7523kB. That is only a 50 ms improvement.
- How many rows are you searching? Judging by `Bitmap Index Scan on book_multi_field_idx (cost=0.00..46.45 rows=2460 width=0) (actual time=3.128..3.128 rows=16926 loops=1)`, it looks like at least a few million rows.
- For this data set the function scan has to run 16K times (loops=16926). Is that the bottleneck?
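The work_mem suggestion from the comments can be tried per session without touching the server configuration. The sketch below uses a stand-in sort over `generate_series` so that it runs anywhere; that query is an assumption for illustration, and the ranking query from the question would take its place.

```sql
SHOW work_mem;          -- current session value

SET work_mem = '16MB';  -- affects only the current session

-- Stand-in query; replace it with the ranking query from the question.
-- In the output, check whether "Sort Method" reports "external merge Disk: ..."
-- (spilled to disk) or an in-memory method with "Memory: ...".
EXPLAIN (ANALYZE, COSTS OFF)
SELECT g
FROM generate_series(1, 200000) AS g
ORDER BY g DESC;

RESET work_mem;         -- back to the configured default
```

As the comments note, moving the sort into memory only saved about 50 ms here, so the sort is unlikely to be the main cost.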
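Both plans in this question show the Function Scan on multi_field_function executing once per matched row. One possible way to avoid recomputing the tsvector at query time, while leaving the JSON document itself unchanged, is to store the vector in its own column and rank against that. This is a sketch under assumptions, not a measured fix: the table `book_demo`, the column `search_vector`, and the trigger names are hypothetical, and the trigger body uses a simplified expression where the question's schema would call multi_field_function(NEW.data).

```sql
-- Hypothetical demo table; in the question this would be the existing book table
-- with one extra column.
CREATE TABLE IF NOT EXISTS book_demo (
  id            BIGSERIAL PRIMARY KEY,
  data          JSONB NOT NULL,
  search_vector TSVECTOR  -- assumed column holding the precomputed vector
);

-- Recompute the vector whenever a row is inserted or its data changes.
-- Stand-in expression; with the question's functions it would be
-- multi_field_function(NEW.data).
CREATE OR REPLACE FUNCTION book_demo_search_vector_refresh()
RETURNS trigger AS $$
BEGIN
  NEW.search_vector :=
    to_tsvector('english', coalesce(NEW.data ->> 'title', '')) ||
    to_tsvector('english', coalesce(NEW.data ->> 'subtitles', ''));
  RETURN NEW;
END; $$
LANGUAGE plpgsql;

DROP TRIGGER IF EXISTS book_demo_search_vector_trg ON book_demo;
CREATE TRIGGER book_demo_search_vector_trg
  BEFORE INSERT OR UPDATE OF data ON book_demo
  FOR EACH ROW EXECUTE PROCEDURE book_demo_search_vector_refresh();

CREATE INDEX IF NOT EXISTS book_demo_search_vector_idx
  ON book_demo USING GIN (search_vector);

INSERT INTO book_demo (data)
VALUES ('{"title": "Cats", "subtitles": ["Cats", "Dogs"]}'),
       ('{"title": "ats", "subtitles": ["Cats", "ogs"]}');

-- Filter and rank against the stored column instead of calling the function per row.
SELECT
  id,
  data,
  ts_rank(search_vector, to_tsquery('english', 'cat:*')) AS score
FROM book_demo
WHERE search_vector @@ to_tsquery('english', 'cat:*')
ORDER BY score DESC;
```

ts_rank still has to read the stored vector for every matching row, so very common terms will remain more expensive than rare ones. How much this helps on the real data would need to be checked with EXPLAIN ANALYZE.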