Poor performance of word-matching regular expressions on Postgres (tags: sql, regex, postgresql, postgresql-9.3)

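I have a list of blocked phrases, and I want to check whether any of them occurs in user-entered text, but performance is very poor.

I am using this query: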

SELECT value FROM blocked_items WHERE lower(unaccent( 'my input text' )) ~* ('[[:<:]]' || value || '[[:>:]]') LIMIT 1;

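A plain LIKE substring match is faster than the first query. The problem is that I need to keep the word-boundary test.

This check runs frequently in a large application, so performance matters a lot to me.

Do you have any suggestions for speeding it up?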

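Since you know that LIKE (~~) queries are fast and RegEx (~) queries are slow, the simplest solution is to combine both conditions (\m and \M are equivalent to [[:<:]] and [[:>:]]):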
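A sketch of the combined query, reconstructed from the Filter line of the first query plan below rather than copied from the answer. It assumes the `unaccent` extension is installed and that `value` is already lowercased and unaccented:

```sql
-- Assumption: CREATE EXTENSION IF NOT EXISTS unaccent; has been run.
SELECT value
FROM blocked_items
WHERE lower(unaccent('my input text')) LIKE ('%' || value || '%')  -- cheap substring prefilter
  AND lower(unaccent('my input text')) ~ ('\m' || value || '\M')   -- word-boundary check
LIMIT 1;
```

Note that PostgreSQL does not strictly guarantee the evaluation order of AND-ed conditions, so the prefilter effect is worth confirming with EXPLAIN ANALYZE on your own data.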

This way, the fast condition filters out most of the rows, and the slow condition then discards the rest.

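I am using the faster case-sensitive operators, assuming that value is already normalized. If that is not the case, remove the (then redundant) lower() and use the case-insensitive versions (ILIKE and ~*), as in the original query.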
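If `value` is not stored in lowercase, the case-insensitive variant might look like the following. This is a sketch under the same assumptions as before (the `unaccent` extension is installed, and `value` contains no accents):

```sql
SELECT value
FROM blocked_items
WHERE unaccent('my input text') ILIKE ('%' || value || '%')  -- case-insensitive substring prefilter
  AND unaccent('my input text') ~* ('\m' || value || '\M')   -- case-insensitive word-boundary check
LIMIT 1;
```

In both variants `value` is used as a pattern, so any `%`, `_` or regex metacharacters stored in it are interpreted rather than matched literally.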

On my test set of 370k rows, this speeds the query up from 6 s (warm cache) to 90 ms:

                                                                                       QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..1651.85 rows=1 width=10) (actual time=89.702..89.702 rows=1 loops=1)
   ->  Seq Scan on blocked_items  (cost=0.00..14866.61 rows=9 width=10) (actual time=89.701..89.701 rows=1 loops=1)
         Filter: ((lower(unaccent('my input text'::text)) ~~ (('%'::text || value) || '%'::text)) AND (lower(unaccent('my input text'::text)) ~ (('\m'::text || value) || '\M'::text)))
         Rows Removed by Filter: 153281
 Planning Time: 0.097 ms
 Execution Time: 89.717 ms
(6 rows)

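However, we are still doing a full table scan, so performance will vary depending on where the matching row sits in the table.

Ideally, we could use an index to answer the query in near-constant time.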

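Let's rewrite the query to use full-text search functions and operators (the statement appears in the listing at the end of this page).

First we convert the input into a tsvector with to_tsvector, then we check whether each blocked phrase, converted with phraseto_tsquery, matches that vector.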
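To make the matching step concrete, here is a small illustration with literal strings instead of the `blocked_items` table. The sample phrases are made up for this example, and `phraseto_tsquery` requires PostgreSQL 9.6 or later:

```sql
-- The 'simple' configuration lowercases words but does not stem them
-- or drop stop words.
SELECT to_tsvector('simple', 'My input text');
-- 'input':2 'my':1 'text':3

SELECT phraseto_tsquery('simple', 'input text');
-- 'input' <-> 'text'

-- Whole words in sequence match ...
SELECT to_tsvector('simple', 'My input text')
       @@ phraseto_tsquery('simple', 'input text');  -- true

-- ... but a partial word does not, which gives the word-boundary behaviour.
SELECT to_tsvector('simple', 'My input text')
       @@ phraseto_tsquery('simple', 'put text');    -- false
```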

The test query takes about 440 ms, which is a step backwards from our combined query:

                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..104.01 rows=1 width=10) (actual time=437.761..437.761 rows=1 loops=1)
   ->  Seq Scan on blocked_items  (cost=0.00..192516.05 rows=1851 width=10) (actual time=437.760..437.760 rows=1 loops=1)
         Filter: (to_tsvector('simple'::regconfig, unaccent('my input text'::text)) @@ phraseto_tsquery('simple'::regconfig, value))
         Rows Removed by Filter: 153281
 Planning Time: 0.063 ms
 Execution Time: 437.772 ms
(6 rows)

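Since we cannot index the tsquery side of tsvector @@ tsquery, we can rewrite the query once more with tsquery @> tsquery, checking whether the blocked phrase is a sub-query of the input phrase. That expression can be indexed with the built-in GiST operator class tsquery_ops (the CREATE INDEX statement and the rewritten query are in the listing at the end of this page).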

The query can now use an index scan and takes 20 ms on the same data.
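One caveat worth checking before relying on this: according to the PostgreSQL documentation, tsquery containment only considers whether all lexemes of one query appear in the other and ignores the combining operators, so word order and adjacency are not tested. A possible way to keep the phrase semantics, offered as a sketch rather than as part of the original answer, is to pair the indexable condition with the exact phrase match:

```sql
-- Containment compares lexeme sets only, so this is expected to be true
-- even though the word order differs:
SELECT phraseto_tsquery('simple', 'my input text')
       @> phraseto_tsquery('simple', 'text input');

-- Sketch: indexable prefilter plus exact phrase check.
SELECT value
FROM blocked_items
WHERE phraseto_tsquery('simple', unaccent('my input text'))
      @> phraseto_tsquery('simple', value)             -- can use the GiST index
  AND to_tsvector('simple', unaccent('my input text'))
      @@ phraseto_tsquery('simple', value)             -- enforces order and adjacency
LIMIT 1;
```

Whether the planner still chooses the index scan for the combined form depends on your data and statistics.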

Because GiST is a lossy index, query time may vary depending on how many rechecks are needed:

                                                                QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.54..4.23 rows=1 width=10) (actual time=19.215..19.215 rows=1 loops=1)
   ->  Index Scan using blocked_items_search on blocked_items  (cost=0.54..1367.01 rows=370 width=10) (actual time=19.214..19.214 rows=1 loops=1)
         Index Cond: (phraseto_tsquery('simple'::regconfig, value) <@ phraseto_tsquery('simple'::regconfig, unaccent('my input text'::text)))
         Rows Removed by Index Recheck: 4028
 Planning Time: 0.093 ms
 Execution Time: 19.236 ms
(6 rows)
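To see how many candidate rows the recheck discards for a particular input, the query can be run under EXPLAIN. A minimal sketch, assuming the index from this answer exists:

```sql
-- EXPLAIN ANALYZE actually executes the query; BUFFERS adds page-access counts.
EXPLAIN (ANALYZE, BUFFERS)
SELECT value FROM blocked_items
WHERE phraseto_tsquery('simple', unaccent('my input text'))
   @> phraseto_tsquery('simple', value)
LIMIT 1;
```

The "Rows Removed by Index Recheck" line in the output shows how much work the lossy index left for the heap recheck.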

Maybe you could use [...]? That might work better.

You did not only remove the word-boundary assertions, you also changed the operator from a regex match to a substring match. Of course, doing fundamentally different things takes different amounts of time. If I only add or remove the word-boundary assertions while keeping the same operator, I see just minor differences in timing. Of course, I had to write my own regex, since I don't know what is in your blocked_items table.

You can add a btree index on the list_id column to reduce the number of rows in the seq scan.

Statements from the answer above, in order:

SELECT value FROM blocked_items
WHERE to_tsvector('simple', unaccent('my input text'))
   @@ phraseto_tsquery('simple', value)
LIMIT 1;


CREATE INDEX blocked_items_search ON blocked_items
  USING gist (phraseto_tsquery('simple', value));

ANALYZE blocked_items; -- update query planner stats

SELECT value FROM blocked_items
WHERE phraseto_tsquery('simple', unaccent('my input text'))
   @> phraseto_tsquery('simple', value)
LIMIT 1;
