Querying lexemes by number of occurrences in a ts_vector in PostgreSQL 9.4

postgresql nlp

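Is it possible to query PostgreSQL with a WHERE clause based on the number of times a lexeme occurs in a ts_vector?

For example, if you create a ts_vector from the phrase "top hat", can you SELECT * FROM table WHERE ts_vector @@ {the lexeme 'top' occurs twice}?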

You can use this function:

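create or replace function number_of_occurrences(vector tsvector, token text)
returns integer language sql stable as $$
    -- The number of positions equals the number of commas in the entry plus one.
    select coalesce((
        select length(elem) - length(replace(elem, ',', '')) + 1
        from unnest(string_to_array(vector::text, ' ')) elem
        -- Match the whole quoted lexeme, so that 'top' does not also match 'topic'.
        where elem like '''' || token || ''':%'), 0)
$$;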

select number_of_occurrences(to_tsvector('top hat on top of the cat'), 'top');

 number_of_occurrences 
-----------------------
                     2
(1 row)

SELECT * 
FROM a_table 
WHERE ts_vector @@ to_tsquery('top')
AND number_of_occurrences(ts_vector, 'top') = 2;

Of course, the function works correctly only when the vector contains lexemes with positions:

select to_tsvector('top hat on top of the cat');

                   to_tsvector                   
-------------------------------------------------
 'cat':7 'hat':2 'of':5 'on':3 'the':6 'top':1,4
(1 row)
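
To illustrate why positions matter, here is a minimal sketch using the built-in strip() function, which removes positions and weights from a tsvector. The 'simple' text search configuration is an assumption made here so that stop words such as "on" and "the" are kept, matching the output shown above.

```sql
-- Assumes the 'simple' configuration (no stemming, no stop-word removal).
select strip(to_tsvector('simple', 'top hat on top of the cat'));
-- 'cat' 'hat' 'of' 'on' 'the' 'top'
```

Once the positions are stripped, each lexeme appears exactly once in the vector, so the number of occurrences can no longer be recovered from it.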

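For this you can use a combination of unnest and array_length:

SELECT *
FROM a_table
WHERE (
    SELECT array_length(positions, 1)
    FROM unnest(ts_vector)
    WHERE lexeme = 'top'
) = 2;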

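I don't think this will be able to use a GIN index on the ts_vector, but it is probably faster than the string manipulation performed in the accepted answer's function.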
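
One possible way around the index limitation is to combine the unnest subquery with an ordinary @@ match, so that a GIN index can narrow the candidate rows before the positions are counted. The sketch below assumes the a_table and ts_vector names used in the other answer; the index name is hypothetical. Note that unnest(tsvector) is available only from PostgreSQL 9.6 onward, so this would not run on 9.4.

```sql
-- Hypothetical index name; a_table and ts_vector are assumed from the examples above.
CREATE INDEX a_table_ts_vector_idx ON a_table USING gin (ts_vector);

SELECT *
FROM a_table
WHERE ts_vector @@ to_tsquery('top')      -- indexable condition
AND (
    -- Count the positions recorded for the lexeme in this row's vector.
    SELECT array_length(positions, 1)
    FROM unnest(ts_vector)                -- requires PostgreSQL 9.6 or later
    WHERE lexeme = 'top'
) = 2;
```

Whether the planner actually uses the index depends on table size and statistics, so treat this as a starting point rather than a guaranteed speed-up.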

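This doesn't answer the core question, which is about finding lexeme occurrences with a WHERE clause. Thanks for your input so far, though!

Thanks! As a follow-up, do you know how this performs? I have 2 billion messages to search, and the ts_vector is already indexed. I'm worried this function is too expensive for that much data.

You have to use the index in the search query, as shown in the updated example. To improve performance, I restructured my function to eliminate the costly regexp_* call. The function should now be several times faster.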
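
One way to check whether the index is really used on a large table is to look at the query plan. This is only a sketch, assuming the a_table, ts_vector and number_of_occurrences names from the accepted answer and an existing GIN index on ts_vector.

```sql
-- ANALYZE executes the query for real, so try it on a sample of the data first.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM a_table
WHERE ts_vector @@ to_tsquery('top')
AND number_of_occurrences(ts_vector, 'top') = 2;
```

If the plan shows a bitmap index scan on the GIN index for the @@ condition, the function is evaluated only for rows that already contain the lexeme; a sequential scan would mean it runs for every row.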