PostgreSQL: combining LAG and LEAD to query the n previous and next rows


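I have a PostgreSQL table, let's call it tokens, that holds the grammatical annotation of every token in each line of a text. It basically looks like this: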

idx | line | tno | token   | annotation      | lemma
----+------+-----+---------+-----------------+---------
  1 | I.01 | 1   | This    | DEM.PROX        | this
  2 | I.01 | 2   | is      | VB.COP.3SG.PRES | be
  3 | I.01 | 3   | an      | ART.INDEF       | a
  4 | I.01 | 4   | example | NN.INAN         | example
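I want to write a query that lets me search for grammatical context: in this case, a query that checks whether a certain annotation occurs within a window of size n before and after the current row. As far as I understand, PostgreSQL's window functions LEAD and LAG are suited to this. As a first step, I wrote the following query based on the documentation I could find for these functions: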

SELECT *
FROM (
    SELECT token, annotation, lemma,
        -- LAG(annotation) OVER prev_rows AS prev_anno, -- ?????
        LEAD(annotation) OVER next_rows AS next_anno
    FROM tokens
    WINDOW next_rows AS (
        ORDER BY line, tno ASC
        ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING
    )
    ORDER BY line, tno ASC
) AS "window"
WHERE
    lemma LIKE '...'
    AND "window".next_anno LIKE '...'
;
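However, this only searches the two following rows. My question is: how can I rephrase the query so that the window covers both the preceding and the following rows in the table? Obviously, I can't have two WINDOW clauses or do something like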

ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
AND ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING

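I'm not sure whether I have understood your use case correctly: you want to check whether a given annotation occurs in one of 5 rows (the 2 preceding rows, the current row, and the 2 following rows). Is that right?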


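  • You can define a window like ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING.
  • LEAD and LAG each return only a single value, here the one value after or before the current row (if such a row exists), no matter how many rows the window frame contains. But you want to check any of these five rows.
  • One way to achieve this: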

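  • Define the window as described above.
  • Use array_agg to aggregate all annotations of these five rows (as far as they exist), which gives an array.
  • Use unnest to expand this array into one row per element, because I could not find a way to search the array elements with LIKE. This gives the following result, which can be filtered in a next step.
  • Result of the subquery: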

    token     annotation        lemma     surrounded_annos
    This      DEM.PROX          this      DEM.PROX
    This      DEM.PROX          this      VB.COP.3SG.PRES
    This      DEM.PROX          this      ART.INDEF
    is        VB.COP.3SG.PRES   be        DEM.PROX
    is        VB.COP.3SG.PRES   be        VB.COP.3SG.PRES
    is        VB.COP.3SG.PRES   be        ART.INDEF
    is        VB.COP.3SG.PRES   be        NN.INAN
    an        ART.INDEF         a         DEM.PROX
    an        ART.INDEF         a         VB.COP.3SG.PRES
    an        ART.INDEF         a         ART.INDEF
    an        ART.INDEF         a         NN.INAN
    example   NN.INAN           example   VB.COP.3SG.PRES
    example   NN.INAN           example   ART.INDEF
    example   NN.INAN           example   NN.
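
The steps above can be sketched roughly as follows. This is a minimal sketch, assuming the tokens table from the question; the names five_rows, annos, w and a are chosen here for illustration only. The unnest step is written as a LATERAL join, which is one of several possible ways to expand the array, and the two LIKE patterns are example values.

```sql
-- Sketch: collect the annotations of the 2 preceding rows, the current row
-- and the 2 following rows into an array, then expand it to one row each.
SELECT w.token, w.annotation, w.lemma, a.surrounded_annos
FROM (
    SELECT line, tno, token, annotation, lemma,
           array_agg(annotation) OVER five_rows AS annos
    FROM tokens
    WINDOW five_rows AS (
        ORDER BY line, tno
        ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
    )
) AS w
CROSS JOIN LATERAL unnest(w.annos) AS a(surrounded_annos)
WHERE w.lemma LIKE 'be'                    -- example filter on the current row
  AND a.surrounded_annos LIKE 'DEM.%'      -- example filter on the context
ORDER BY w.line, w.tno;
```

Like the window in the question, this one is not partitioned, so the context can reach into the neighbouring line. If that is not wanted, adding PARTITION BY line to the window definition should keep the five rows within one line.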
    

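This seems to do essentially what I want (thanks!). However, if the filter condition on surrounded_annos is negative (NOT LIKE), is there a way to eliminate a token altogether when the filter predicate is found anywhere in its window? For example, WHERE lemma LIKE 'an' AND "window".surrounded_annos NOT LIKE '%VB.COP%' should return an empty result, because the row containing 'an' has a neighbour that is LIKE '%VB.COP%'.

Another approach is to compute the relative position of each token within its sentence and self-join the tokens (this also lets you select skip-grams by distance):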
    
    WITH www AS (   -- enumerate word positions, ordered by sentence (line) and tno
        SELECT line, tno    -- candidate key
            , row_number() OVER sentence AS rn
        FROM tokens
        WINDOW sentence AS ( ORDER BY line ASC, tno ASC)
            )
    SELECT t0.line AS line
            , t0.token AS this
            , t1.tno AS tno
            , w1.rn - w0.rn AS rel  -- relative position
            , t1.token AS that
            , t1.annotation AS anno
    FROM tokens t0
    JOIN tokens t1 ON t1.line = t0.line     -- same sentence
    JOIN www w0 ON t0.line = w0.line AND t0.tno= w0.tno -- PK1
    JOIN www w1 ON t1.line = w1.line AND t1.tno= w1.tno -- PK2
    WHERE 1=1
    AND t0.lemma LIKE 'be'
        -- AND t1.annotation LIKE '%.PROX' AND w1.rn - w0.rn  = -1
            ;
    
    -- But if your tno is consecutive (gapless) within lines,
    -- you can omit the enumeration step and do a plain self-join:
    
    SELECT t0.line AS line
            , t0.token AS this
            , t1.tno AS tno
            , t1.tno - t0.tno AS rel        -- relative position
            , t1.token AS that
            , t1.annotation AS anno
    FROM tokens t0
    JOIN tokens t1 ON t1.line = t0.line     -- same sentence
    WHERE 1=1
    AND t0.lemma LIKE 'be'
        -- AND t1.annotation LIKE '%.PROX' AND t1.tno - t0.tno = -1
            ;
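
The plain self-join can also be limited to a window of n tokens, and the negated case raised in the comment above might be handled with NOT EXISTS. The following is only a sketch under stated assumptions: tno is an integer and gapless within each line, n is 2, and the lemma 'a' and the pattern '%VB.COP%' are example values taken from the sample data.

```sql
-- Sketch: keep a token only if NO neighbour within 2 positions
-- (same line, either side) has an annotation matching the pattern.
SELECT t0.line, t0.tno, t0.token, t0.annotation, t0.lemma
FROM tokens t0
WHERE t0.lemma LIKE 'a'                            -- the sample row for token 'an'
  AND NOT EXISTS (
        SELECT 1
        FROM tokens t1
        WHERE t1.line = t0.line                    -- same sentence
          AND t1.tno - t0.tno BETWEEN -2 AND 2     -- window of n = 2
          AND t1.tno <> t0.tno                     -- skip the token itself
          AND t1.annotation LIKE '%VB.COP%'        -- unwanted context
      );
```

With the four sample rows from the question this should return no rows, because 'an' has the neighbour 'is' annotated VB.COP.3SG.PRES. Replacing NOT EXISTS with EXISTS gives the positive variant of the same context check.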