PostgreSQL: combining LAG and LEAD to query the n previous and next rows
I have a PostgreSQL table, let's call it tokens, that contains grammatical annotations for every token in each line of a text. It basically looks like this:
idx | line | tno | token | annotation | lemma
----+------+-----+---------+-----------------+---------
1 | I.01 | 1 | This | DEM.PROX | this
2 | I.01 | 2 | is | VB.COP.3SG.PRES | be
3 | I.01 | 3 | an | ART.INDEF | a
4 | I.01 | 4 | example | NN.INAN | example
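I would like to write a query that lets me search for grammatical context; in this case, a query that checks whether a certain annotation exists within a window of size n before and after the current row. As far as I understand, PostgreSQL's window functions LEAD and LAG are suited to this. As a first attempt, I wrote the following query based on the documentation I could find for these functions: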
SELECT *
FROM (
    SELECT token, annotation, lemma,
           -- LAG(annotation) OVER prev_rows AS prev_anno, -- ?????
           LEAD(annotation) OVER next_rows AS next_anno
    FROM tokens
    WINDOW next_rows AS (
        ORDER BY line, tno ASC
        ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING
    )
    ORDER BY line, tno ASC
) AS "window"
WHERE
    lemma LIKE '...'
    AND "window".next_anno LIKE '...'
;
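However, this only searches the following two rows. My question is: how can I rephrase the query so that the window includes both the previous and the next rows of the table? Obviously, I cannot have two WINDOW clauses or do something like: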
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
AND ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING
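I am not sure whether I understood your use case correctly: you want to check whether a given annotation occurs in any one of five rows (the two previous rows, the current row, and the two next rows). Is that right?

A window can span both directions, for example with a frame of ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING.

LEAD and LAG each return only a single value, in this case the one value after or before the current row (if the window provides one), no matter how many rows the window contains. But you want to check any of these five rows.

One way to achieve this: array_agg aggregates all annotations in these five rows (as far as they exist), which gives an array. unnest then expands this array into one row per element, because I was not able to search the array elements with LIKE. This gives the following result (which can be filtered in the next step):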
token   | annotation      | lemma   | surrounded_annos
--------+-----------------+---------+------------------
This    | DEM.PROX        | this    | DEM.PROX
This    | DEM.PROX        | this    | VB.COP.3SG.PRES
This    | DEM.PROX        | this    | ART.INDEF
is      | VB.COP.3SG.PRES | be      | DEM.PROX
is      | VB.COP.3SG.PRES | be      | VB.COP.3SG.PRES
is      | VB.COP.3SG.PRES | be      | ART.INDEF
is      | VB.COP.3SG.PRES | be      | NN.INAN
an      | ART.INDEF       | a       | DEM.PROX
an      | ART.INDEF       | a       | VB.COP.3SG.PRES
an      | ART.INDEF       | a       | ART.INDEF
an      | ART.INDEF       | a       | NN.INAN
example | NN.INAN         | example | VB.COP.3SG.PRES
example | NN.INAN         | example | ART.INDEF
example | NN.INAN         | example | NN.
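The query behind these rows is not reproduced in the text. A minimal sketch that is consistent with the description might look like the following. The inline tokens CTE only repeats the four sample rows from the question so that the statement runs on its own; the window name w, the alias s, and the two LIKE patterns are illustrative assumptions, not part of the original answer.

```sql
-- Sample rows from the question; remove this CTE to query the real tokens table.
WITH tokens (idx, line, tno, token, annotation, lemma) AS (
    VALUES (1, 'I.01', 1, 'This',    'DEM.PROX',        'this'),
           (2, 'I.01', 2, 'is',      'VB.COP.3SG.PRES', 'be'),
           (3, 'I.01', 3, 'an',      'ART.INDEF',       'a'),
           (4, 'I.01', 4, 'example', 'NN.INAN',         'example')
)
SELECT *
FROM (
    SELECT token, annotation, lemma,
           -- collect the annotations of up to five rows, then emit one row per element
           unnest(array_agg(annotation) OVER w) AS surrounded_annos
    FROM tokens
    WINDOW w AS (
        ORDER BY line, tno
        ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
    )
) AS s
WHERE lemma LIKE 'be'                  -- illustrative pattern
  AND surrounded_annos LIKE 'DEM.%';   -- illustrative pattern
```

Without the WHERE clause the subquery yields the 14 rows listed above. Note that the frame includes the current row, so a token's own annotation also counts as part of its context.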
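Another approach is to compute the relative position of each token within its sentence and to self-join the tokens table (this also lets you select skip-grams by distance). The queries for this approach are listed at the end of the page.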
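This seems to do basically what I want (thanks!). However, if the filter on surrounded_annos is negative (NOT LIKE), is there a way to eliminate a token when the filter's predicate is found anywhere in its window? For example, WHERE lemma LIKE 'an' AND "window".surrounded_annos NOT LIKE '%VB.COP%' should return an empty result, because the row containing 'an' has a neighbour that matches '%VB.COP%'.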
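One possible way to handle such a negative filter, offered here as a sketch rather than as the answer given in the thread, is to skip unnest and instead let the aggregate bool_or run over the same window, so that each token gets a single flag saying whether any row in its window matches. The column name has_match and the inline sample-data CTE are assumptions for illustration; the lemma pattern is 'a' because that is the lemma stored for the token 'an' in the sample table.

```sql
-- Sample rows from the question; remove this CTE to query the real tokens table.
WITH tokens (idx, line, tno, token, annotation, lemma) AS (
    VALUES (1, 'I.01', 1, 'This',    'DEM.PROX',        'this'),
           (2, 'I.01', 2, 'is',      'VB.COP.3SG.PRES', 'be'),
           (3, 'I.01', 3, 'an',      'ART.INDEF',       'a'),
           (4, 'I.01', 4, 'example', 'NN.INAN',         'example')
)
SELECT token, annotation, lemma
FROM (
    SELECT token, annotation, lemma,
           -- true if at least one row in the 5-row window matches the pattern
           bool_or(annotation LIKE '%VB.COP%') OVER w AS has_match
    FROM tokens
    WINDOW w AS (
        ORDER BY line, tno
        ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
    )
) AS s
WHERE lemma LIKE 'a'
  AND NOT has_match;   -- keep only tokens with no matching row in the window
```

With the sample data this returns no rows, because 'is' (VB.COP.3SG.PRES) lies within two rows of 'an'. The frame includes the current row, so a token whose own annotation matches is excluded as well.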
WITH www AS ( -- enumerate tokens in reading order; differences of rn give relative positions
    SELECT line, tno -- candidate key
         , row_number() OVER sentence AS rn
    FROM tokens
    WINDOW sentence AS (ORDER BY line ASC, tno ASC)
)
SELECT t0.line AS line
     , t0.token AS this
     , t1.tno AS tno
     , w1.rn - w0.rn AS rel -- relative position
     , t1.token AS that
     , t1.annotation AS anno
FROM tokens t0
JOIN tokens t1 ON t1.line = t0.line -- same sentence
JOIN www w0 ON t0.line = w0.line AND t0.tno = w0.tno -- PK1
JOIN www w1 ON t1.line = w1.line AND t1.tno = w1.tno -- PK2
WHERE 1=1
  AND t0.lemma LIKE 'be'
  -- AND t1.annotation LIKE '%.PROX' AND w1.rn - w0.rn = -1
;
-- But if your tno is consecutive (gapless) within lines,
-- you can omit the enumeration step and do a plain self-join:
SELECT t0.line AS line
     , t0.token AS this
     , t1.tno AS tno
     , t1.tno - t0.tno AS rel -- relative position
     , t1.token AS that
     , t1.annotation AS anno
FROM tokens t0
JOIN tokens t1 ON t1.line = t0.line -- same sentence
WHERE 1=1
  AND t0.lemma LIKE 'be'
  -- AND t1.annotation LIKE '%.PROX' AND t1.tno - t0.tno = -1
;
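To limit the self-join to the n rows on either side that the question asks about, the distance can be bounded in the join condition. The following is a sketch under the same gapless-tno assumption, with n fixed at 2 and the sample rows inlined so that it runs on its own; both LIKE patterns are illustrative.

```sql
-- Sample rows from the question; remove this CTE to query the real tokens table.
WITH tokens (idx, line, tno, token, annotation, lemma) AS (
    VALUES (1, 'I.01', 1, 'This',    'DEM.PROX',        'this'),
           (2, 'I.01', 2, 'is',      'VB.COP.3SG.PRES', 'be'),
           (3, 'I.01', 3, 'an',      'ART.INDEF',       'a'),
           (4, 'I.01', 4, 'example', 'NN.INAN',         'example')
)
SELECT t0.line AS line
     , t0.token AS this
     , t1.token AS that
     , t1.tno - t0.tno AS rel          -- relative position, -2 .. +2
     , t1.annotation AS anno
FROM tokens t0
JOIN tokens t1
  ON t1.line = t0.line                           -- same sentence
 AND t1.tno BETWEEN t0.tno - 2 AND t0.tno + 2    -- at most 2 tokens away
 AND t1.tno <> t0.tno                            -- skip the token itself
WHERE t0.lemma LIKE 'be'
  AND t1.annotation LIKE '%.PROX'
ORDER BY rel;
```

With the sample data this yields one row: 'is' with its neighbour 'This' at rel = -1. Unlike the window-frame variant, this join never looks across line boundaries, because it requires t1.line = t0.line.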