Absolute position of leaves in a Python NLTK tree

I am trying to find the span (start index, end index) of a noun phrase in a given sentence. Below is the code that extracts the noun phrases:

import nltk

# `a` is the input sentence (given below)
sent = nltk.word_tokenize(a)
sent_pos = nltk.pos_tag(sent)
grammar = r"""
    NBAR:
        {<NN.*|JJ>*<NN.*>}  # Nouns and Adjectives, terminated with Nouns

    NP:
        {<NBAR>}
        {<NBAR><IN><NBAR>}  # Above, connected with in/of/etc...
    VP:
        {<VBD><PP>?}
        {<VBZ><PP>?}
        {<VB><PP>?}
        {<VBN><PP>?}
        {<VBG><PP>?}
        {<VBP><PP>?}
"""

cp = nltk.RegexpParser(grammar)
result = cp.parse(sent_pos)
nounPhrases = []
for subtree in result.subtrees(filter=lambda t: t.label() == 'NP'):
    # Each leaf is a (word, tag) pair; join the words into one phrase
    nounPhrases.append(' '.join(word for word, tag in subtree.leaves()))

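For a = "The American Civil War, also known as the War between the States or simply the Civil War, was a civil war fought from 1861 to 1865 in the United States after several Southern slave states declared their secession and formed the Confederate States of America.", the extracted noun phrases are: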

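['American Civil War', 'War', 'States', 'Civil War', 'civil war fought', 'United States', 'several Southern', 'states', 'secession', 'Confederate States', 'America']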

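Now I need to find the span of each noun phrase (the start and end position of the phrase). For example, the spans of the noun phrases above would be

[(1,3), (9,9), (12,12), (16,17), (21,23), ...]

I am new to NLTK and have done some digging. I tried using Tree.treepositions(), but I could not get absolute positions out of those indices. Any help would be greatly appreciated. Thanks!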

There is no built-in function that returns these offsets.

However, you can use an n-gram searcher, which returns the starting position of the query n-gram.
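As a rough illustration, and assuming the searcher meant here is `position_of_ngram` from `nltk.translate.ribes_score` (the token list and the span arithmetic below are my own additions, not part of the answer), the start position can be turned into an inclusive span like this:

```python
from nltk.translate.ribes_score import position_of_ngram

# Pre-tokenized sentence prefix (split on spaces to avoid needing tokenizer data).
tokens = "The American Civil War , also known as the War between the States".split()

# The query must be a tuple of tokens.
phrase = tuple("American Civil War".split())

# position_of_ngram returns the index of the first match only.
start = position_of_ngram(phrase, tokens)

# Inclusive (start, end) span derived from the phrase length.
span = (start, start + len(phrase) - 1)
print(span)
```

Note that only the first occurrence is reported, so a phrase that appears more than once needs extra handling.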

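Here is another approach: it augments each token in the tree string with its absolute position. The absolute positions can then be read off the leaves of any subtree.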

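from nltk.tree import ParentedTree

def add_indices_to_terminals(treestring):
    tree = ParentedTree.fromstring(treestring)
    for idx, _ in enumerate(tree.leaves()):
        # Tree position of the idx-th leaf; dropping the last step gives its parent
        tree_location = tree.leaf_treeposition(idx)
        non_terminal = tree[tree_location[:-1]]
        # Append the absolute index to the leaf token
        non_terminal[0] = non_terminal[0] + "_" + str(idx)
    return str(tree)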

Example usage:

>>> treestring = "(S (NP (NNP John)) (VP (V runs)))"
>>> add_indices_to_terminals(treestring)
'(S (NP (NNP John_0)) (VP (V runs_1)))'
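If you would rather not modify the leaves, a possible alternative is to map each leaf's tree position to its absolute index and look up the first and last leaf of every subtree with the label you want. This is only a sketch: `label_spans` is a hypothetical helper name, and the tree below is a hand-built stand-in for a chunker result, not real parser output.

```python
from nltk import Tree

def label_spans(tree, label):
    """Return inclusive (start, end) token spans of all subtrees with `label`."""
    # Absolute index of every leaf, keyed by its tree position.
    leaf_index = {pos: i for i, pos in enumerate(tree.treepositions("leaves"))}
    spans = []
    for pos in tree.treepositions():  # preorder traversal
        node = tree[pos]
        if isinstance(node, Tree) and node.label() == label:
            # Leaf positions relative to this subtree.
            rel = node.treepositions("leaves")
            if rel:  # skip subtrees without leaves
                spans.append((leaf_index[pos + rel[0]], leaf_index[pos + rel[-1]]))
    return spans

# Hand-built stand-in for a chunk tree; leaves are (word, tag) pairs.
chunked = Tree("S", [
    ("The", "DT"),
    Tree("NP", [Tree("NBAR", [("American", "NNP"), ("Civil", "NNP"), ("War", "NNP")])]),
    (",", ","),
    ("also", "RB"),
    Tree("NP", [Tree("NBAR", [("War", "NNP")])]),
])
print(label_spans(chunked, "NP"))
```

Because the lookup goes through tree positions rather than word strings, repeated words such as "War" get distinct spans.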

Token offsets for a constituency parse tree can be obtained with the following code:

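def get_tok_idx_of_tree(t, mapping_label_2_tok_idx, count_label, i):
    """Record the inclusive token span of every non-leaf node of `t`.

    mapping_label_2_tok_idx: dict filled in place, "<label>_<n>" -> (start, end)
    count_label: one-element list used as a mutable node counter, e.g. [0]
    i: index of `t` among its parent's children (pass 0 for the root)
    """
    if isinstance(t, str):
        return  # leaves are plain strings and carry no label

    if count_label[0] == 0:
        idx_start = 0  # the root starts at the first token
    else:
        # Relies on dict insertion order (Python 3.7+) and on every leaf having
        # its own preterminal: the last entry is the parent when i == 0,
        # otherwise a node that ends where the previous sibling ends.
        prev_start, prev_end = list(mapping_label_2_tok_idx.values())[-1]
        idx_start = prev_start if i == 0 else prev_end + 1

    idx_end = idx_start + len(t.leaves()) - 1
    mapping_label_2_tok_idx[t.label() + "_" + str(count_label[0])] = (idx_start, idx_end)
    count_label[0] += 1

    for child_idx, child in enumerate(t):
        get_tok_idx_of_tree(child, mapping_label_2_tok_idx, count_label, child_idx)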

Given a constituency tree, the code above produces:

{'ROOT_0': (0, 3), 'S_1': (0, 3), 'VP_2': (0, 2), 'VB_3': (0, 0), 'NP_4': (1, 2), 'DT_5': (1, 1), 'NN_6': (2, 2), '._7': (3, 3)}
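For comparison, here is a sketch of a variant that threads the running token offset through the recursion, so it does not depend on dict ordering. `node_spans` is a hypothetical name, and the tree string is a made-up sentence chosen only to have the same shape as the output shown here.

```python
from nltk import Tree

def node_spans(tree):
    """Map "<label>_<preorder index>" to the node's inclusive (start, end) span."""
    spans = {}
    counter = 0

    def walk(node, start):
        nonlocal counter
        if not isinstance(node, Tree):
            return start + 1  # a leaf consumes exactly one token
        key = f"{node.label()}_{counter}"
        counter += 1
        spans[key] = None  # reserve the slot so keys stay in preorder
        end = start
        for child in node:
            end = walk(child, end)  # `end` is the offset after this child
        spans[key] = (start, end - 1)
        return end

    walk(tree, 0)
    return spans

# Made-up example tree with the same shape as the output above.
example = Tree.fromstring("(ROOT (S (VP (VB Read) (NP (DT the) (NN book))) (. .)))")
print(node_spans(example))
```

The function returns the mapping instead of filling a caller-supplied dict, which keeps the call site to a single line.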

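Thanks for the reply! But in the case of the NP 'War', since the same word occurs several times in the sentence,

position_of_ngram(tuple('War'.split()), s)

will return the index of the first occurrence, i.e. 3, whereas the extracted NP is at index 9. Is there any way around this? Thanks again.

@Corleone, since there is no notion of an "offset", your best options are either to take the first instance of the n-gram or to recursively get the "next" instance.

@alvas, thanks! I think I will. I tried to accept the answer, but my karma is not good enough so far :/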
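One way to get at the "next" instances is to collect every occurrence instead of stopping at the first. The sketch below is plain Python and does not use NLTK; `ngram_spans` is a hypothetical helper name, and the sentence is tokenized with a simple split for illustration.

```python
def ngram_spans(phrase_tokens, sentence_tokens):
    """Return inclusive (start, end) spans of every occurrence of the phrase."""
    target = list(phrase_tokens)
    n = len(target)
    if n == 0:
        return []
    return [
        (i, i + n - 1)
        for i in range(len(sentence_tokens) - n + 1)
        if list(sentence_tokens[i:i + n]) == target
    ]

tokens = ("The American Civil War , also known as the War between "
          "the States or simply the Civil War").split()

print(ngram_spans(["War"], tokens))
print(ngram_spans(["Civil", "War"], tokens))
```

With all spans available, the extracted noun phrases can be matched in sentence order by picking, for each phrase, the first span that starts after the previously matched one.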
