Named Entity Recognition with Regular Expression: NLTK


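I have been playing with the NLTK toolkit. I run into this problem a lot and have searched for a solution online, but I have not found a satisfying answer, so I am posting my question here.

Many times, NER does not tag consecutive NNPs as one NE. I think editing the NER to use RegexpTagger could also improve it.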

Example:

Input:

Barack Obama is a great person.

Output:

Tree('S', [Tree('PERSON', [('Barack', 'NNP')]), Tree('ORGANIZATION', [('Obama', 'NNP')]), ('is', 'VBZ'), ('a', 'DT'), ('great', 'JJ'), ('person', 'NN'), ('.', '.')])

whereas

Input:

Former Vice President Dick Cheney told conservative radio host Laura Ingraham that he "was honored" to be compared to Darth Vader while in office.

Output:

Tree('S', [('Former', 'JJ'), ('Vice', 'NNP'), ('President', 'NNP'), Tree('NE', [('Dick', 'NNP'), ('Cheney', 'NNP')]), ('told', 'VBD'), ('conservative', 'JJ'), ('radio', 'NN'), ('host', 'NNP'), Tree('NE', [('Laura', 'NNP'), ('Ingraham', 'NNP')]), ('that', 'IN'), ('he', 'PRP'), ('``', '``'), ('was', 'VBD'), ('honored', 'VBN'), ("''", "''"), ('to', 'TO'), ('be', 'VB'), ('compared', 'VBN'), ('to', 'TO'), Tree('NE', [('Darth', 'NNP'), ('Vader', 'NNP')]), ('while', 'IN'), ('in', 'IN'), ('office', 'NN'), ('.', '.')])

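Here Vice/NNP, President/NNP, (Dick/NNP, Cheney/NNP) is correctly extracted.

So I think that if nltk.ne_chunk is used first, and two consecutive trees are then both NNP, there is a high chance that both refer to one entity.

Any suggestions would be really appreciated. I am looking for flaws in my approach.

Thanks.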
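
The idea in the question, treating a run of consecutive NNP tokens as one candidate entity, can be prototyped with an ordinary regular expression over the tag sequence. This is a minimal sketch, assuming the sentence is already POS-tagged as (token, tag) pairs, for example by pos_tag. The helper name merge_nnp_runs and the "<TAG>" string encoding are illustrative assumptions, not part of NLTK.

```python
import re

def merge_nnp_runs(tagged):
    """Group each run of consecutive NNP tokens into one candidate entity."""
    # Encode the tag sequence as a string, one "<TAG>" unit per token, so a
    # plain regular expression can be matched against the tags.
    tag_string = "".join("<%s>" % tag for _, tag in tagged)

    # Map the string offset where each unit starts back to its token index.
    start_to_index = {}
    offset = 0
    for index, (_, tag) in enumerate(tagged):
        start_to_index[offset] = index
        offset += len(tag) + 2  # tag plus the two angle brackets

    entities = []
    for match in re.finditer(r"(?:<NNP>)+", tag_string):
        first = start_to_index[match.start()]
        count = match.group().count("<NNP>")
        entities.append(" ".join(tok for tok, _ in tagged[first:first + count]))
    return entities

# Hand-tagged input, so the sketch runs without any NLTK models.
tagged = [("Barack", "NNP"), ("Obama", "NNP"), ("is", "VBZ"), ("a", "DT"),
          ("great", "JJ"), ("person", "NN"), (".", ".")]
print(merge_nnp_runs(tagged))  # ['Barack Obama']
```

Note that this rule is purely tag-based: on the second example sentence it would also merge Vice/NNP President/NNP Dick/NNP Cheney/NNP into a single span, which may or may not be what is wanted.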

from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

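def get_continuous_chunks(text):
    # ne_chunk returns a Tree whose children are either Tree nodes
    # (named entities) or plain (token, tag) tuples.
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []

    for i in chunked:
        if isinstance(i, Tree):
            # Adjacent entity subtrees accumulate into one candidate entity.
            current_chunk.append(" ".join([token for token, pos in i.leaves()]))
        elif current_chunk:
            # A non-entity token ends the run, so flush it.
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            # Reset even for a repeated entity, so it is not glued to the next run.
            current_chunk = []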

    # Flush a run that reaches the end of the text (check current_chunk here).
    if current_chunk:
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)

    return continuous_chunk

txt = "Barack Obama is a great person." 
print(get_continuous_chunks(txt))

[out]:

['Barack Obama']

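But do note that if the continuous chunks are not supposed to be a single NE, you would be combining multiple NEs into one. I can't think of such an example off the top of my head, but I'm sure it will happen. If the chunks are not contiguous, though, the script above works fine: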

>>> txt = "Barack Obama is the husband of Michelle Obama."  
>>> get_continuous_chunks(txt)
['Barack Obama', 'Michelle Obama']
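
The caveat about merging distinct entities can be reproduced without running the chunker. This is a minimal sketch, assuming chunker output is represented as a plain list in which entities are (label, [tokens]) pairs and all other tokens are strings. That hand-built representation, the example sentence, and the name merge_adjacent are illustrative assumptions standing in for the Tree that ne_chunk returns; the function mirrors only the merging step of the loop above, without deduplication.

```python
def merge_adjacent(chunks):
    """Join runs of adjacent entity chunks into single strings."""
    merged, run = [], []
    for item in chunks:
        if isinstance(item, tuple):   # an entity chunk: (label, tokens)
            run.extend(item[1])
        elif run:                     # a plain token ends the current run
            merged.append(" ".join(run))
            run = []
    if run:                           # flush a run at the end of the input
        merged.append(" ".join(run))
    return merged

# Hand-built example: two different people happen to be adjacent.
chunks = ["He", "showed", ("PERSON", ["Michelle"]),
          ("PERSON", ["Barack", "Obama"]), "'s", "speech", "."]
print(merge_adjacent(chunks))  # ['Michelle Barack Obama']
```

Two separate PERSON chunks come out as one string here, which is exactly the over-merging risk described above.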

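@alvas Great answer, it was really helpful. I have tried to capture your solution in a more functional way. It still needs improvement, though.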

from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

def conditions(tree_node):
    # Entity subtrees sit directly above their leaves, so their height is 2.
    # The root 'S' also has height 2 when a sentence has no entities, so skip it.
    return tree_node.height() == 2 and tree_node.label() != 'S'

def continuous_entities(input_text, file_handle=None):
    # Note: Currently, the chunker categorizes only 2 'NNP' together.
    all_entities = []
    for line in input_text.split('\n'):
        chunked_data = ne_chunk(pos_tag(word_tokenize(line)))
        # subtrees() yields only Tree nodes, so no extra type check is needed.
        named_entities = [" ".join(token for token, pos in child.leaves())
                          for child in chunked_data.subtrees(filter=conditions)]

        # Dump all entities to file for now, we will see how to go about that
        if file_handle is not None:
            file_handle.write('\n'.join(named_entities) + '\n')
        all_entities.extend(named_entities)
    # Return the entities from every line, not just the last one.
    return all_entities

Using the conditions function, you can add many conditions to the filter.
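
One way to stack several conditions is to build the filter from small predicates. This is a minimal sketch: all_of, is_flat, and has_label are hypothetical helper names, and FakeNode is a stand-in that exposes only the height() and label() methods that nltk.tree.Tree also provides, so the example runs without NLTK.

```python
def all_of(*predicates):
    """Return a filter that accepts a node only if every predicate does."""
    def combined(node):
        return all(pred(node) for pred in predicates)
    return combined

def is_flat(node):
    # Height 2 means all of the node's children are leaves.
    return node.height() == 2

def has_label(*labels):
    """Return a predicate that checks the node label against the given labels."""
    def check(node):
        return node.label() in labels
    return check

class FakeNode:
    """Stand-in for a tree node, for illustration only."""
    def __init__(self, label, height):
        self._label, self._height = label, height

    def label(self):
        return self._label

    def height(self):
        return self._height

person_or_org = all_of(is_flat, has_label("PERSON", "ORGANIZATION"))
print(person_or_org(FakeNode("PERSON", 2)))  # True
print(person_or_org(FakeNode("S", 2)))       # False
```

With a real chunk tree, the combined function could be passed as the filter argument of subtrees() in place of a single conditions function.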

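There was a bug in @alvas's answer: a fencepost error. Make sure to run the elif check once more after the loop, so that you do not miss an NE that occurs at the end of the sentence. So: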

from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

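def get_continuous_chunks(text):
    # ne_chunk returns a Tree whose children are either Tree nodes
    # (named entities) or plain (token, tag) tuples.
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []

    for i in chunked:
        if isinstance(i, Tree):
            # Adjacent entity subtrees accumulate into one candidate entity.
            current_chunk.append(" ".join([token for token, pos in i.leaves()]))
        elif current_chunk:
            # A non-entity token ends the run, so flush it.
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            # Reset even for a repeated entity, so it is not glued to the next run.
            current_chunk = []
    # Fencepost fix: flush a run that reaches the end of the text.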
    if current_chunk:
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)
            current_chunk = []
    return continuous_chunk

txt = "Barack Obama is a great person and so is Michelle Obama." 
print(get_continuous_chunks(txt))

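Thanks for the nice code, but do you see any flaw in combining consecutive NNPs into one named entity? I can't think of an example right away, but I'm sure there will be consecutive NPs that should not be a single NE.

Thanks for the answer. I think one class of possible examples would involve ditransitive verbs, e.g. "He quoted Michelle Barack Obama's words", though such cases are admittedly rare ;P

He quoted Michelle and Barack Obama as saying: "Is Barack Obama doing well?" returns "Is Barack Obama doing well?". How do you solve this?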