Regex: Recognizing named entities with regular expressions: NLTK [regex, nlp, nltk, named-entity-recognition]


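I have been playing with the NLTK toolkit. I run into this problem a lot and have searched online for a solution, but found no satisfying answer, so I am posting my question here.

Many times NER does not tag consecutive NNPs as one NE. I think editing the NER to use RegexpTagger could also improve it.

Example:

Input:

Barack Obama is a great person.

Output: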

Tree('S', [Tree('PERSON', [('Barack', 'NNP')]), Tree('ORGANIZATION', [('Obama', 'NNP')]), ('is', 'VBZ'), ('a', 'DT'), ('great', 'JJ'), ('person', 'NN'), ('.', '.')])

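whereas

Input:

Former Vice President Dick Cheney told conservative radio host Laura Ingraham that he "was honored" to be compared to Darth Vader while in office.

Output: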

Tree('S', [('Former', 'JJ'), ('Vice', 'NNP'), ('President', 'NNP'), Tree('NE', [('Dick', 'NNP'), ('Cheney', 'NNP')]), ('told', 'VBD'), ('conservative', 'JJ'), ('radio', 'NN'), ('host', 'NNP'), Tree('NE', [('Laura', 'NNP'), ('Ingraham', 'NNP')]), ('that', 'IN'), ('he', 'PRP'), ('``', '``'), ('was', 'VBD'), ('honored', 'VBN'), ("''", "''"), ('to', 'TO'), ('be', 'VB'), ('compared', 'VBN'), ('to', 'TO'), Tree('NE', [('Darth', 'NNP'), ('Vader', 'NNP')]), ('while', 'IN'), ('in', 'IN'), ('office', 'NN'), ('.', '.')])

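Here Vice/NNP, President/NNP, (Dick/NNP, Cheney/NNP) is extracted correctly.

So I think that if nltk.ne_chunk is used first, and two consecutive trees are both NNP, there is a high chance that both refer to one entity.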
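
The idea of treating a run of consecutive NNP tokens as one candidate entity can be sketched with a plain regular expression over the tag sequence. This is only an illustration under assumptions: `nnp_runs` is a hypothetical helper, the tagged tokens are typed in by hand from the first example, and it does not use NLTK's RegexpTagger or ne_chunk at all.

```python
import re

def nnp_runs(tagged_tokens):
    """Return each maximal run of consecutive NNP tokens as one string."""
    # Encode every token as one character so that regex match offsets
    # line up with token positions: 'N' for NNP, 'o' for anything else.
    encoded = "".join("N" if tag == "NNP" else "o" for _, tag in tagged_tokens)
    return [
        " ".join(token for token, _ in tagged_tokens[m.start():m.end()])
        for m in re.finditer(r"N+", encoded)
    ]

# Hand-typed (token, tag) pairs taken from the first example above.
tagged = [("Barack", "NNP"), ("Obama", "NNP"), ("is", "VBZ"), ("a", "DT"),
          ("great", "JJ"), ("person", "NN"), (".", ".")]
print(nnp_runs(tagged))  # ['Barack Obama']
```

Applied to the tags of the second sentence, the same pattern would fold Vice/NNP and President/NNP into one run with Dick Cheney, which is one reason to start from the chunks that ne_chunk produces rather than from raw tags.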

Any suggestions would be much appreciated. I am looking for flaws in my approach.

Thanks.

from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

def get_continuous_chunks(text):
    # Tokenize, POS-tag, and chunk the text into a tree with named-entity subtrees.
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []

    for i in chunked:
        if isinstance(i, Tree):
            # Collect the tokens of each adjacent named-entity subtree.
            current_chunk.append(" ".join([token for token, pos in i.leaves()]))
        elif current_chunk:
            # A non-entity token ends the current run of adjacent subtrees.
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            # Reset even when the entity was already recorded.
            current_chunk = []

    # Flush a run that reaches the end of the text.
    if current_chunk:
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)

    return continuous_chunk

txt = "Barack Obama is a great person."
print(get_continuous_chunks(txt))
[out]:

['Barack Obama']
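Note, however, that if the contiguous chunks are not supposed to be a single NE, you will be combining multiple NEs into one. I cannot think of such an example off the top of my head, but I am sure it will happen. If the NEs are not contiguous, the script above works fine: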

>>> txt = "Barack Obama is the husband of Michelle Obama."  
>>> get_continuous_chunks(txt)
['Barack Obama', 'Michelle Obama']

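@alvas Great answer, it was really helpful. I have tried to capture your solution in a more functional way. It still needs improvement, though.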

from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

def conditions(tree_node):
    # A subtree whose children are all (token, tag) leaves has height 2.
    return tree_node.height() == 2

def continuous_entities(input_text, file_handle=None):
    # Note: Currently, the chunker categorizes only 2 'NNP' together.
    all_entities = []
    for line in input_text.split('\n'):
        chunked_data = ne_chunk(pos_tag(word_tokenize(line)))
        child_data = [subtree for subtree in chunked_data.subtrees(filter=conditions)]

        named_entities = []
        for child in child_data:
            if isinstance(child, Tree):
                named_entities.append(" ".join([token for token, pos in child.leaves()]))

        # Dump all entities to file for now, we will see how to go about that
        if file_handle is not None:
            file_handle.write('\n'.join(named_entities) + '\n')
        # Keep the entities of every line, not only those of the last one.
        all_entities.extend(named_entities)
    return all_entities

Using the conditions function, you can add many conditions to the filter.
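
One possible way to stack several conditions into a single filter is to combine small predicates, sketched below. Everything here is an assumption for illustration: `is_leaf_chunk`, `has_label`, `all_of` and `FakeNode` are hypothetical names, and `FakeNode` only mimics the `height()` and `label()` methods of nltk.tree.Tree so the sketch runs without NLTK.

```python
def is_leaf_chunk(node):
    # Same check as conditions() above: the node holds only leaves.
    return node.height() == 2

def has_label(*labels):
    # Build a predicate that accepts nodes carrying one of the given labels.
    wanted = set(labels)
    return lambda node: node.label() in wanted

def all_of(*predicates):
    # Combine several predicates into one callable that requires all of them.
    return lambda node: all(p(node) for p in predicates)

class FakeNode:
    """Hypothetical stand-in for nltk.tree.Tree (height() and label() only)."""
    def __init__(self, label, height):
        self._label, self._height = label, height
    def label(self):
        return self._label
    def height(self):
        return self._height

conditions = all_of(is_leaf_chunk, has_label("PERSON", "ORGANIZATION"))
nodes = [FakeNode("S", 3), FakeNode("PERSON", 2), FakeNode("GPE", 2)]
print([n.label() for n in nodes if conditions(n)])  # ['PERSON']
```

With real NLTK trees, a combined predicate like this could be passed as `chunked_data.subtrees(filter=conditions)` in the same way as the single-condition version.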

@alvas's answer had a bug: a fencepost error. Make sure to run the elif check outside the loop as well, so that an NE at the end of the sentence is not left out. So:

from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

def get_continuous_chunks(text):
    # Tokenize, POS-tag, and chunk the text into a tree with named-entity subtrees.
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []

    for i in chunked:
        if isinstance(i, Tree):
            # Collect the tokens of each adjacent named-entity subtree.
            current_chunk.append(" ".join([token for token, pos in i.leaves()]))
        elif current_chunk:
            # A non-entity token ends the current run of adjacent subtrees.
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            # Reset even when the entity was already recorded.
            current_chunk = []
    # Same check once more after the loop, for a run that ends the text.
    if current_chunk:
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)
    return continuous_chunk

txt = "Barack Obama is a great person and so is Michelle Obama."
print(get_continuous_chunks(txt))
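
The effect of the final check can be seen without running NLTK, using a hand-built stand-in for the chunker output. This is a sketch under assumptions: `merge_adjacent` and its `flush_at_end` switch are hypothetical, and the input list (a named-entity chunk is a list of tokens, any other token is a plain string) is typed by hand rather than produced by ne_chunk.

```python
def merge_adjacent(items, flush_at_end=True):
    # items: hypothetical stand-in for ne_chunk output, where a named-entity
    # chunk is a list of tokens and every other token is a plain string.
    merged, current = [], []
    for item in items:
        if isinstance(item, list):
            current.extend(item)
        elif current:
            merged.append(" ".join(current))
            current = []
    # Without this last step, a run that ends the input is silently dropped.
    if flush_at_end and current:
        merged.append(" ".join(current))
    return merged

# The input ends with a chunk (no trailing punctuation token).
sentence = [["Barack"], ["Obama"], "is", "great", "and", "so", "is",
            ["Michelle"], ["Obama"]]
print(merge_adjacent(sentence, flush_at_end=False))  # ['Barack Obama']
print(merge_adjacent(sentence))  # ['Barack Obama', 'Michelle Obama']
```

The loss only shows up when the last element is itself a chunk, for example text with no final punctuation; a trailing "." token would trigger the in-loop check instead.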

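Thanks for the nice code, but do you see any flaw in combining consecutive NNPs into one named entity? I cannot think of an example right away, but I am sure there will be consecutive NNPs that should not be one NE. Thanks for your answer.

I think one class of possible examples would involve ditransitive verbs, e.g. "He quoted Michelle Barack Obama's words", though such cases are indeed rare ;P

He quoted Michelle and Barack Obama as saying "Did Barack Obama do well?" in reply to "Did Barack Obama do well?". How would you solve this?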
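
The over-merging risk raised in these comments can be reproduced on hand-built data. The sketch below is an assumption-laden illustration: `merge_adjacent_chunks` is a hypothetical helper, and the chunking of the sentence is invented for the demonstration (a named-entity chunk is a list of tokens, any other token is a plain string), not real ne_chunk output.

```python
from itertools import groupby

def merge_adjacent_chunks(items):
    # Join every run of neighbouring chunks into one string, which is the
    # merging rule discussed in this thread.
    merged = []
    for is_chunk, group in groupby(items, key=lambda item: isinstance(item, list)):
        if is_chunk:
            merged.append(" ".join(token for chunk in group for token in chunk))
    return merged

# Invented chunking: two different people happen to sit next to each other.
items = ["He", "quoted", ["Michelle"], ["Barack", "Obama"], "'s", "words", "."]
print(merge_adjacent_chunks(items))  # ['Michelle Barack Obama']
```

Adjacency alone cannot tell the two people apart here, so any fix would need extra evidence beyond position, such as syntax or a list of known names.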