Named entity recognition in Python does not correctly recognize multi-word names


python python-3.x nlp


I am trying to use NLTK in Python to perform named entity recognition. However, it does not do what I want it to do.

I have an input file like the following:

magic johnson is the owner of l.a. lakers.
ceo kevin johnson citing issues related to covid 19 as the reason for the deal’s termination.
susan somersille johnson went to school to be an engineer, yet somehow managed to carve out a nearly 30 year career in marketing.

My current code is:

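import nltk

# Read the whole input file into one string.
with open("test.txt", "r") as inFile:
    sentence = inFile.read()

# Tokenize, POS-tag and NE-chunk each sentence, then print only the labelled chunks.
for sent in nltk.sent_tokenize(sentence):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        # Only subtrees (recognized entities) have label(); plain tokens are (word, tag) tuples.
        if hasattr(chunk, 'label'):
            print(chunk.label(), ' '.join(c[0] for c in chunk))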

However, this misidentifies the names and organizations and outputs only:

ORGANIZATION l.a.

Why doesn't this work? How can I get it to work?

My expected output is something like this:

magic johnson
l.a. lakers
Kevin johnson
etc.

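How do you define "work"? What do you expect it to do here? If you print the individual chunks before the hasattr call, you'll see that it correctly identified all the nouns, verbs, and other parts of the sentence. NLTK is an extremely complex tool. I think you have more reading to do.

@TimRoberts I mean that I need the output to capture names and other named entities, such as "magic johnson" and "l.a. lakers". But it doesn't here.

@TimRoberts, as you mentioned, if you look at the output that way, it gives you ('magic', 'JJ') and ('johnson', 'NN').

You told it to split the text into words. If you want it to group words into larger names, you'll have to use a different kind of tokenization. Browse the NLTK manual. It has many submodules that can do many different kinds of semantic analysis.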
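The two suggestions in the comments (look at every chunk, then group words by some other rule) could be sketched roughly as follows. This is only an illustration, not a fix confirmed by anyone in the thread: the function names, the NAME label, and the chunk grammar are hypothetical choices made for this example. The grammar groups any run of adjectives and nouns, so it will also pick up ordinary noun phrases such as "owner", not just names. nltk.RegexpParser is one of NLTK's tag-based chunkers; whether it suits this data is an assumption.

```python
def labelled_chunks(tree):
    """Return (label, text) pairs for every labelled subtree of a chunk tree."""
    found = []
    for chunk in tree:
        # Plain tokens are (word, tag) tuples and have no label();
        # only grouped subtrees do.
        if hasattr(chunk, "label"):
            text = " ".join(word for word, _tag in chunk)
            found.append((chunk.label(), text))
    return found


def inspect_text(text, grammar="NAME: {<JJ|NN.*>+}"):
    """Print the tagged tokens of each sentence, then the tag-based groups.

    The default grammar is a hypothetical example, not a tuned rule.
    """
    # Imported here so labelled_chunks() can be used on its own.
    # The sentence tokenizer and POS tagger need their NLTK data packages,
    # installed beforehand with nltk.download(...).
    import nltk

    chunker = nltk.RegexpParser(grammar)
    for sent in nltk.sent_tokenize(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sent))
        print(tagged)  # shows which tag each lowercase word received
        print(labelled_chunks(chunker.parse(tagged)))
```

Calling inspect_text(open("test.txt").read()) prints the tags first, which makes it easy to see why a given word was or was not grouped, and labelled_chunks() also works unchanged on the tree returned by nltk.ne_chunk.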