Python: improving the extraction of human names with nltk


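I am trying to extract human names from text.

Does anyone have a method they would recommend?

This is what I tried (code is below): I am using nltk to find everything tagged as a person and then build a list of all the NNP parts of that person. I skip persons that have only one NNP, which avoids grabbing a lone surname.

I am getting decent results, but I wonder whether there are better ways to solve this problem.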
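The filtering step described above can be sketched without any NLTK dependency. This is not the asker's code; it is a minimal illustration that assumes the chunker output has already been flattened into a hypothetical list of `(label, [(word, pos_tag), ...])` pairs, and the function name `last_first_names` is made up for the example.

```python
def last_first_names(chunks):
    """Return 'LAST, FIRST' strings for PERSON chunks with two or more NNP tokens.

    `chunks` is a hypothetical, simplified stand-in for chunker output:
    a list of (label, [(word, pos_tag), ...]) pairs.
    """
    results = []
    for label, leaves in chunks:
        if label != "PERSON":
            continue
        # Keep only the proper-noun parts of the chunk.
        nnps = [word for word, pos in leaves if pos == "NNP"]
        if len(nnps) < 2:  # skip lone surnames
            continue
        formatted = "{}, {}".format(nnps[-1], nnps[0])
        if formatted not in results:  # preserve order, drop repeats
            results.append(formatted)
    return results


chunks = [
    ("PERSON", [("Francois", "NNP"), ("R.", "NNP"), ("Velde", "NNP")]),
    ("PERSON", [("Branson", "NNP")]),
    ("ORGANIZATION", [("Federal", "NNP"), ("Reserve", "NNP")]),
    ("PERSON", [("Richard", "NNP"), ("Branson", "NNP")]),
]
print(last_first_names(chunks))  # ['Velde, Francois', 'Branson, Richard']
```

Taking the first and last NNP is a simplification: middle names and initials are dropped from the formatted result.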

Code:

Output:

LAST, FIRST
Velde, Francois
Branson, Richard
Galactic, Virgin
Krugman, Paul
Summers, Larry
Colas, Nick

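Apart from Virgin Galactic, this is all valid output. Of course, knowing that Virgin Galactic is not a person's name in the context of this article is the hard (possibly impossible) part.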

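You could try resolving the names you find and checking whether they appear in a database such as freebase.com. Either get the data locally and query it (it is in RDF), or use Google's API. Based on the Freebase data, most large companies, geographic locations, and similar entities that your snippet might catch could then be discarded.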
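As a rough sketch of that idea, the candidate names can be checked against a set of entities already known not to be people. The set `KNOWN_NON_PERSONS` below is a hypothetical local stand-in for a real knowledge-base lookup, and `drop_known_non_persons` is a name invented for this example.

```python
def drop_known_non_persons(candidates, non_person_entities):
    """Return the candidates that are not listed as known non-person entities."""
    # Compare case-insensitively so "virgin galactic" also matches.
    blocked = {name.casefold() for name in non_person_entities}
    return [name for name in candidates if name.casefold() not in blocked]


# Hypothetical local table standing in for a knowledge-base query.
KNOWN_NON_PERSONS = {"Virgin Galactic", "Federal Reserve", "ConvergEx Group"}

candidates = ["Francois Velde", "Richard Branson", "Virgin Galactic", "Paul Krugman"]
print(drop_known_non_persons(candidates, KNOWN_NON_PERSONS))
# ['Francois Velde', 'Richard Branson', 'Paul Krugman']
```

How well this works depends entirely on the coverage of whatever database backs the lookup.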

I have to agree that "make my code better" requests are not a good fit for this site, but I can give you some directions you could dig into.

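Take a look at the Stanford Named Entity Recognizer (NER). Its binding has been included in NLTK v2.0, but you must download some core files. There is a script that can do all of that for you.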

I wrote this script:

import nltk
# Note: in later NLTK releases this class is named StanfordNERTagger.
from nltk.tag.stanford import NERTagger

st = NERTagger('stanford-ner/all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')
text = """YOUR TEXT GOES HERE"""

for sent in nltk.sent_tokenize(text):
    tokens = nltk.tokenize.word_tokenize(sent)
    tags = st.tag(tokens)
    for tag in tags:
        # Print only the tokens tagged as part of a person's name.
        if tag[1] == 'PERSON':
            print(tag)

and got fairly good results:

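('Francois', 'PERSON')
('R.', 'PERSON')
('Velde', 'PERSON')
('Richard', 'PERSON')
('Branson', 'PERSON')
('Virgin', 'PERSON')
('Galactic', 'PERSON')
('Bitcoin', 'PERSON')
('Bitcoin', 'PERSON')
('Paul', 'PERSON')
('Krugman', 'PERSON')
('Larry', 'PERSON')
('Summers', 'PERSON')
('Bitcoin', 'PERSON')
('Nick', 'PERSON')
('Colas', 'PERSON')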


Hope this helps.
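The tagger labels one token at a time, so a full name arrives as several separate PERSON tokens. One possible way to join them, assuming the complete per-sentence tag list is available (with non-entity tokens labelled `O`), is to merge consecutive tokens that share the wanted label. The helper name `group_entities` is hypothetical.

```python
from itertools import groupby


def group_entities(tagged_tokens, wanted="PERSON"):
    """Join runs of consecutive tokens carrying the `wanted` label into one string."""
    names = []
    # groupby yields one group per run of adjacent tokens with the same label.
    for label, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if label == wanted:
            names.append(" ".join(token for token, _ in run))
    return names


tags = [
    ("Francois", "PERSON"), ("R.", "PERSON"), ("Velde", "PERSON"),
    (",", "O"), ("senior", "O"),
    ("Richard", "PERSON"), ("Branson", "PERSON"),
    ("announced", "O"), ("Bitcoin", "PERSON"),
]
print(group_entities(tags))  # ['Francois R. Velde', 'Richard Branson', 'Bitcoin']
```

This keeps middle names and initials together, but two different people mentioned back to back with nothing in between would be merged into a single name, and mis-tagged tokens such as "Bitcoin" still come through.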

For anyone else looking, I found this article useful:


This worked for me. To get it to run, I only had to change one line, from

    for subtree in sentt.subtrees(filter=lambda t: t.node == 'PERSON'):
to

    for subtree in sentt.subtrees(filter=lambda t: t.label() == 'PERSON'):
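If the same script has to run against both older and newer NLTK releases, a small accessor can hide the difference. This is only a sketch: `tree_label` is a hypothetical helper, and the two stub classes below merely imitate the old (`.node` attribute) and new (`.label()` method) tree interfaces so the example runs without NLTK installed.

```python
def tree_label(tree):
    """Return a tree's label via label() when available, else the legacy .node."""
    label = getattr(tree, "label", None)
    if callable(label):
        return label()
    return tree.node


class OldStyleTree:
    """Stub imitating a tree that exposes its label as the attribute .node."""
    node = "PERSON"


class NewStyleTree:
    """Stub imitating a tree that exposes its label through label()."""
    def label(self):
        return "PERSON"


print(tree_label(OldStyleTree()), tree_label(NewStyleTree()))  # PERSON PERSON
```

The filter would then read `filter=lambda t: tree_label(t) == 'PERSON'`.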

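The output has flaws (for example, it identified "money laundering" as a person), but with my data a name database may not be reliable.

The answer from @特洛伊 did not quite work for me, but it helped a lot in getting to this one.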

Prerequisites

Create a folder stanford-ner and download the following two files into it:

  • (find and extract the archive)

Script

Results
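Actually, I only wanted to extract person names, so I decided to check every name in the output against WordNet (a large lexical database of English). More information on WordNet can be found at:

Output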

['Francois R. Velde', 'Richard Branson', 'Economist Paul Krugman', 'Nick Colas']

All the names are correct except that Larry Summers is missing; he was dropped because his surname, "Summers", is found in WordNet.
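A dictionary check of this kind can be written so that it builds a new list instead of removing items from the list it is iterating over, which can cause entries to be skipped. The sketch below takes the word lookup as a parameter; in the NLTK setting it could be something like `lambda w: bool(wordnet.synsets(w))`. The toy word set and the names `filter_names` and `toy_lookup` are hypothetical.

```python
def filter_names(candidates, is_dictionary_word):
    """Keep only the candidates in which no part is an ordinary dictionary word."""
    kept = []
    for full_name in candidates:
        if any(is_dictionary_word(part) for part in full_name.split()):
            continue  # at least one part looks like a common word
        kept.append(full_name)
    return kept


# Hypothetical stand-in for a WordNet lookup.
COMMON_WORDS = {"virgin", "economist", "summers"}


def toy_lookup(word):
    return word.lower() in COMMON_WORDS


candidates = [
    "Francois R. Velde",
    "Richard Branson",
    "Virgin Galactic",
    "Economist Larry Summers",
]
print(filter_names(candidates, toy_lookup))  # ['Francois R. Velde', 'Richard Branson']
```

The trade-off stays the same as in the answer: any real name that happens to contain a dictionary word is discarded along with the false positives.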

I want to post a brutal and greedy solution here for the problem raised by @fander: getting a person's full name whenever possible.

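The capitalization of the first character of each name is used as a criterion for recognizing a PERSON in spaCy. For example, "jim hoffman" on its own will not be recognized as a named entity, while "Jim Hoffman" will be.

So if the task is simply to pick out the persons in a script, we can capitalize the first letter of every word and then dump the result into spaCy.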

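import spacy

def capitalizeWords(text):
    # Upper-case the first letter of every word, one sentence per line.
    newText = ''
    for sentence in text.split('.'):
        newSentence = ''
        for word in sentence.split():
            newSentence += word[:1].upper() + word[1:] + ' '
        newText += newSentence + '\n'
    return newText

nlp = spacy.load('en_core_web_md')
doc = nlp(capitalizeWords(rawText))  # rawText: the input script text
#......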

Note that this approach covers full names at the cost of more false positives.
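What happens after the document is parsed is not shown in the answer. One plausible continuation, offered only as an assumption, is to collect the text of every entity whose label is `PERSON` from `doc.ents`. The stub objects below imitate just the `.ents`, `.text` and `.label_` attributes so the sketch runs without spaCy or a model; `person_entities` is a hypothetical helper name.

```python
from types import SimpleNamespace


def person_entities(doc):
    """Return the distinct PERSON entity texts of a spaCy-like doc, in order."""
    seen = []
    for ent in doc.ents:
        if ent.label_ == "PERSON" and ent.text not in seen:
            seen.append(ent.text)
    return seen


# Stub standing in for a parsed document; a real one would come from nlp(...).
fake_doc = SimpleNamespace(ents=[
    SimpleNamespace(text="Jim Hoffman", label_="PERSON"),
    SimpleNamespace(text="Chicago", label_="GPE"),
    SimpleNamespace(text="Jim Hoffman", label_="PERSON"),
])
print(person_entities(fake_doc))  # ['Jim Hoffman']
```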

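While interesting, it is not clear what the actual question is here. "Improve my code" requests are not a good fit for this site.

Thanks. Basically my question is: I want to extract names from text. This is what I tried; it works okay, but not fantastically. Would anyone recommend other ways to solve the problem? I will edit the question to improve it.

Thanks for sharing. I could use your code, but I ran into two errors that I had to fix. First I got the error
    SyntaxError: Non-ASCII character ... but no encoding declared
which was fixed by adding this on line 1:
    # -*- coding: UTF-8 -*-
Then I got the error
    NotImplementedError("Use label() to access a node label.")
which was fixed by removing "node" from line 17, like so:
    for subtree in sentt.subtrees(filter=lambda t: t.label() == 'PERSON'):

If you wish to use this code today, make sure to put these after the import statements: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger'); nltk.download('maxent_ne_chunker'); nltk.download('words'). Also, make sure to replace t.node with t.label().

He wants the output as first name and last name. NER will only provide the PERSON tag.

This solution gives the first name and last name separately rather than grouped together. If there is a middle name, we run into problems. Even worse, if a name consists of four words, it will be split into two names if we only combine two consecutive words found as a name. This does not answer the question.

Thanks! Does this also work for names in other languages? If not, how can that be done? I am working with Indian names.

This part did not go through: "from nltk.tag.stanford import NERTagger"

@tursunWali Sorry to hear that. This answer is already 7 years old. It definitely needs updating for newer Python and NLTK versions.

This API has been retired.

If a name has a middle name, it will not hold up.

I ran into this error with a newer version of NLTK: NotImplementedError: Use label() to access a node label. I solved it by changing the last two lines to the following: if hasattr(chunk, 'label'): print(chun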
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import nltk
from nltk.tag.stanford import StanfordNERTagger

text = u"""
Some economists have responded positively to Bitcoin, including
Francois R. Velde, senior economist of the Federal Reserve in Chicago
who described it as "an elegant solution to the problem of creating a
digital currency." In November 2013 Richard Branson announced that
Virgin Galactic would accept Bitcoin as payment, saying that he had invested
in Bitcoin and found it "fascinating how a whole new global currency
has been created", encouraging others to also invest in Bitcoin.
Other economists commenting on Bitcoin have been critical.
Economist Paul Krugman has suggested that the structure of the currency
incentivizes hoarding and that its value derives from the expectation that
others will accept it as payment. Economist Larry Summers has expressed
a "wait and see" attitude when it comes to Bitcoin. Nick Colas, a market
strategist for ConvergEx Group, has remarked on the effect of increasing
use of Bitcoin and its restricted supply, noting, "When incremental
adoption meets relatively fixed supply, it should be no surprise that
prices go up. And that’s exactly what is happening to BTC prices."
"""

st = StanfordNERTagger('stanford-ner/english.all.3class.distsim.crf.ser.gz',
                       'stanford-ner/stanford-ner.jar')

for sent in nltk.sent_tokenize(text):
    tokens = nltk.tokenize.word_tokenize(sent)
    tags = st.tag(tokens)
    for tag in tags:
        if tag[1] in ["PERSON", "LOCATION", "ORGANIZATION"]:
            print(tag)

(u'Bitcoin', u'LOCATION')       # Wrong
(u'Francois', u'PERSON')
(u'R.', u'PERSON')
(u'Velde', u'PERSON')
(u'Federal', u'ORGANIZATION')
(u'Reserve', u'ORGANIZATION')
(u'Chicago', u'LOCATION')
(u'Richard', u'PERSON')
(u'Branson', u'PERSON')
(u'Virgin', u'PERSON')         # Wrong
(u'Galactic', u'PERSON')       # Wrong
(u'Bitcoin', u'PERSON')        # Wrong
(u'Bitcoin', u'LOCATION')      # Wrong
(u'Bitcoin', u'LOCATION')      # Wrong
(u'Paul', u'PERSON')
(u'Krugman', u'PERSON')
(u'Larry', u'PERSON')
(u'Summers', u'PERSON')
(u'Bitcoin', u'PERSON')        # Wrong
(u'Nick', u'PERSON')
(u'Colas', u'PERSON')
(u'ConvergEx', u'ORGANIZATION')
(u'Group', u'ORGANIZATION')     
(u'Bitcoin', u'LOCATION')       # Wrong
(u'BTC', u'ORGANIZATION')       # Wrong

import nltk
from nameparser.parser import HumanName
from nltk.corpus import wordnet


person_list = []
# person_names is an alias of person_list (the same list object), not a copy.
person_names = person_list
def get_human_names(text):
    tokens = nltk.tokenize.word_tokenize(text)
    pos = nltk.pos_tag(tokens)
    sentt = nltk.ne_chunk(pos, binary = False)

    person = []
    name = ""
    for subtree in sentt.subtrees(filter=lambda t: t.label() == 'PERSON'):
        for leaf in subtree.leaves():
            person.append(leaf[0])
        if len(person) > 1: #avoid grabbing lone surnames
            for part in person:
                name += part + ' '
            if name[:-1] not in person_list:
                person_list.append(name[:-1])
            name = ''
        person = []
#     print (person_list)

text = """

Some economists have responded positively to Bitcoin, including 
Francois R. Velde, senior economist of the Federal Reserve in Chicago 
who described it as "an elegant solution to the problem of creating a 
digital currency." In November 2013 Richard Branson announced that 
Virgin Galactic would accept Bitcoin as payment, saying that he had invested 
in Bitcoin and found it "fascinating how a whole new global currency 
has been created", encouraging others to also invest in Bitcoin.
Other economists commenting on Bitcoin have been critical. 
Economist Paul Krugman has suggested that the structure of the currency 
incentivizes hoarding and that its value derives from the expectation that 
others will accept it as payment. Economist Larry Summers has expressed 
a "wait and see" attitude when it comes to Bitcoin. Nick Colas, a market 
strategist for ConvergEx Group, has remarked on the effect of increasing 
use of Bitcoin and its restricted supply, noting, "When incremental 
adoption meets relatively fixed supply, it should be no surprise that 
prices go up. And that’s exactly what is happening to BTC prices."
"""

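# get_human_names() fills the global person_list; it has no return value,
# so names is None.
names = get_human_names(text)

# Remove every candidate that contains a word known to WordNet.
# Caution: person_names is the same list object as person_list, so items are
# removed from the list being iterated, and the entry that follows each
# removed one is skipped.
for person in person_list:
    person_split = person.split(" ")
    for name in person_split:
        if wordnet.synsets(name):
            if(name in person):
                person_names.remove(person)
                break

print(person_names)

['Francois R. Velde', 'Richard Branson', 'Economist Paul Krugman', 'Nick Colas']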