How to extract person and organization names from text using Python


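I'm curious what an accurate way to extract the names of people and organizations from text would be. I want to build an affiliation network from the text, based on collaborations and similar relationships.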

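I have tried a few approaches:
• Using nltk POS tagging: it ran too slowly, so I gave up on it.
• Using a regular expression that matches runs of consecutive words with an initial capital letter. However, this produces many exceptions and captures, a lot of which are not very relevant (for example, when someone arbitrarily capitalizes "Social Innovation Award"). It also misses names that consist of a single word.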
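For reference, the regex approach described above can be sketched roughly as follows. The pattern and the sample sentence are illustrative assumptions, not the asker's actual code; the sketch only shows why the approach both over-captures and misses single-word names.

```python
import re

# Two or more consecutive words, each starting with a capital letter
# followed by lowercase letters (an assumed pattern, for illustration only).
CAPITALIZED_RUN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")

def find_capitalized_runs(text):
    """Return every run of two or more capitalized words in text."""
    return CAPITALIZED_RUN.findall(text)

# Hypothetical sentence, not taken from the sample text.
sample = "Ed Greenspon presented the Social Innovation Award to Madonna in Toronto."
print(find_capitalized_runs(sample))
```

On this sentence the pattern captures "Social Innovation Award" alongside "Ed Greenspon", and it skips the single-word names "Madonna" and "Toronto", which are the two failure modes mentioned in the list.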

Does anyone have any other ideas?

Sample text

obin Cardozo\r\n\r\nEd Greenspon\r\n\r\nFarouk Jiwa\r\n\r\nDavid Pecaut\r\n\r\nMartha 
Piper\r\n\r\nThe award was presented during the closing dinner of the Social 
Entrepreneurship\r\nSummit held at MaRS Centre for Social Innovation in Toronto. The event 
gathered\r\nover 250 business, academic and social thought leaders from the 
social\r\nentrepreneurship sector in Canada who had convened for a full day of 
inspiration\r\nand engagement on ways to address some of the most pressing issues of our 
times.\r\n\r\nAn often under-recognized community, social entrepreneurs create and lead 
an\r\norganization that are aimed at catalyzing systemic social change through new\r\nideas, 
products, services, methodologies and changes in attitude.\r\n\r\nHosted in partnership by 
MaRS Centre, The Boston Consulting Group (BCG), the\r\nCentre for Social Innovation and the Toronto City Summit Alliance, the Social\r\nEntrepreneurship Summit 
First, clean up the data:

>>> text = """obin Cardozo\r\n\r\nEd Greenspon\r\n\r\nFarouk Jiwa\r\n\r\nDavid Pecaut\r\n\r\nMartha Piper\r\n\r\nThe award was presented during the closing dinner of the Social Entrepreneurship\r\nSummit held at MaRS Centre for Social Innovation in Toronto. The event gathered\r\nover 250 business, academic and social thought leaders from the social\r\nentrepreneurship sector in Canada who had convened for a full day of inspiration\r\nand engagement on ways to address some of the most pressing issues of our times.\r\n\r\nAn often under-recognized community, social entrepreneurs create and lead an\r\norganization that are aimed at catalyzing systemic social change through new\r\nideas, products, services, methodologies and changes in attitude.\r\n\r\nHosted in partnership by MaRS Centre, The Boston Consulting Group (BCG), the\r\nCentre for Social Innovation and the Toronto City Summit Alliance, the Social\r\nEntrepreneurship Summit"""
>>> # Split on blank lines into paragraphs; join hard-wrapped lines with a space so words do not fuse
>>> text = [i.replace('\r\n', ' ').strip() for i in text.split('\r\n\r\n')]
>>> text
['obin Cardozo', 'Ed Greenspon', 'Farouk Jiwa', 'David Pecaut', 'Martha Piper', 'The award was presented during the closing dinner of the Social Entrepreneurship Summit held at MaRS Centre for Social Innovation in Toronto. The event gathered over 250 business, academic and social thought leaders from the social entrepreneurship sector in Canada who had convened for a full day of inspiration and engagement on ways to address some of the most pressing issues of our times.', 'An often under-recognized community, social entrepreneurs create and lead an organization that are aimed at catalyzing systemic social change through new ideas, products, services, methodologies and changes in attitude.', 'Hosted in partnership by MaRS Centre, The Boston Consulting Group (BCG), the Centre for Social Innovation and the Toronto City Summit Alliance, the Social Entrepreneurship Summit']
Then you will need a full-fledged named entity recognizer. Try NLTK's ne_chunk as a starting point, then move on to more "advanced" NER recognizers:

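from nltk import sent_tokenize, word_tokenize, pos_tag
from nltk.tree import Tree
# batch_ne_chunk was renamed to ne_chunk_sents in NLTK 3
from nltk import ne_chunk_sents as bnc
# For each paragraph: split into sentences, tokenize and POS-tag each one,
# then chunk all tagged sentences of that paragraph in a single call.
chunked_text = [list(bnc(pos_tag(word_tokenize(j)) for j in sent_tokenize(i))) for i in text]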
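To get from chunk trees to the affiliation network mentioned in the question, one possible next step is to pull the PERSON and ORGANIZATION subtrees out of each tree and count which names share a paragraph. The sketch below is an assumption about how that could look, not part of the original answer. The helper names `extract_entities`, `entities_in` and `cooccurrence_edges` are hypothetical, and it assumes the flat named-entity subtrees that `ne_chunk` normally produces.

```python
from collections import Counter
from itertools import combinations

def extract_entities(tree, wanted=("PERSON", "ORGANIZATION")):
    """Collect (label, name) pairs from one ne_chunk-style tree."""
    entities = []
    for node in tree:
        # Plain tokens are (word, tag) tuples; named entities are subtrees with a label().
        if hasattr(node, "label") and node.label() in wanted:
            name = " ".join(word for word, _tag in node)
            entities.append((node.label(), name))
    return entities

def entities_in(paragraph):
    """Tokenize, tag and chunk one paragraph with NLTK, then extract its entities."""
    import nltk  # imported lazily; requires NLTK's tokenizer, tagger and chunker data
    found = []
    for sentence in nltk.sent_tokenize(paragraph):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        found.extend(extract_entities(nltk.ne_chunk(tagged)))
    return found

def cooccurrence_edges(entity_lists):
    """Count how often two distinct names appear in the same paragraph."""
    edges = Counter()
    for entities in entity_lists:
        names = sorted({name for _label, name in entities})
        edges.update(combinations(names, 2))
    return edges
```

With the cleaned list of paragraphs in `text`, something like `cooccurrence_edges(entities_in(p) for p in text)` would give weighted edges that a graph library could draw. Expect noise: the NLTK chunker mislabels entities fairly often, so the resulting edges usually need manual review.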

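This doesn't seem to follow any pattern. If you don't want to use nltk, you can simply split on \r\n and then extract the few items you want.
Thank you very much. Could you give me an example that uses an NER recognizer?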
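The split-based suggestion in the first comment could look something like the sketch below. The "short block of capitalized words" heuristic and the helper names are assumptions made for illustration. The approach only works when names sit on their own lines, as they do at the top of the sample text.

```python
import re

def split_blocks(raw):
    """Split raw text on blank lines and unwrap hard line breaks inside each block."""
    blocks = re.split(r"(?:\r?\n){2,}", raw)
    return [" ".join(block.split()) for block in blocks if block.strip()]

def looks_like_name(block, max_words=3):
    """Heuristic: a short block of capitalized words with no sentence-ending punctuation."""
    words = block.split()
    return (
        0 < len(words) <= max_words
        and not block.endswith((".", "!", "?"))
        and all(word[0].isupper() for word in words)
    )

# Shortened, hypothetical input in the same shape as the sample text.
raw = ("Ed Greenspon\r\n\r\nFarouk Jiwa\r\n\r\n"
       "The award was presented during the closing dinner\r\nof the Summit.")
names = [block for block in split_blocks(raw) if looks_like_name(block)]
print(names)
```

This avoids NLTK entirely, but it cannot find names embedded in running prose, such as the organizations in the last paragraph of the sample.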