Python: extracting information/facts using spaCy POS tagging
I am trying to extract facts from sentences with spaCy, using only POS patterns, such that a sequence of PROPNs is an entity. In the code snippets below, I have patterns for entities and for relations. A major problem with this approach is that when patterns overlap, two or more candidates are extracted. For example, given the sentence "The two blue cars belong to Lorry Jim.",
the extracted entities are ['blue cars', 'two blue cars', 'Lorry Jim']
instead of just ['two blue cars', 'Lorry Jim']
, and for relations I get ['belong', 'belong to']
instead of just ['belong to'].
To get the entities, I use the following method:
def getentities(doc):
    # Each pattern is a sequence of POS tags that counts as one entity.
    entity_patterns = [
        [{'POS': 'PROPN'}, {'POS': 'PROPN'}],
        [{'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'PROPN'}],
        [{'POS': 'NOUN'}, {'POS': 'NOUN'}],
        [{'POS': 'NOUN'}, {'POS': 'NOUN'}, {'POS': 'NOUN'}],
        [{'POS': 'ADJ'}, {'POS': 'NOUN'}],
        [{'POS': 'ADJ'}, {'POS': 'PROPN'}],
        [{'POS': 'NUM'}, {'POS': 'NOUN'}],
        [{'POS': 'NUM'}, {'POS': 'PROPN'}],
        [{'POS': 'NUM'}, {'POS': 'ADJ'}, {'POS': 'NOUN'}],
        [{'POS': 'NUM'}, {'POS': 'ADJ'}, {'POS': 'PROPN'}],
        [{'POS': 'ADJ'}, {'POS': 'NUM'}, {'POS': 'NOUN'}],
        [{'POS': 'ADJ'}, {'POS': 'NUM'}, {'POS': 'PROPN'}]
    ]
    matcherx = Matcher(nlp.vocab)
    for i, pattern in enumerate(entity_patterns):
        # spaCy v2 signature; spaCy v3 expects matcherx.add(str(i), [pattern])
        matcherx.add(str(i), None, pattern)
    doc_entity = []
    matches = matcherx(doc)
    # Every match is returned, including shorter matches inside longer ones.
    for match_id, start, end in matches:
        span = doc[start:end]
        doc_entity.append(span.text)
    return doc_entity
To get the relations/predicates:
def getpredicates(doc):
    # Each pattern is a sequence of POS tags that counts as one relation.
    entity_patterns = [
        [{'POS': 'VERB'}],
        [{'POS': 'VERB'}, {'POS': 'ADP'}],
        [{'POS': 'NOUN'}, {'POS': 'ADP'}],
        [{'POS': 'NOUN'}, {'POS': 'ADP'}, {'POS': 'NOUN'}]
    ]
    matcherx = Matcher(nlp.vocab)
    for i, pattern in enumerate(entity_patterns):
        # spaCy v2 signature; spaCy v3 expects matcherx.add(str(i), [pattern])
        matcherx.add(str(i), None, pattern)
    doc_entity = []
    matches = matcherx(doc)
    for match_id, start, end in matches:
        span = doc[start:end]
        doc_entity.append(span.text)
    return doc_entity
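One possible way to drop the overlapping candidates returned by both functions is to keep only the longest match in each group of overlapping matches. The sketch below works on plain (start, end) token offsets, as found in the Matcher's (match_id, start, end) tuples. `keep_longest` is a hypothetical helper name, not part of spaCy, and the offsets in the example are written by hand rather than produced by a model. For Span objects, spaCy's own `spacy.util.filter_spans` applies a similar longest-first preference.

```python
def keep_longest(matches):
    """Keep the longest matches and drop any match that overlaps a kept one.

    `matches` is an iterable of (start, end) token offsets with `end`
    exclusive. Hypothetical helper, not a spaCy function.
    """
    # Longest first; ties are broken by the earlier start offset.
    ordered = sorted(set(matches), key=lambda m: (-(m[1] - m[0]), m[0]))
    kept = []
    taken = set()  # token indices already covered by a kept match
    for start, end in ordered:
        if not any(i in taken for i in range(start, end)):
            kept.append((start, end))
            taken.update(range(start, end))
    return sorted(kept)


# Hand-written offsets for "The two blue cars belong to Lorry Jim ."
# (2, 4) = "blue cars", (1, 4) = "two blue cars", (6, 8) = "Lorry Jim"
entity_offsets = keep_longest([(2, 4), (1, 4), (6, 8)])
# (4, 5) = "belong", (4, 6) = "belong to"
relation_offsets = keep_longest([(4, 5), (4, 6)])
print(entity_offsets, relation_offsets)
```

Inside `getentities` and `getpredicates`, the same filter could be applied to the (start, end) part of each match before the span text is collected.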
Extracting the triples:
import spacy
from spacy.matcher import Matcher
from itertools import permutations

nlp = spacy.load('en_core_web_sm')

docx = nlp("Tom Joe is the father of Kim Joe and uncle to Tim Joe")
entities = list(set(getentities(docx)))
relations = list(set(getpredicates(docx)))
# Combine every ordered pair of entities with every relation.
per = list(permutations(entities, 2))
alltriples = [(i, x, j) for x in relations for (i, j) in per]
# Keep a triple only if "subject relation object" occurs verbatim in the text.
mainfacts = [x for x in alltriples if x[0] + ' ' + x[1] + ' ' + x[2] in docx.text]
Is there a better, more concise, and more efficient way to achieve this?
The code above fails in many cases, for example:
docx = nlp("Tom Joe is the father of Kim Joe and uncle to Tim Joe") # expected facts [(Tom Joe, father of, Kim Joe), (Tom Joe, uncle to, Tim Joe)]
docx = nlp("The two blue cars belong to Lorry Jim.") # expected facts [(two blue cars, belong to, Lorry Jim)]
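One reason the first sentence fails is that the verbatim substring test needs "Tom Joe father of Kim Joe" to appear in the text, while the sentence has "is the" between the subject and the relation. A possible alternative is to pair entities and relations by token position instead of by substring. The sketch below is only an illustration under a strong assumption: the first entity in the sentence is the subject of every relation, and the object is the nearest entity to the right of the relation. `triples_by_position` is a hypothetical name, and the (start, end, text) tuples are written by hand, assuming overlapping matches have already been removed.

```python
def triples_by_position(entities, relations):
    """Build (subject, relation, object) triples from token positions.

    Both arguments are lists of (start, end, text) tuples with `end`
    exclusive. Assumption (not from spaCy): the first entity is the
    subject of every relation in the sentence.
    """
    entities = sorted(entities)
    if not entities:
        return []
    subject = entities[0]
    triples = []
    for rel_start, rel_end, rel_text in sorted(relations):
        if rel_start < subject[1]:
            continue  # relation does not come after the subject
        # Entities that start at or after the end of the relation.
        candidates = [e for e in entities if e[0] >= rel_end]
        if candidates:
            triples.append((subject[2], rel_text, candidates[0][2]))
    return triples


# "Tom Joe is the father of Kim Joe and uncle to Tim Joe"
facts_1 = triples_by_position(
    [(0, 2, "Tom Joe"), (6, 8, "Kim Joe"), (11, 13, "Tim Joe")],
    [(4, 6, "father of"), (9, 11, "uncle to")],
)
# "The two blue cars belong to Lorry Jim."
facts_2 = triples_by_position(
    [(1, 4, "two blue cars"), (6, 8, "Lorry Jim")],
    [(4, 6, "belong to")],
)
print(facts_1)
print(facts_2)
```

The single-subject assumption breaks on sentences with several clauses or subjects; a dependency-parse-based approach would likely be more robust there, at the cost of no longer relying on POS patterns alone.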