Python: Extracting information/facts using spaCy POS tagging


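I am trying to extract facts from sentences with spaCy using only POS patterns. For instance, a sequence of PROPNs is an entity. In the code snippets below I have patterns for entities and for relations. A major problem with these patterns is that when they overlap, two or more candidates are extracted. For example, given the sentence "The two blue cars belong to Lorry Jim.", the extracted entities are ['blue cars', 'two blue cars', 'Lorry Jim'] instead of just ['two blue cars', 'Lorry Jim'].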
For the relations, I get ['belong', 'belong to'] instead of just ['belong to'].
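The overlap problem described above amounts to keeping only the longest non-overlapping matches. Below is a minimal sketch of that idea in plain Python, working on (start, end) token offsets such as those the Matcher returns. The helper name `keep_longest_spans` and the hand-written offsets are assumptions made for illustration; spaCy also ships `spacy.util.filter_spans`, which applies a similar longest-first rule to `Span` objects.

```python
def keep_longest_spans(spans):
    """Keep the longest spans, dropping any span that overlaps one already kept."""
    # Longest first; ties are broken by the earlier start offset.
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept, used = [], set()
    for start, end in ordered:
        # Skip the span if any of its token positions is already claimed.
        if not any(i in used for i in range(start, end)):
            kept.append((start, end))
            used.update(range(start, end))
    return sorted(kept)


tokens = ["The", "two", "blue", "cars", "belong", "to", "Lorry", "Jim", "."]

# Hypothetical matcher output: overlapping (start, end) candidates.
entity_candidates = [(2, 4), (1, 4), (6, 8)]
relation_candidates = [(4, 5), (4, 6)]

entities = [" ".join(tokens[s:e]) for s, e in keep_longest_spans(entity_candidates)]
relations = [" ".join(tokens[s:e]) for s, e in keep_longest_spans(relation_candidates)]
print(entities)   # ['two blue cars', 'Lorry Jim']
print(relations)  # ['belong to']
```

Filtering on offsets rather than on the extracted strings matters here: once the spans have been turned into text, the information needed to tell which candidates overlap is gone.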

To get the entities, I use the following method:

def getentities(doc):
    # POS patterns for candidate entities.
    # Relies on the module-level `nlp` and `Matcher` set up in the triple-extraction code.
    entity_patterns = [
        [{'POS': 'PROPN'}, {'POS': 'PROPN'}],
        [{'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'PROPN'}],
        [{'POS': 'NOUN'}, {'POS': 'NOUN'}],
        [{'POS': 'NOUN'}, {'POS': 'NOUN'}, {'POS': 'NOUN'}],
        [{'POS': 'ADJ'}, {'POS': 'NOUN'}],
        [{'POS': 'ADJ'}, {'POS': 'PROPN'}],
        [{'POS': 'NUM'}, {'POS': 'NOUN'}],
        [{'POS': 'NUM'}, {'POS': 'PROPN'}],
        [{'POS': 'NUM'}, {'POS': 'ADJ'}, {'POS': 'NOUN'}],
        [{'POS': 'NUM'}, {'POS': 'ADJ'}, {'POS': 'PROPN'}],
        [{'POS': 'ADJ'}, {'POS': 'NUM'}, {'POS': 'NOUN'}],
        [{'POS': 'ADJ'}, {'POS': 'NUM'}, {'POS': 'PROPN'}],
    ]
    matcherx = Matcher(nlp.vocab)
    for i, pattern in enumerate(entity_patterns):
        # spaCy v3 signature; older v2 releases used matcherx.add(str(i), None, pattern)
        matcherx.add(str(i), [pattern])
    # Collect the text of every matched span (overlapping matches included)
    doc_entity = []
    for match_id, start, end in matcherx(doc):
        doc_entity.append(doc[start:end].text)
    return doc_entity
To get the relations/predicates:

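def getpredicates(doc):
    # POS patterns for candidate relations/predicates.
    # Relies on the module-level `nlp` and `Matcher` set up in the triple-extraction code.
    predicate_patterns = [
        [{'POS': 'VERB'}],
        [{'POS': 'VERB'}, {'POS': 'ADP'}],
        [{'POS': 'NOUN'}, {'POS': 'ADP'}],
        [{'POS': 'NOUN'}, {'POS': 'ADP'}, {'POS': 'NOUN'}],
    ]
    matcherx = Matcher(nlp.vocab)
    for i, pattern in enumerate(predicate_patterns):
        # spaCy v3 signature; older v2 releases used matcherx.add(str(i), None, pattern)
        matcherx.add(str(i), [pattern])
    # Collect the text of every matched span (overlapping matches included)
    doc_predicates = []
    for match_id, start, end in matcherx(doc):
        doc_predicates.append(doc[start:end].text)
    return doc_predicates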
To extract the triples:

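from itertools import permutations

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
docx = nlp("Tom Joe is the father of Kim Joe and uncle to Tim Joe")

# Deduplicate the matched entity and relation strings
entities = list(set(getentities(docx)))
relations = list(set(getpredicates(docx)))

# Every ordered pair of distinct entities
per = list(permutations(entities, 2))
# Candidate (subject, relation, object) triples
alltriples = [(i, x, j) for x in relations for (i, j) in per]
# Keep only triples whose joined text occurs verbatim in the sentence
mainfacts = [x for x in alltriples if x[0] + ' ' + x[1] + ' ' + x[2] in docx.text]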
Is there a better, more concise, and more efficient way to achieve this? The code above fails in many cases, for example:

docx = nlp("Tom Joe is the father of Kim Joe and uncle to Tim Joe") # expected facts [(Tom Joe, father of, Kim Joe), (Tom Joe, uncle to, Tim Joe)]

docx = nlp("The two blue cars belong to Lorry Jim.") # expected facts [(two blue cars, belong to, Lorry Jim)]
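One reason the substring check fails on the first example is that "is the" sits between the subject and the relation, so "Tom Joe father of Kim Joe" never occurs verbatim in the text. A possible alternative is to pair entities and relations by token position instead. The sketch below is only a heuristic under two loud assumptions: the sentence has a single subject (the first entity), and each relation's object is the first entity that starts after it. The function `build_triples` and the hand-written (start, end, text) offsets are hypothetical stand-ins for filtered matcher output, not something spaCy provides.

```python
def build_triples(entities, relations):
    """Pair the first entity (assumed subject) with each relation and the
    first entity that starts at or after the end of that relation."""
    entities = sorted(entities)
    relations = sorted(relations)
    if not entities:
        return []
    subject = entities[0]
    triples = []
    for r_start, r_end, r_text in relations:
        if r_start < subject[1]:
            continue  # a relation must come after the subject
        following = [e for e in entities if e[0] >= r_end]
        if following:
            triples.append((subject[2], r_text, following[0][2]))
    return triples


# Hypothetical offsets for:
# "Tom Joe is the father of Kim Joe and uncle to Tim Joe"
entities_1 = [(0, 2, "Tom Joe"), (6, 8, "Kim Joe"), (11, 13, "Tim Joe")]
relations_1 = [(4, 6, "father of"), (9, 11, "uncle to")]
facts_1 = build_triples(entities_1, relations_1)
print(facts_1)

# Hypothetical offsets for: "The two blue cars belong to Lorry Jim."
entities_2 = [(1, 4, "two blue cars"), (6, 8, "Lorry Jim")]
relations_2 = [(4, 6, "belong to")]
facts_2 = build_triples(entities_2, relations_2)
print(facts_2)
```

This reproduces the expected facts for the two sentences above, but it will mislabel sentences with several subjects or with passive constructions; a dependency-based approach would likely be needed for those.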