Extracting information/facts using spaCy POS tagging

python python-3.x string nlp

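I am trying to extract facts from sentences with spaCy using only POS patterns, so that, for example, a sequence of PROPN tokens is one entity. The snippets below define the patterns I use for entities and relations. The main problem is that overlapping patterns yield two or more candidates for the same stretch of text. For example, given the sentence

The two blue cars belong to Lorry Jim.

the extracted entities are

['blue cars', 'two blue cars', 'Lorry Jim']

instead of just

['two blue cars', 'Lorry Jim']

and the extracted relations are

['belong', 'belong to']

instead of just

['belong to']

To get the entities, I use the following: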

def getentities(doc):
  entity_patterns = [
                     [{'POS':'PROPN'},{'POS':'PROPN'}],
                     [{'POS':'PROPN'},{'POS':'PROPN'},{'POS':'PROPN'}],
                     [{'POS':'PROPN'},{'POS':'PROPN'},{'POS':'PROPN'}],
                     [{'POS':'NOUN'},{'POS':'NOUN'}],
                     [{'POS':'NOUN'},{'POS':'NOUN'},{'POS':'NOUN'}],
                     [{'POS':'NOUN'},{'POS':'NOUN'},{'POS':'NOUN'}],
                     [{'POS':'ADJ'},{'POS':'NOUN'}],
                     [{'POS':'ADJ'},{'POS':'PROPN'}],
                     [{'POS':'NUM'},{'POS':'NOUN'}],
                     [{'POS':'NUM'},{'POS':'PROPN'}],
                     [{'POS':'NUM'},{'POS':'ADJ'},{'POS':'NOUN'}],
                     [{'POS':'NUM'},{'POS':'ADJ'},{'POS':'PROPN'}],
                     [{'POS':'ADJ'},{'POS':'NUM'},{'POS':'NOUN'}],
                     [{'POS':'ADJ'},{'POS':'NUM'},{'POS':'PROPN'}]

                     ]
  matcherx = Matcher(nlp.vocab)
  # add(key, [pattern]) works on spaCy >= 2.2.2 and v3; add(key, None, pattern) fails on v3
  for i, pattern in enumerate(entity_patterns):
    matcherx.add(str(i), [pattern])
  doc_entity = []
  matches = matcherx(doc)
  for match_id, start, end in matches:
    span = doc[start:end]
    doc_entity.append(span.text)
  return doc_entity
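
One way to drop the overlapping candidates is to keep only the longest match wherever matches overlap. The block below is a minimal sketch of that idea in plain Python, working on (start, end) token offsets like the ones the Matcher returns. The helper name keep_longest and the hard-coded offsets are assumptions made for illustration. spaCy also provides spacy.util.filter_spans, which applies a similar longest-first rule to Span objects.

```python
def keep_longest(matches):
    """Keep the longest of any overlapping (start, end) token ranges.

    `matches` is an iterable of (start, end) pairs with `end` exclusive.
    Ties between equally long spans go to the earlier one.
    """
    # Longest first; among equal lengths, earliest start first.
    ordered = sorted(set(matches), key=lambda m: (-(m[1] - m[0]), m[0]))
    kept = []
    taken = set()  # token indices already covered by a kept span
    for start, end in ordered:
        if not any(i in taken for i in range(start, end)):
            kept.append((start, end))
            taken.update(range(start, end))
    return sorted(kept)


# Hypothetical token offsets for the example sentence (not produced by spaCy here).
tokens = ["The", "two", "blue", "cars", "belong", "to", "Lorry", "Jim", "."]
candidates = [(2, 4), (1, 4), (6, 8)]  # 'blue cars', 'two blue cars', 'Lorry Jim'
entities = [" ".join(tokens[s:e]) for s, e in keep_longest(candidates)]
print(entities)  # ['two blue cars', 'Lorry Jim']
```

The same filter would apply to the relation candidates, where 'belong' is contained in 'belong to'.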

To get the relations/predicates:

def getpredicates(doc):
  entity_patterns = [
                     [{'POS':'VERB'}],
                     [{'POS':'VERB'},{'POS':'ADP'}],
                     [{'POS':'NOUN'},{'POS':'ADP'}],
                     [{'POS':'NOUN'},{'POS':'ADP'},{'POS':'NOUN'}]
                     ]
  matcherx = Matcher(nlp.vocab)
  # add(key, [pattern]) works on spaCy >= 2.2.2 and v3; add(key, None, pattern) fails on v3
  for i, pattern in enumerate(entity_patterns):
    matcherx.add(str(i), [pattern])
  doc_entity = []
  matches = matcherx(doc)
  for match_id, start, end in matches:
    span = doc[start:end]
    doc_entity.append(span.text)
  return doc_entity

To extract the triples:

import spacy
from spacy.matcher import Matcher
from itertools import permutations

nlp = spacy.load('en_core_web_sm')
docx = nlp("Tom Joe is the father of Kim Joe and uncle to Tim Joe")
entities = list(set(getentities(docx)))
relations = list(set(getpredicates(docx)))

# Pair every relation with every ordered pair of entities
per = list(permutations(entities, 2))
alltriples = [(i, x, j) for x in relations for (i, j) in per]
# Keep only triples whose "subject relation object" text occurs verbatim in the sentence
mainfacts = [x for x in alltriples if x[0] + ' ' + x[1] + ' ' + x[2] in docx.text]

Is there a better, more concise and more efficient way to do this? The code above fails in many cases, for example:

docx = nlp("Tom Joe is the father of Kim Joe and uncle to Tim Joe") # expected facts [(Tom Joe, father of, Kim Joe), (Tom Joe, uncle to, Tim Joe)]

docx = nlp("The two blue cars belong to Lorry Jim.") # expected facts [(two blue cars, belong to, Lorry Jim)]
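
For the two sentences above, one possible alternative to testing every permutation is to walk the entity and relation spans in sentence order. The block below is only a sketch of that heuristic, in plain Python: it assumes overlapping candidates have already been reduced to the longest span, treats the first entity as the subject, and pairs each relation with the next entity after it. The function name triples_in_order, the (start, kind, text) tuple layout and the hard-coded spans are all assumptions for illustration, and the heuristic will not suit sentences with a different structure.

```python
def triples_in_order(spans):
    """Build (subject, relation, object) triples from position-ordered spans.

    `spans` is a list of (start, kind, text) tuples, where kind is "ENT" or "REL".
    Heuristic: the first entity is the subject, and each relation is paired
    with the next entity that follows it.
    """
    subject = None
    pending = None  # most recent relation still waiting for an object
    triples = []
    for _, kind, text in sorted(spans):
        if kind == "REL":
            if subject is not None:
                pending = text  # a later relation replaces an unused earlier one
        elif subject is None:
            subject = text
        elif pending is not None:
            triples.append((subject, pending, text))
            pending = None
    return triples


# Hypothetical spans for "Tom Joe is the father of Kim Joe and uncle to Tim Joe"
first = [(0, "ENT", "Tom Joe"), (4, "REL", "father of"), (6, "ENT", "Kim Joe"),
         (9, "REL", "uncle to"), (11, "ENT", "Tim Joe")]
# Hypothetical spans for "The two blue cars belong to Lorry Jim."
second = [(1, "ENT", "two blue cars"), (4, "REL", "belong to"), (6, "ENT", "Lorry Jim")]

print(triples_in_order(first))
print(triples_in_order(second))
```

Unlike the substring check, this does not require the subject, relation and object to be contiguous in the text, which is why it can recover 'father of' even though 'is the' sits between 'Tom Joe' and it.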