Python: spaCy's Matcher

I just added the following extension to spaCy tokens:
from spacy.tokens import Token
has_dep = lambda token,name: name in [child.dep_ for child in token.children]
Token.set_extension('HAS_DEP', method=has_dep)
So I can check whether a token has a given dependency label among its children, like this:
doc = nlp(u'We are walking around.')
walking = doc[2]
walking._.HAS_DEP('nsubj')
This outputs True, because "walking" has a child whose dependency label is "nsubj" (the word "We").
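The extension itself is just a membership test over the children's dependency labels. A spaCy-free sketch of the same check (StubToken is a made-up stand-in for illustration, not part of spaCy):

```python
# Minimal stub mimicking only the attributes the extension reads.
class StubToken:
    def __init__(self, text, dep_, children=()):
        self.text = text
        self.dep_ = dep_
        self.children = list(children)

# Same logic as the extension: does any child carry the given dep label?
has_dep = lambda token, name: name in [child.dep_ for child in token.children]

we = StubToken("We", "nsubj")
walking = StubToken("walking", "ROOT", children=[we])

print(has_dep(walking, "nsubj"))  # True
print(has_dep(walking, "dobj"))   # False
```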
However, I don't see how to use this extension together with spaCy's Matcher. Below is what I wrote; I expect the output to be "walking", but it doesn't seem to work:
matcher = Matcher(nlp.vocab)
pattern = [
    {"_": {"HAS_DEP": {"name": "nsubj"}}}  # this is the line I'm not sure of
]
matcher.add("depnsubj", None, pattern)
doc = nlp("We're walking around the house.")
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    print(span)
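A likely reason the pattern above never matches: the Matcher compares a custom extension's current value against the pattern specification, while a method extension exposes a callable when accessed through `token._`. The sketch below illustrates that comparison with stub code; it is an analogy, not spaCy's actual internals:

```python
import functools

# The extension function from the question.
has_dep = lambda token, name: name in [child.dep_ for child in token.children]

# Stub standing in for a spaCy Token (illustration only).
class StubToken:
    def __init__(self, dep_, children=()):
        self.dep_ = dep_
        self.children = list(children)

walking = StubToken("ROOT", children=[StubToken("nsubj")])

# Roughly what accessing a method extension via token._ yields: a bound callable.
bound = functools.partial(has_dep, walking)

spec = {"name": "nsubj"}
print(bound == spec)   # False: a callable never equals the pattern dict
print(bound("nsubj"))  # True, but the Matcher never calls it with arguments
```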
I think you could use doc.retokenize() and token.head instead, like this:
from spacy.matcher import Matcher
import en_core_web_sm
nlp = en_core_web_sm.load()
matcher = Matcher(nlp.vocab)
pattern = [{'DEP': 'nsubj'}]
matcher.add("depnsubj", None, pattern)
doc = nlp("We're walking around the house.")
matches = matcher(doc)
matched_spans = []
for match_id, start, end in matches:
    span = doc[start:end]
    matched_spans.append(span)
with doc.retokenize() as retokenizer:
    for span in matched_spans:
        retokenizer.merge(span)
        for token in span:
            print(token.head)
Output:
walking
I think what you're after can be achieved with a getter:
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Token
has_dep = lambda token: 'nsubj' in [child.dep_ for child in token.children]
Token.set_extension('HAS_DEP_NSUBJ', getter=has_dep, force=True)
nlp = spacy.load("en_core_web_md")
matcher = Matcher(nlp.vocab)
matcher.add("depnsubj", None, [{"_": {"HAS_DEP_NSUBJ": True}}])
doc = nlp("We're walking around the house.")
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    print(span)
Output:
walking
Since a Matcher pattern has no mechanism for passing a dependency label name to the extension, I think this is the most practical solution.
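If you need this for several labels, one way to keep the getter approach is a small factory that produces one boolean getter per dependency label. This is a hedged sketch; `make_dep_getter` is a hypothetical helper, not a spaCy API, and the StubToken is only for the spaCy-free check at the bottom:

```python
def make_dep_getter(dep_label):
    """Return a getter closure suitable for Token.set_extension(..., getter=...)."""
    return lambda token: dep_label in [child.dep_ for child in token.children]

# Registration would then look like (one extension per label you care about):
#   Token.set_extension("HAS_DEP_NSUBJ", getter=make_dep_getter("nsubj"), force=True)
#   Token.set_extension("HAS_DEP_DOBJ",  getter=make_dep_getter("dobj"),  force=True)

# Quick spaCy-free check with a stub object standing in for a Token:
class StubToken:
    def __init__(self, dep_, children=()):
        self.dep_ = dep_
        self.children = list(children)

walking = StubToken("ROOT", children=[StubToken("nsubj")])
print(make_dep_getter("nsubj")(walking))  # True
print(make_dep_getter("dobj")(walking))   # False
```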