
Spacy regex phrase matcher in Python


From a huge text corpus, I am interested in extracting every sentence that contains a certain (verb-noun) or (adjective-noun) pair somewhere in it. I have a long list of these, but here is a sample. In my MWE, I am trying to extract sentences with "write/wrote/writes" and "book/books". I have about 30 such word pairs.

Here is what I have tried, but it misses most of the sentences:

import spacy
nlp = spacy.load('en_core_web_sm')
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

doc = nlp(u'Graham Greene is his favorite author. He wrote his first book when he was a hundred and fifty years old.\
While writing this book, he had to fend off aliens and dinosaurs. Greene\'s second book might not have been written by him. \
Greene\'s cat in its deathbed testimony alleged that it was the original writer of the book. The fact that plot of the book revolves around \
rats conquering the world, lends credence to the idea that only a cat could have been the true writer of such an inane book.')

matcher = Matcher(nlp.vocab)
pattern = [{"LEMMA": "write"},{"TEXT": {"REGEX": ".+"}},{"LEMMA": "book"}]
matcher.add("testy", None, pattern)

for sent in doc.sents:
    if matcher(nlp(sent.lemma_)):
        print(sent.text)
Unfortunately, I only get one match:

"While writing this book, he had to fend off aliens and dinosaurs."


However, I would also expect to get the sentence "He wrote his first book". The other book-writing sentences have "writer" as a noun, so they don't match.

The problem is that in the Matcher, by default, each dictionary in a pattern corresponds to exactly one token. So your regex isn't matching any number of characters between the two words; it matches within a single token, which is not what you want.

To get what you want, you can use the OP key to specify that a pattern entry should match any number of tokens. See the operators section of the Matcher docs.
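As a minimal sketch of the OP operator (spaCy v3 Matcher API; this uses a blank pipeline and LOWER attributes instead of LEMMA so it runs without a trained model):

```python
# Minimal sketch of the "OP" operator in a Matcher pattern (spaCy v3).
# A blank pipeline is enough here because we match on LOWER, not LEMMA.
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# "write", then any number of intervening tokens ("OP": "*"), then "book".
# Other operators: "?" (0 or 1 token), "+" (1 or more), "!" (negation).
matcher.add("WRITE_BOOK", [[{"LOWER": "write"}, {"OP": "*"}, {"LOWER": "book"}]])

doc = nlp("I will write my first book")
matches = matcher(doc)
print([doc[start:end].text for _, start, end in matches])
```

Without the "OP": "*" entry, the pattern would only match "write" and "book" as adjacent tokens (or separated by exactly one token, with the original regex dictionary).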

However, given your problem, you probably want to actually use the DependencyMatcher, so I rewrote your code to use it as well. Try this:

import spacy
nlp = spacy.load('en_core_web_sm')
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

doc = nlp("""
Graham Greene is his favorite author. He wrote his first book when he was a hundred and fifty years old.
While writing this book, he had to fend off aliens and dinosaurs. Greene's second book might not have been written by him. 
Greene's cat in its deathbed testimony alleged that it was the original writer of the book. The fact that plot of the book revolves around 
rats conquering the world, lends credence to the idea that only a cat could have been the true writer of such an inane book.""")

matcher = Matcher(nlp.vocab)
pattern = [{"LEMMA": "write"},{"OP": "*"},{"LEMMA": "book"}]
matcher.add("testy", [pattern])

print("----- Using Matcher -----")
for sent in doc.sents:
    if matcher(sent):
        print(sent.text)

print("----- Using Dependency Matcher -----")

deppattern = [
        {"RIGHT_ID": "wrote", "RIGHT_ATTRS": {"LEMMA": "write"}},
        {"LEFT_ID": "wrote", "REL_OP": ">", "RIGHT_ID": "book", 
            "RIGHT_ATTRS": {"LEMMA": "book"}}
        ]

from spacy.matcher import DependencyMatcher

dmatcher = DependencyMatcher(nlp.vocab)

dmatcher.add("BOOK", [deppattern])

for _, (start, end) in dmatcher(doc):
    print(doc[start].sent)

One less important thing: the way you're calling the matcher is a bit odd. You can pass the matcher a Doc or a Span, but it should definitely be natural text, so calling .lemma_ on a sentence and creating a new doc from it happens to work in your case, but should generally be avoided.

Comments:

Thanks a lot for your answer. I'm reading up on the DependencyMatcher now; thanks for introducing it. However, when I run your code I get this error: "[E098] Invalid pattern specified: expected both SPEC and PATTERN."

Sounds like you're using spaCy v2. My code is written for v3, and I'd recommend you upgrade; v2 doesn't support the DependencyMatcher.

Yes, my spacy is 2.3.5, and I will upgrade. For the code, I tried this and it worked:

deppattern = [{"SPEC": {"NODE_NAME": "write"}, "PATTERN": {"LEMMA": "write"}},
              {"SPEC": {"NBOR_NAME": "write", "NBOR_RELOP": ">", "NODE_NAME": "book"}, "PATTERN": {"LEMMA": "book"}}]

Thanks! Quick question: how would you change the dependency pattern code if instead of "write > book" it were "draft > doctoral dissertation", "write > referee report", or "prepare > some long phrase" — basically, when the dependent on the right is not a single word but a phrase that might not necessarily make sense on its own in plain English?

That depends on what the dependency parse looks like. Merging noun chunks might make it simpler, but I'd recommend looking at the dependency parse of your target sentences and figuring out how to decompose it.
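To make that last point concrete, here is a hedged sketch (spaCy v3) of one way to decompose a phrase like "doctoral dissertation": anchor the DependencyMatcher pattern on the head noun and add a separate node for its modifier. The sentence, heads, and dependency labels are hand-annotated so the example runs without a trained parser; with a real pipeline you would just call nlp(text) and inspect the parse.

```python
# Sketch: matching a multi-word object ("doctoral dissertation") with the
# DependencyMatcher by splitting it into a head noun plus a modifier node.
# The Doc is built with hand-annotated heads/deps so no model is needed.
import spacy
from spacy.tokens import Doc
from spacy.matcher import DependencyMatcher

nlp = spacy.blank("en")
words = ["He", "drafted", "his", "doctoral", "dissertation", "."]
heads = [1, 1, 4, 4, 1, 1]  # index of each token's syntactic head
deps = ["nsubj", "ROOT", "poss", "amod", "dobj", "punct"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

pattern = [
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"LOWER": "drafted"}},
    # the head noun of the phrase is a direct child of the verb
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "noun",
     "RIGHT_ATTRS": {"LOWER": "dissertation"}},
    # the modifier hangs off the noun, not the verb
    {"LEFT_ID": "noun", "REL_OP": ">", "RIGHT_ID": "mod",
     "RIGHT_ATTRS": {"LOWER": "doctoral"}},
]

matcher = DependencyMatcher(nlp.vocab)
matcher.add("DRAFT_DISSERTATION", [pattern])
for match_id, token_ids in matcher(doc):
    # token_ids are ordered like the pattern nodes: verb, noun, mod
    print([doc[i].text for i in token_ids])
```

The key idea is that "doctoral" is a dependent of "dissertation", not of "drafted", so the phrase naturally decomposes into a chain of single-token nodes. For real data, print token.head and token.dep_ for your target sentences to see which chain to encode.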