Python: generating bigrams, but only noun and verb combinations

python nlp

I have the code below, which generates bigrams for a column of my data frame:

import nltk
import collections
counts = collections.Counter()
for sent in df["message"]:
    words = nltk.word_tokenize(sent)
    counts.update(nltk.bigrams(words))
counts = {k: v for k, v in counts.items() if v > 25}

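This works well for generating the most common bigrams in the "message" column of the data frame. However, I only want bigrams in which each pair consists of one verb and one noun.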

Any help doing this with spaCy or nltk would be appreciated.

You have to apply pos_tag first, and only then build the bigrams.

You can try it like this:

import nltk

sent = 'The thieves stole the paintings'
token_sent = nltk.word_tokenize(sent)
tagged_sent = nltk.pos_tag(token_sent)

word_tag_pairs = nltk.bigrams(tagged_sent)

##Apply conditions according to your requirement to filter the bigrams

print([(a,b) for a, b in word_tag_pairs if a[1].startswith('N') and b[1].startswith('V')])

It gives just one result:

[(('thieves', 'NNS'), ('stole', 'VBD'))]
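The snippet above keeps only noun-then-verb pairs in a single sentence. A minimal sketch of how the same idea might be combined with the counting loop from the question, accepting both orders (noun-verb and verb-noun), is shown below. The helper names `is_noun_verb_pair`, `noun_verb_bigrams` and `count_noun_verb_bigrams` and the `threshold` parameter are made up for illustration, and the code assumes the NLTK tokenizer and tagger data are already downloaded.

```python
import collections


def is_noun_verb_pair(tag_a, tag_b):
    """True if one Penn Treebank tag is a noun (N*) and the other a verb (V*)."""
    return (tag_a.startswith("N") and tag_b.startswith("V")) or (
        tag_a.startswith("V") and tag_b.startswith("N")
    )


def noun_verb_bigrams(tagged_tokens):
    """Yield (word, word) for adjacent tokens forming a noun-verb or verb-noun pair.

    tagged_tokens is a list of (word, tag) tuples, as returned by nltk.pos_tag.
    """
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if is_noun_verb_pair(t1, t2):
            yield (w1, w2)


def count_noun_verb_bigrams(sentences, threshold=25):
    """Count noun/verb bigrams over many sentences, keeping counts above threshold.

    Hypothetical helper: `sentences` can be any iterable of strings,
    for example df["message"] from the question.
    """
    import nltk  # imported here so the pure helpers above work without NLTK

    counts = collections.Counter()
    for sent in sentences:
        tagged = nltk.pos_tag(nltk.word_tokenize(sent))
        counts.update(noun_verb_bigrams(tagged))
    return {pair: n for pair, n in counts.items() if n > threshold}
```

Calling `count_noun_verb_bigrams(df["message"])` would then play the role of the loop in the question. Tagging each sentence before pairing matters here, because the tagger relies on the surrounding words to decide between noun and verb readings.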

With spaCy, you have access to pretrained models for various languages. You can install them like this:

python -m spacy download en_core_web_sm

Then you can easily run something like this to do custom filtering:

import spacy

text = "The sleeping cat thought that sitting in the couch resting would be a great idea."
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
for i in range(len(doc)):
    j = i+1
    if j < len(doc):
        if (doc[i].pos_ == "NOUN" and doc[j].pos_ == "VERB") or (doc[i].pos_ == "VERB" and doc[j].pos_ == "NOUN"):
            print(doc[i].text, doc[j].text, doc[i].pos_, doc[j].pos_)

Which will output:

sleeping cat VERB NOUN
cat thought NOUN VERB
couch resting NOUN VERB
An example would be nice.

@acodejdatam do you mean N,V and V,N bigrams?

@ongenz, yes. I only want noun-verb and verb-noun bigrams.

Thanks for the help! How can I run this with spaCy on a data frame column of text rather than on a single text string?

You're welcome! To run spaCy efficiently over a large number of texts, use nlp.pipe(texts) - see here: [link missing]

@Sofie VL could you provide a code example with a data frame? Say my data frame has a column of text, with one sentence per row. How do I run this code on it? I couldn't find the answer in the documentation.

Sure, you first have to extract the texts from the column. Something like text=df["message"]. This isn't part of the spaCy library, but preprocessing that needs to happen on your end...
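To make the last two comments concrete, here is a rough sketch of how the spaCy filter might be run over a data frame column with nlp.pipe. The helper names `noun_verb_pairs` and `count_pairs`, along with the `model` and `threshold` parameters, are hypothetical, and the code assumes the `en_core_web_sm` model is installed and that the column holds plain strings.

```python
import collections

# Coarse POS combinations to keep: noun followed by verb, or verb followed by noun.
TARGET_PAIRS = {("NOUN", "VERB"), ("VERB", "NOUN")}


def noun_verb_pairs(doc):
    """Yield (text, text) for adjacent tokens whose pos_ tags form a target pair.

    `doc` is a spaCy Doc, or any sliceable sequence of objects
    that have .text and .pos_ attributes.
    """
    for left, right in zip(doc, doc[1:]):
        if (left.pos_, right.pos_) in TARGET_PAIRS:
            yield (left.text, right.text)


def count_pairs(texts, model="en_core_web_sm", threshold=25):
    """Count noun/verb bigrams over an iterable of strings.

    Hypothetical helper: pass something like df["message"].astype(str).
    """
    import spacy  # imported here so noun_verb_pairs can be used without spaCy

    nlp = spacy.load(model)
    counts = collections.Counter()
    # nlp.pipe processes the texts as a stream, in batches
    for doc in nlp.pipe(texts):
        counts.update(noun_verb_pairs(doc))
    return {pair: n for pair, n in counts.items() if n > threshold}
```

Usage would be along the lines of `count_pairs(df["message"].astype(str))`, which mirrors the `v > 25` cutoff from the question. Rows with missing values would need to be dropped or filled beforehand, since nlp.pipe expects strings.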