Python: extracting sentences that contain particular words


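I have an Excel file with a text column. All I need to do is extract, for each row, the sentences in the text column that contain particular words.

I have tried doing this by defining a function: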

import pandas as pd
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

#################Reading in excel file#####################

str_df = pd.read_excel("C:\\Users\\HP\\Desktop\\context.xlsx")

################# Defining a function #####################

def sentence_finder(text,word):
    sentences=sent_tokenize(text)
    return [sent for sent in sentences if word in word_tokenize(sent)]
################# Finding Context ##########################
str_df['context'] = str_df['text'].apply(sentence_finder,args=('snakes',))

################# Output file #################################
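str_df.to_excel("C:\\Users\\HP\\Desktop\\context_result.xlsx")

But what if I have to find sentences containing any of several particular words, such as venomous or anaconda? Could someone help me? A sentence should be returned if it contains at least one of the words. I could not work out how to do this with nltk.tokenize for multiple words.

Words to search for:
words = ['snakes', 'venomous', 'anaconda']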

Input Excel file:

                    text
     1.  Snakes are venomous. Anaconda is venomous.
     2.  Anaconda lives in Amazon.Amazon is a big forest. It is venomous.
     3.  Snakes,snakes,snakes everywhere! Mummyyyyyyy!!!The least I expect is an    anaconda.Because it is venomous.
     4.  Python is dangerous too.
Desired output:

                    text
     1.  Snakes are venomous. Anaconda is venomous.
     2.  Anaconda lives in Amazon.Amazon is a big forest. It is venomous.
     3.  Snakes,snakes,snakes everywhere! Mummyyyyyyy!!!The least I expect is an    anaconda.Because it is venomous.
     4.  Python is dangerous too.
A column named Context should be appended next to the text column above. The Context column should look like this:

 1.  [Snakes are venomous.] [Anaconda is venomous.]
 2.  [Anaconda lives in Amazon.] [It is venomous.]
 3.  [Snakes,snakes,snakes everywhere!] [The least I expect is an    anaconda.Because it is venomous.]
 4.  NULL
Thanks in advance.

Here is one way to do it:

In [1]: df['text'].apply(lambda text: [sent for sent in sent_tokenize(text)
                                       if any(True for w in word_tokenize(sent) 
                                               if w.lower() in searched_words)])

0    [Snakes are venomous., Anaconda is venomous.]
1    [Anaconda lives in Amazon.Amazon is a big forest., It is venomous.]
2    [Snakes,snakes,snakes everywhere!, !The least I expect is an anaconda.Because it is venomous.]
3    []
Name: text, dtype: object
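You will notice two problems in rows 1 and 2 of this output. They occur because sent_tokenize does not split sentences correctly when there is no space after the punctuation.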
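One possible workaround for the missing-space problem is to normalize the text before tokenizing it. The sketch below is not part of the original answer: `fix_sentence_spacing` is a hypothetical helper that inserts a space after `.`, `!` or `?` whenever a letter follows immediately. It assumes the text contains no abbreviations or domain names such as "e.g.this" or "nltk.org", which it would also split.

```python
import re

# Hypothetical helper (an assumption, not from the original answer):
# match a '.', '!' or '?' that is directly followed by an ASCII letter.
_MISSING_SPACE = re.compile(r"([.!?])(?=[A-Za-z])")


def fix_sentence_spacing(text):
    """Insert a space after sentence-ending punctuation glued to the next word."""
    return _MISSING_SPACE.sub(r"\1 ", text)


print(fix_sentence_spacing("Anaconda lives in Amazon.Amazon is a big forest. It is venomous."))
```

The cleaned text can then be passed to sent_tokenize, for instance by first running `df['text'].apply(fix_sentence_spacing)`.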


Update: handling plurals

Here is the updated df:

text
Snakes are venomous. Anaconda is venomous.
Anaconda lives in Amazon. Amazon is a big forest. It is venomous.
Snakes,snakes,snakes everywhere! Mummyyyyyyy!!! The least I expect is an anaconda. Because it is venomous.
Python is dangerous too.
I have snakes


df = pd.read_clipboard(sep='0')
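We can use a stemmer to handle plurals. The snippets below use an object named stemmer that provides a stem() method.

First, let's lowercase and stem the searched words: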

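# Assumption: the original does not show how `stemmer` was created.
# nltk's PorterStemmer is used here as one possible choice.
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

searched_words = ['snakes','Venomous','anacondas']
searched_words = [stemmer.stem(w.lower()) for w in searched_words]
searched_words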

> ['snake', 'venom', 'anaconda']
Now we can improve on the earlier snippet by adding stemming:

print(df['text'].apply(lambda text: [sent for sent in sent_tokenize(text)
                           if any(True for w in word_tokenize(sent) 
                                     if stemmer.stem(w.lower()) in searched_words)]))

0    [Snakes are venomous., Anaconda is venomous.]
1    [Anaconda lives in Amazon., It is venomous.]
2    [Snakes,snakes,snakes everywhere!, The least I expect is an anaconda., Because it is venomous.]
3    []
4    [I have snakes]
Name: text, dtype: object

If you only need substring matching, make sure the searched words are singular rather than plural:

# Keep a sentence if any searched word is a substring of any of its tokens.
print(df['text'].apply(lambda text: [sent for sent in sent_tokenize(text)
                           if any(w2.lower() in w.lower()
                                  for w in word_tokenize(sent)
                                  for w2 in searched_words)]))

By the way, I would probably write a function with a regular for loop here; this lambda with a list comprehension is getting out of hand.
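A minimal sketch of what such a function might look like, using substring matching. The names `split_sentences` and `find_sentences` are hypothetical, and the regex splitter is only a stand-in for nltk's sent_tokenize so that the example runs without nltk. Unlike the snippet above, it tests each searched word against the whole lowercased sentence rather than against individual tokens.

```python
import re


def split_sentences(text):
    # Stand-in for nltk's sent_tokenize (an assumption for this sketch):
    # split wherever '.', '!' or '?' is followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def find_sentences(text, searched_words):
    """Return the sentences of `text` that contain at least one searched word."""
    lowered_words = [w.lower() for w in searched_words]
    matches = []
    for sentence in split_sentences(text):
        sentence_lower = sentence.lower()
        for word in lowered_words:
            # Substring match, so 'snake' also hits 'snakes' or 'snakesare'.
            if word in sentence_lower:
                matches.append(sentence)
                break  # one hit is enough; move on to the next sentence
    return matches


print(find_sentences("Anaconda lives in Amazon. Amazon is a big forest. It is venomous.",
                     ["snake", "venom", "anaconda"]))
```

It can be applied to the column in the same way as the function in the question, for example `df['text'].apply(find_sentences, args=(searched_words,))`.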

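Comments:

Please post a sample of your str_df, along with your desired output.

@JulienMarrec Edited. Thanks.

Your third example contains two sentences, which suggests you want coreference resolution, and that is not easy. If you only want to extract sentences, it is easier (for example, text delimited by . ! ?). Also, please show your current output, even if it is wrong.

Thanks, this works. Yes, I also ran into problems with text like "Snakes are venomous.Python is too." I want the output to be [Snakes are venomous.], but I get [Snakes are venomous.Python is too.] because there is no space at the start of the second sentence. Can I get the sentences containing "snakes" even if I only give "snake" in the word list? All I need is for substrings to match the specified words, so that I do not lose any data when analysing the context.

Stemming is a possible solution, but the problem does not deal only with structured data. I have a web crawler that fetches the data, so I am talking about big data (GBs). It contains data like "Snakess", or joined words like "snakesare", so when I search the text for snake it should also return the sentences with "snakesare"! Bottom line: I want substring matching to happen. Thanks for your reply.

I'm sure you could have figured it out, but I added a solution.

Thanks anyway, Julien, I will look into it!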