Python: extracting sentences that contain specific words (python, pandas, nltk)


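I have an Excel file with a text column. All I need to do is extract, for each row, the sentences in the text column that contain specific words.

I tried defining a function: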

import pandas as pd
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

#################Reading in excel file#####################

str_df = pd.read_excel("C:\\Users\\HP\\Desktop\\context.xlsx")

################# Defining a function #####################

def sentence_finder(text,word):
    sentences=sent_tokenize(text)
    return [sent for sent in sentences if word in word_tokenize(sent)]
################# Finding Context ##########################
str_df['context'] = str_df['text'].apply(sentence_finder,args=('snakes',))

################# Output file #################################
str_df.to_excel("C:\\Users\\HP\\Desktop\\context_result.xlsx")

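But what if I have to find sentences that contain any of several specific words, such as "snakes", "venomous" and "anaconda"? Can someone help me? A sentence should contain at least one of the words. I can't work out how to use nltk.tokenize with multiple words.

Words to search for: words = ['snakes', 'venomous', 'anaconda']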

Input Excel file:
                    text
     1.  Snakes are venomous. Anaconda is venomous.
     2.  Anaconda lives in Amazon.Amazon is a big forest. It is venomous.
     3.  Snakes,snakes,snakes everywhere! Mummyyyyyyy!!!The least I expect is an    anaconda.Because it is venomous.
     4.  Python is dangerous too.

Desired output:
                    text
     1.  Snakes are venomous. Anaconda is venomous.
     2.  Anaconda lives in Amazon.Amazon is a big forest. It is venomous.
     3.  Snakes,snakes,snakes everywhere! Mummyyyyyyy!!!The least I expect is an    anaconda.Because it is venomous.
     4.  Python is dangerous too.

A column named Context should be appended to the text column above. The Context column should look like this:
 1.  [Snakes are venomous.] [Anaconda is venomous.]
 2.  [Anaconda lives in Amazon.] [It is venomous.]
 3.  [Snakes,snakes,snakes everywhere!] [The least I expect is an    anaconda.Because it is venomous.]
 4.  NULL

Thanks in advance.

Here's how:
In [1]: df['text'].apply(lambda text: [sent for sent in sent_tokenize(text)
                                       if any(True for w in word_tokenize(sent) 
                                               if w.lower() in searched_words)])

0    [Snakes are venomous., Anaconda is venomous.]
1    [Anaconda lives in Amazon.Amazon is a big forest., It is venomous.]
2    [Snakes,snakes,snakes everywhere!, !The least I expect is an anaconda.Because it is venomous.]
3    []
Name: text, dtype: object

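You'll notice a couple of issues: sent_tokenize did not split the sentences properly, because some of the punctuation is not followed by a space.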
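One way to work around the missing spaces is to normalise the text before tokenising it. The following is only a sketch: the helper name add_missing_spaces is made up for illustration, and the rule it applies (insert a space after ".", "!" or "?" whenever a letter follows directly) is an assumption that also splits abbreviations such as "e.g." and domain names, so it may be too aggressive for real data.

```python
import re

def add_missing_spaces(text):
    """Insert a space after '.', '!' or '?' when a letter follows immediately.

    Hypothetical helper: meant to be applied before sent_tokenize so that
    'Amazon.Amazon' becomes 'Amazon. Amazon'.
    """
    return re.sub(r'([.!?])(?=[A-Za-z])', r'\1 ', text)

row = "Anaconda lives in Amazon.Amazon is a big forest. It is venomous."
print(add_missing_spaces(row))
```

The cleaned string can then be passed to sent_tokenize in place of the raw text.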

Update: handling plurals
Here's an updated df:
text
Snakes are venomous. Anaconda is venomous.
Anaconda lives in Amazon. Amazon is a big forest. It is venomous.
Snakes,snakes,snakes everywhere! Mummyyyyyyy!!! The least I expect is an anaconda. Because it is venomous.
Python is dangerous too.
I have snakes


df = pd.read_clipboard(sep='0')

We can use a stemmer, for example one of those that ship with NLTK.
First, let's stem and lowercase the searched words:
searched_words = ['snakes','Venomous','anacondas']
searched_words = [stemmer.stem(w.lower()) for w in searched_words]
searched_words

> ['snake', 'venom', 'anaconda']
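The stemmer object used in this answer is never defined in the post. A minimal sketch of one possible setup follows, assuming NLTK's PorterStemmer; the choice of that particular stemmer is an assumption, since the text does not name one, although it does reproduce the stems shown above.

```python
from nltk.stem import PorterStemmer

# Assumption: the answer does not say which stemmer it uses;
# PorterStemmer is one of the stemmers available in NLTK.
stemmer = PorterStemmer()

searched_words = ['snakes', 'Venomous', 'anacondas']
# Lowercase first, then reduce each word to its stem.
searched_words = [stemmer.stem(w.lower()) for w in searched_words]
print(searched_words)
```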

Now we can rework the code above to include stemming:
print(df['text'].apply(lambda text: [sent for sent in sent_tokenize(text)
                           if any(True for w in word_tokenize(sent) 
                                     if stemmer.stem(w.lower()) in searched_words)]))

0    [Snakes are venomous., Anaconda is venomous.]
1    [Anaconda lives in Amazon., It is venomous.]
2    [Snakes,snakes,snakes everywhere!, The least I expect is an anaconda., Because it is venomous.]
3    []
4    [I have snakes]
Name: text, dtype: object


If you only need substring matching, make sure searched_words contains singular forms, not plurals:
print(df['text'].apply(lambda text: [sent for sent in sent_tokenize(text)
                           if any((w2.lower() in w.lower()) for w in word_tokenize(sent)
                                  for w2 in searched_words)]))

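By the way, at this point I would probably write a function with a regular for loop; this lambda with nested list comprehensions is getting out of hand.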
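The idea of replacing the lambda with a plain function could look roughly like the sketch below. The names find_sentences and split_sentences are hypothetical. To keep the example runnable without NLTK data, it uses a naive regex splitter as a stand-in for sent_tokenize, and it checks substrings against the whole lowercased sentence rather than token by token; both are simplifying assumptions, not what the answer itself does.

```python
import re

def split_sentences(text):
    """Naive stand-in for nltk's sent_tokenize (assumption for this sketch):
    split on whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def find_sentences(text, searched_words, sentence_splitter=split_sentences):
    """Return the sentences that contain at least one searched word
    as a case-insensitive substring."""
    lowered_words = [w.lower() for w in searched_words]
    matches = []
    for sentence in sentence_splitter(text):
        sentence_lower = sentence.lower()
        for word in lowered_words:
            if word in sentence_lower:
                matches.append(sentence)
                break  # one hit is enough; avoid adding the sentence twice
    return matches

print(find_sentences("Snakes are venomous. Python is dangerous too.",
                     ['snake', 'venom', 'anaconda']))
```

With a DataFrame this could be applied in the same way as the function in the question, e.g. df['text'].apply(find_sentences, args=(searched_words,)), and sent_tokenize can be passed as sentence_splitter to get NLTK's sentence boundaries instead of the regex ones.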
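Comments:

Please post your str_df as well as your desired output.

@JulienMarrec Edited. Thanks.

In your third example, since there are two sentences, it looks like what you want is coreference resolution, which is not easy. If you only want to extract sentences, it is easier (e.g. text delimited by . ! ?). Also, please show your current output, even if it is wrong.

Thanks, this works. Yes, I also ran into the problem with text like "Snakes are venomous.Python is too." I expected the output to be [Snakes are venomous.] but I got [Snakes are venomous.Python is too.], because there is no space before the second sentence. Can I also get the sentences with "snakes" when I only give "snake" in the word list? All I need is for substrings to match the specified words, so that I don't lose any data when analysing the context.

Stemming is a possible solution, but the problem is not limited to structured data. I have a web crawler that fetches the data, so I am talking about big data (GBs), and it contains things like "Snakess" or run-together words like "snakesare". So when I search the text for snake, it should also return the sentences with "snakesare"! Bottom line: I want substring matching to happen. Thanks for your reply.

I'm sure you could have figured it out, but I added a solution.

Thanks anyway, Julien, I'll look into it!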