Python: remove stop words and select only nouns in pandas


python regex pandas


I am trying to extract the most important words by date, as follows:

df.set_index('Publishing_Date').Quotes.str.lower().str.extractall(r'(\w+)')[0].groupby('Publishing_Date').value_counts().groupby('Publishing_Date')

on the following DataFrame:

import pandas as pd 

# initialize 
data = [['20/05', "So many books, so little time." ], ['20/05', "The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid." ], ['19/05', 
"Don't be pushed around by the fears in your mind. Be led by the dreams in your heart."], ['19/05', "Be the reason someone smiles. Be the reason someone feels loved and believes in the goodness in people."], ['19/05', "Do what is right, not what is easy nor what is popular."]] 

# Create the pandas DataFrame 
df = pd.DataFrame(data, columns = ['Publishing_Date', 'Quotes'])

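As you can see, there are many stop words ("the", "an", "a", "be", ...) that I would like to remove in order to get a better selection. My goal is to find common keywords, i.e. patterns, by date, so I am more interested in nouns than in verbs.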

Do you know how I can remove the stop words and keep only the nouns?

Edit

Expected output (based on the result of Vaibhav Khandelwal's answer below):

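I only need to extract the nouns (for example, "reason" should come out as more frequent, so that the results can be sorted by frequency).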

I think nltk.pos_tag, keeping only the tokens whose tag is 'NN', should be useful here.

This is how to remove stop words from text:
import nltk
from nltk.corpus import stopwords

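def remove_stopwords(text):
    # Requires the NLTK 'stopwords' corpus: nltk.download('stopwords')
    stop_words = stopwords.words('english')
    fresh_text = []

    # Lowercase the text, split on whitespace, and keep non-stop words only
    for word in text.lower().split():
        if word not in stop_words:
            fresh_text.append(word)

    return ' '.join(fresh_text)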

df['text'] = df['Quotes'].apply(remove_stopwords)

Note: if you want to remove additional words, append them explicitly to the stopwords list.
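A minimal sketch of that note, under stated assumptions: `BASE_STOP_WORDS` is a tiny hand-written stand-in for `stopwords.words('english')` so the example runs without NLTK data, and `EXTRA_STOP_WORDS` holds hypothetical words chosen only for illustration. It also tokenizes with `\w+`, as in the question, so that trailing punctuation such as the period in "time." does not prevent a match.

```python
import re

# Assumption: a small stand-in for nltk.corpus.stopwords.words('english'),
# used here only so the sketch runs without downloading NLTK data.
BASE_STOP_WORDS = {"so", "the", "a", "an", "be", "in", "by", "is", "not"}

# Hypothetical extra words to drop on top of the base list.
EXTRA_STOP_WORDS = {"many", "little"}

# A set gives constant-time membership tests.
STOP_WORDS = BASE_STOP_WORDS | EXTRA_STOP_WORDS


def remove_stopwords(text):
    # \w+ keeps only word characters, so "time." is matched as "time".
    words = re.findall(r"\w+", text.lower())
    return " ".join(w for w in words if w not in STOP_WORDS)


cleaned = remove_stopwords("So many books, so little time.")
print(cleaned)
```

With the real NLTK list, the same idea would be to extend the list returned by `stopwords.words('english')` with your own words before filtering.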

For the other half, you can add another function that extracts the nouns:
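def extract_noun(text):
    # Tokenize, tag each token with its part of speech, and keep the nouns
    # (all Penn Treebank noun tags start with 'NN': NN, NNS, NNP, NNPS)
    token = nltk.tokenize.word_tokenize(text)
    result = []
    for i in nltk.pos_tag(token):
        if i[1].startswith('NN'):
            result.append(i[0])

    return ', '.join(result)

df['NOUN'] = df['text'].apply(extract_noun)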
The final result looks like this:
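Look into NLP -- NLTK.
Does this answer your question?
from nltk.corpus import stopwords
stop_words_list = stopwords.words('english')
df['Quotes'].apply(lambda x: [item for item in x if item not in stop_words_list])
I got this error: TypeError: 'float' object is not iterable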
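One likely cause of that TypeError, assuming the real column contains missing values: pandas represents a missing entry as NaN, which is a float, so the list comprehension fails when it tries to iterate over it. Separately, iterating over a string yields characters rather than words, so the text has to be split first. The sketch below guards against both; `STOP_WORDS` and `drop_stop_words` are hypothetical names, and the short stop list is a stand-in for the NLTK one.

```python
# Assumption: a small stand-in for the NLTK English stop word list.
STOP_WORDS = {"the", "a", "an", "be", "so"}


def drop_stop_words(value):
    # Missing values arrive as float NaN; return an empty list for them
    # instead of trying to iterate over a float.
    if not isinstance(value, str):
        return []
    # Split into words first; iterating a str directly yields characters.
    return [w for w in value.lower().split() if w not in STOP_WORDS]


quotes = ["Be the reason", float("nan")]
result = [drop_stop_words(q) for q in quotes]
print(result)
```

On a DataFrame the same function could be passed to `df['Quotes'].apply(drop_stop_words)`.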
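But I don't know how to keep only the nouns in the quotes.
What is your expected output? Is it just removing the stop words from the 'Quotes' column? You can also post the expected output in the question, as you did with the input.
Thank you @Vaibhav Khandelwal. Your solution only partly answers my question, but it is good and helpful, so I upvoted it.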
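For the remaining part of the question, ranking nouns by frequency within each date, one possible sketch is to split a comma-separated noun column into one row per noun and count per group. The `NOUN` values below are hypothetical placeholders written by hand for illustration, not actual `nltk.pos_tag` output.

```python
import pandas as pd

# Hypothetical NOUN values, shaped like the comma-separated strings
# that extract_noun returns; not real tagger output.
df = pd.DataFrame({
    "Publishing_Date": ["20/05", "19/05", "19/05"],
    "NOUN": ["books, time", "reason, someone, reason", "reason, people"],
})

counts = (
    df.assign(noun=df["NOUN"].str.split(", "))  # string -> list of nouns
      .explode("noun")                          # one row per noun
      .groupby("Publishing_Date")["noun"]
      .value_counts()                           # sorted by count within each date
)
print(counts)

# Most frequent noun(s) per date: take the first row of each group.
top = counts.groupby(level="Publishing_Date").head(1)
```

Rows whose `NOUN` string is empty would produce an empty-string entry after the split, so they may need to be filtered out before counting.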