Python: How do I count the frequency of words from a list in a DataFrame column?

python pandas dataframe


Suppose I have a DataFrame with the following layout:

ID#      Response
1234     Covid-19 was a disaster for my business
3456     The way you handled this pandemic was awesome

I would like to count how often specific words from a list occur:

list=['covid','COVID','Covid-19','pandemic','coronavirus']

In the end, I want to produce a dictionary like this:

{'covid': 0, 'COVID': 0, 'Covid-19': 1, 'pandemic': 1, 'coronavirus': 0}

Please help; I am really stuck on how to write this in Python.

import pandas as pd
import numpy as np


df = pd.DataFrame({'sheet':['sheet1', 'sheet2', 'sheet3', 'sheet2'],
    'tokenized_text':[['efcc', 'fficial', 'billiontwits', 'since', 'covid', 'landed'], ['when', 'people', 'say', 'the', 'fatality', 'rate', 'of', 'coronavirus', 'is'], ['in', 'the', 'coronavirus-induced', 'crisis', 'people', 'are',  'cyvbwx'], ['in', 'the', 'be-induced', 'crisis', 'people', 'are',  'cyvbwx']] })

print(df)

words_collection = ['covid','COVID','Covid-19','pandemic','coronavirus']

# Collect the tokens from every row into one flat list
all_words = []
for tokens in df['tokenized_text']:
    all_words.extend(tokens)

# Map each word in `words_collection` to the number of times it appears
word_to_number_of_occurrences = dict()

# Go over the word collection and set each word's count
for word in words_collection:
    word_to_number_of_occurrences[word] = all_words.count(word)

# {'covid': 1, 'COVID': 0, 'Covid-19': 0, 'pandemic': 0, 'coronavirus': 1}
print(word_to_number_of_occurrences)
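
The same tally can probably be written without an explicit loop over the rows. A minimal sketch, assuming pandas 0.25 or later (for Series.explode) and a shortened copy of the sample data; token_counts and result are illustrative names:

```python
import pandas as pd

# Shortened copy of the sample data: each cell holds a list of tokens.
df = pd.DataFrame({
    'tokenized_text': [
        ['efcc', 'since', 'covid', 'landed'],
        ['fatality', 'rate', 'of', 'coronavirus', 'is'],
        ['in', 'the', 'coronavirus-induced', 'crisis'],
    ]
})
words_collection = ['covid', 'COVID', 'Covid-19', 'pandemic', 'coronavirus']

# explode() gives every token its own row; value_counts() tallies them.
token_counts = df['tokenized_text'].explode().value_counts()

# Words that never occur are missing from the index, so default to 0.
result = {word: int(token_counts.get(word, 0)) for word in words_collection}
print(result)
```

Because tokens are compared as whole strings, 'coronavirus-induced' does not count towards 'coronavirus' here.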

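For each string, find the number of matches (list_of_strings is the list of words to search for):

dict((s, df['Response'].str.count(s).fillna(0).sum()) for s in list_of_strings)

Note that Series.str.count takes a regular expression as input. You may want to append (?=\b) to each pattern as a positive lookahead for a word ending.

Series.str.count returns NA for NA entries, so fill those with 0. Then, for each string, sum over the column.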
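
To illustrate the regular-expression point, here is a sketch that escapes each search string and adds word boundaries, assuming the Response column from the question. The leading \b and the use of re.escape are additions for illustration, not part of the answer above:

```python
import re

import pandas as pd

df = pd.DataFrame({
    'ID#': [1234, 3456],
    'Response': [
        'Covid-19 was a disaster for my business',
        'The way you handled this pandemic was awesome',
    ],
})
list_of_strings = ['covid', 'COVID', 'Covid-19', 'pandemic', 'coronavirus']

counts = {}
for s in list_of_strings:
    # re.escape keeps characters such as '-' or '.' literal;
    # \b ... (?=\b) restricts matches to whole words.
    pattern = r'\b' + re.escape(s) + r'(?=\b)'
    counts[s] = int(df['Response'].str.count(pattern).fillna(0).sum())

print(counts)
```

Matching stays case-sensitive, so 'covid' and 'COVID' do not match 'Covid-19' in this data.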

Try using np.hstack and Counter:
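
np.hstack can flatten the per-row token lists into a single array, which collections.Counter then tallies. A rough sketch of that idea, assuming a tokenized_text column of token lists; this is one reading of the suggestion, not necessarily the answerer's exact code:

```python
from collections import Counter

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'tokenized_text': [
        ['efcc', 'fficial', 'billiontwits', 'since', 'covid', 'landed'],
        ['when', 'people', 'say', 'the', 'fatality', 'rate', 'of', 'coronavirus', 'is'],
    ]
})
words_collection = ['covid', 'COVID', 'Covid-19', 'pandemic', 'coronavirus']

# Flatten the per-row token lists into one 1-D array, then tally every token.
all_tokens = np.hstack(df['tokenized_text'].tolist())
token_counts = Counter(all_tokens.tolist())

# Counter returns 0 for missing keys, so absent words need no special handling.
result = {word: token_counts[word] for word in words_collection}
print(result)
```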

You can do this quite simply with a dict comprehension:

{x:df.Response.str.count(x).sum() for x in list}

Output:

{'covid': 0, 'COVID': 0, 'Covid-19': 1, 'pandemic': 1, 'coronavirus': 0}

Given a line of input text "covid covid covid", is that three or one? It should be 3.

Elegant, short solution. +1
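
On the "three or one" question: Series.str.count counts every non-overlapping match within each row, so a row reading "covid covid covid" contributes 3 to the total. A quick check, using a made-up two-row frame (demo is a hypothetical name):

```python
import pandas as pd

# Made-up data: one row with three occurrences, one row with none.
demo = pd.DataFrame({'Response': ['covid covid covid', 'no match here']})

# Per-row match counts, then the column total.
per_row = demo['Response'].str.count('covid')
total = int(per_row.sum())

print(per_row.tolist())  # [3, 0]
print(total)             # 3
```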