Python: speeding up word counts in a DataFrame
I have a DataFrame column, df['psych_prob'], and a dictionary that maps categories to their lists of words. I need to count how many times the words belonging to each category occur in that column. The code below works for me, but my real dataset has more than 100,000 rows and more than 40 categories, each with more than 500 words, and the code takes over an hour to run. I am trying to make the following code faster:
import pandas as pd

# sample data
dummy_dict = {
    'psych_prob': [
        'he would be happy about it, but i am sad it does not make sense to her',
        'how i can be happy i am anxious all the time my fathers death is here',
    ]
}
df = pd.DataFrame(dummy_dict)

# dictionary containing categories and their list of words
category_dict = {
    'pronouns': ['he', "he'd", "he's", 'her', 'hers'],
    'neg_emo': ['sad', 'anxious', 'angry'],
    'pos_emo': ['happy', 'excited'],
}

for category in category_dict:
    # join the list of words by space and pipe
    category_joined = ' |'.join(category_dict[category])
    # count how many times the words belonging to the category appear in the dataframe
    category_count = df.psych_prob.str.count(category_joined).sum()
    print('Category:', category)
    print('Words to search:', category_joined)
    print('Total words of this category in dataframe:', category_count)
    print('\n')
Edit: here is a solution that uses word boundaries and joins the column's values into one long string:

import re

# join once, outside the loop, to improve performance
str_vals = ' '.join(df['psych_prob'])

for category, vals in category_dict.items():
    # one alternation per category, each word wrapped in word boundaries
    pat = '|'.join(r"\b{}\b".format(x) for x in vals)
    category_count = len(re.findall(pat, str_vals))
    print('Category:', category)
    print('Total words of this category in dataframe:', category_count)
    print('\n')
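A possible refinement of the pattern above, offered as a sketch rather than a tested drop-in. `\bhe\b` also matches the `he` inside `he's`, because the apostrophe is a non-word character, and any word containing regex metacharacters would need escaping. The version below escapes each word with `re.escape`, tries longer words first, and uses lookarounds that treat an apostrophe as part of a word. The helper name `build_category_pattern` and the apostrophe rule are assumptions, not part of the original answer; the sample sentences are inlined so the snippet runs without pandas.

```python
import re

def build_category_pattern(words):
    """Compile one whole-word pattern for a category (hypothetical helper)."""
    # Longest first, so "he's" is tried before "he" at the same position.
    ordered = sorted(words, key=len, reverse=True)
    alternation = "|".join(re.escape(w) for w in ordered)
    # Reject matches glued to a letter, digit, underscore or apostrophe.
    return re.compile(r"(?<![\w'])(?:" + alternation + r")(?![\w'])")

sentences = [
    "he would be happy about it, but i am sad it does not make sense to her",
    "how i can be happy i am anxious all the time my fathers death is here",
]
category_dict = {
    "pronouns": ["he", "he'd", "he's", "her", "hers"],
    "neg_emo": ["sad", "anxious", "angry"],
    "pos_emo": ["happy", "excited"],
}

# Join once, outside the loop, as in the answer above.
str_vals = " ".join(sentences)
category_counts = {
    category: len(build_category_pattern(vals).findall(str_vals))
    for category, vals in category_dict.items()
}
print(category_counts)
```

With a DataFrame, `sentences` would be `df['psych_prob']`. Compiling each category's pattern once and reusing it avoids rebuilding the alternation on every call.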
I think you can improve performance with an Aho-Corasick frequency counter, after joining the column into one long string:

import ahocorasick  # provided by the pyahocorasick package

# https://stackoverflow.com/a/51604049/2901002
def ac_frequency(needles, haystack):
    frequencies = [0] * len(needles)
    # Make a searcher
    searcher = ahocorasick.Automaton()
    for i, needle in enumerate(needles):
        searcher.add_word(needle, i)
    searcher.make_automaton()
    # Add up all frequencies
    for _, i in searcher.iter(haystack):
        frequencies[i] += 1
    return frequencies

# join once, outside the loop, to improve performance
str_vals = ' '.join(df['psych_prob'])

for category, vals in category_dict.items():
    # list of per-word frequencies, in the same order as vals
    category_count = ac_frequency(vals, str_vals)
    print('Category:', category)
    print('Total words of this category in dataframe:', category_count)
    print('\n')
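One caveat worth checking before running this on real data: the automaton reports every occurrence of a needle as a substring, not only whole words. The sketch below does not use `ahocorasick`. It uses plain `str.count` as a stand-in (for these particular words the counts coincide, since none of them can overlap with itself) and compares the result with a whole-token count. Both helper names are made up for illustration.

```python
def substring_frequency(needles, haystack):
    """Count each needle as a raw substring (stand-in for substring search)."""
    return [haystack.count(needle) for needle in needles]

def whole_word_frequency(needles, haystack):
    """Count each needle only when it is a whole whitespace-separated token."""
    tokens = haystack.split()
    return [tokens.count(needle) for needle in needles]

# The two sample rows joined into one string, as in the answer above.
str_vals = (
    "he would be happy about it, but i am sad it does not make sense to her "
    "how i can be happy i am anxious all the time my fathers death is here"
)
pronouns = ["he", "he'd", "he's", "her", "hers"]

print(substring_frequency(pronouns, str_vals))   # [5, 0, 0, 3, 1]
print(whole_word_frequency(pronouns, str_vals))  # [1, 0, 0, 1, 0]
```

The extra hits come from `the`, `her`, `fathers` and `here`, which all contain `he`. On these two sentences a strict whole-token count finds `her` once and `hers` not at all; `fathers` and `here` only contain them as substrings.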
Thank you for the suggestion and the code. However, it does not give me the correct answer. For example, for pronouns, ['he', "he'd", "he's", 'her', 'hers'], it prints "Category: pronouns, Total words of this category in dataframe: [5, 0, 0, 3, 1]", but it should be [1, 0, 0, 2, 1].
Can you be more specific?
I mean it does not match whole words exactly. It counted "he" 5 times, but it appears only once. It counted "her" 3 times, but it appears only 2 times.
@Noor: hmm, one thing: do the real categories contain entries with spaces (multi-word phrases), or are all entries single words without spaces, as in the sample data?
Sorry, you are right, my solution is not correct, because for example it finds "hers" in "...my fathers death..." even though it should not. Thank you for helping me understand my mistake.
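Following up on the whole-word problem raised in the comments, here is one possible alternative that was not proposed in the thread: tokenize the text once, count tokens with `collections.Counter`, and sum the counts of each category's words. It assumes every category entry is a single token without spaces (the open question in the comments) and that lowercase letters plus internal apostrophes define a token. Both are assumptions about the real data, and `count_categories` is an illustrative name.

```python
import re
from collections import Counter

# Assumed token rule: runs of letters, optionally with internal apostrophes.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")

def count_categories(texts, category_dict):
    """Return {category: total whole-word occurrences} (illustrative helper)."""
    token_counts = Counter()
    for text in texts:
        # One pass over each row; lowercasing makes the match case-insensitive.
        token_counts.update(TOKEN_RE.findall(text.lower()))
    # A Counter returns 0 for missing keys, so absent words cost one lookup.
    return {
        category: sum(token_counts[word] for word in words)
        for category, words in category_dict.items()
    }

texts = [
    "he would be happy about it, but i am sad it does not make sense to her",
    "how i can be happy i am anxious all the time my fathers death is here",
]
category_dict = {
    "pronouns": ["he", "he'd", "he's", "her", "hers"],
    "neg_emo": ["sad", "anxious", "angry"],
    "pos_emo": ["happy", "excited"],
}

print(count_categories(texts, category_dict))
```

With a DataFrame, `texts` can be the column itself, for example `count_categories(df['psych_prob'], category_dict)`. The work is one pass over the text plus one dictionary lookup per category word, so adding categories or words does not require rescanning the rows.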