Python: Speeding up word counts in a DataFrame

python pandas performance optimization time-complexity

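I have a DataFrame column, df['psych_prob'], and a dictionary that maps categories to their lists of words. I need to count how many times words from each category occur in that column.

The code below works for me, but my real dataset has more than 100,000 rows, more than 40 categories, and over 500 words per category. It takes more than an hour to run. I am trying to optimize the following code for speed: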

import pandas as pd

dummy_dict={
            'psych_prob':
            ['he would be happy about it, but i am sad it does not make sense to her', 
               'how i can be happy i am anxious all the time my fathers death is here']
           }
df=pd.DataFrame(dummy_dict)

# dictionary containing categories and their list of words
category_dict={'pronouns':['he', "he'd", "he's", 'her', 'hers'],
               'neg_emo':['sad','anxious','angry'], 
               'pos_emo':['happy','excited']}

for category in category_dict:
    #join the list of words by space and pipe
    category_joined=' |'.join(e for e in category_dict[category]) 

    #count how many times the list of words belonging to the category appears in the dataframe
    category_count=df.psych_prob.str.count(category_joined).sum()

    print('Category:',category) 
    print('Words to search:',category_joined)
    print('Total words of this category in dataframe:', category_count)
    print('\n')

Edit: here is a solution that uses word boundaries and joins the column's values into one long string:

import re

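# Join once, outside the loop, to improve performance
str_vals = ' '.join(df['psych_prob'])

for category, vals in category_dict.items():

    # Escape each word so it is matched literally, and wrap it in word boundaries
    pat = '|'.join(r"\b{}\b".format(re.escape(x)) for x in vals)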
    category_count = len(re.findall(pat, str_vals))
    
    print('Category:',category) 
    print('Total words of this category in dataframe:', category_count)
    print('\n')
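One caveat with the pattern above: `\b` treats an apostrophe as a boundary, so `\bhe\b` also matches the "he" inside "he'd". The following is a minimal sketch of one way around that, assuming words are made of letters and apostrophes only. It tries longer alternatives first and replaces `\b` with lookarounds. The helper names `build_pattern` and `count_categories` are hypothetical, not part of the original answer.

```python
import re

def build_pattern(words):
    # Longest alternatives first, so "he'd" is tried before "he".
    ordered = sorted(words, key=len, reverse=True)
    alternatives = "|".join(re.escape(w) for w in ordered)
    # Lookarounds reject matches that touch a word character or an apostrophe
    # (assumption: apostrophes only occur inside words, never as quotes).
    return re.compile(r"(?<![\w'])(?:" + alternatives + r")(?![\w'])")

def count_categories(text, category_dict):
    # One compiled pattern per category, one scan of the joined text each.
    return {category: len(build_pattern(words).findall(text))
            for category, words in category_dict.items()}

category_dict = {'pronouns': ['he', "he'd", "he's", 'her', 'hers'],
                 'neg_emo': ['sad', 'anxious', 'angry'],
                 'pos_emo': ['happy', 'excited']}

text = ('he would be happy about it, but i am sad it does not make sense to her '
        'how i can be happy i am anxious all the time my fathers death is here')

print(count_categories(text, category_dict))
```

With a DataFrame, `text` would be the joined column, as in `' '.join(df['psych_prob'])`.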

I think you can improve performance by using an Aho-Corasick frequency counter, after joining the column into one long string with join:

# https://stackoverflow.com/a/51604049/2901002
import ahocorasick  # provided by the pyahocorasick package

def ac_frequency(needles, haystack):
    frequencies = [0] * len(needles)
    # Make a searcher
    searcher = ahocorasick.Automaton()
    for i, needle in enumerate(needles):
        searcher.add_word(needle, i)
    searcher.make_automaton()
    # Add up all frequencies
    for _, i in searcher.iter(haystack):
        frequencies[i] += 1
    return frequencies

# Join once, outside the loop, to improve performance
str_vals = ' '.join(df['psych_prob'])

for category, vals in category_dict.items():
    # Per-word frequencies, in the same order as vals
    category_count = ac_frequency(vals, str_vals)
    
    print('Category:',category) 
    print('Total words of this category in dataframe:', category_count)
    print('\n')
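If every category word is a single token without spaces, as in the sample data (an assumption about the real data), another option is to tokenise each row once and look tokens up in a reverse index from word to categories. This is a sketch only: `texts` stands in for `df['psych_prob']`, and the names `word_to_categories` and `token_re` are hypothetical. It counts whole tokens, so substrings such as "her" inside "fathers" are not counted.

```python
import re
from collections import Counter

# Stand-in for df['psych_prob']; iterating a pandas Series yields the same strings.
texts = ['he would be happy about it, but i am sad it does not make sense to her',
         'how i can be happy i am anxious all the time my fathers death is here']

category_dict = {'pronouns': ['he', "he'd", "he's", 'her', 'hers'],
                 'neg_emo': ['sad', 'anxious', 'angry'],
                 'pos_emo': ['happy', 'excited']}

# Reverse index: word -> categories containing it (a word may be in several).
word_to_categories = {}
for category, words in category_dict.items():
    for word in words:
        word_to_categories.setdefault(word, []).append(category)

# A token is a run of word characters and apostrophes, so "he'd" stays whole.
token_re = re.compile(r"[\w']+")

category_counts = Counter()
for text in texts:
    for token in token_re.findall(text.lower()):
        # Dictionary lookup per token, independent of the number of categories.
        for category in word_to_categories.get(token, ()):
            category_counts[category] += 1

print(dict(category_counts))
```

The work here grows with the total number of tokens rather than with rows times categories times words, which is the part that matters for 40 categories of 500 words each.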

Thanks for the suggestion and the code. However, it does not give me the exact answer. For example, for pronouns, with ['he', "he'd", "he's", 'her', 'hers'], it gives "Category: pronouns, Total words of this category in dataframe: [5, 0, 0, 3, 1]" when it should be [1, 0, 0, 2, 1].

Can you be more specific?

I mean it is not matching whole words exactly. It counted "he" 5 times, but it appears only once. It counted "her" 3 times, but it appears only twice.

@Noor Hmm, one thing: do the real categories contain words with spaces, such as "like very much"? Or are all the words free of spaces, as in the sample data?

Sorry, you are right, my solution is incorrect. For example, it will find "her" in "...my fathers death..." even though it should not.

Thanks for helping me understand my mistake.