
Python: speeding up word counting in a DataFrame

Tags: python, pandas, performance, optimization, time-complexity

I have a DataFrame column df['psych_prob'] and a dictionary that maps categories to lists of words. I need to count how many times the words belonging to each category appear in the DataFrame column.

The code below works for me, but my real dataset has more than 100,000 rows, more than 40 categories, and more than 500 words per category, and it took over an hour to run. I am trying to optimize the code below for speed.

import pandas as pd

dummy_dict={
            'psych_prob':
            ['he would be happy about it, but i am sad it does not make sense to her', 
               'how i can be happy i am anxious all the time my fathers death is here']
           }
df=pd.DataFrame(dummy_dict)

# dictionary containing categories and their list of words
category_dict={'pronouns':['he', "he'd", "he's", 'her', 'hers'],
               'neg_emo':['sad','anxious','angry'], 
               'pos_emo':['happy','excited']}

for category in category_dict:
    #join the list of words by space and pipe
    category_joined=' |'.join(e for e in category_dict[category]) 

    #count how many times the list of words belonging to the category appears in the dataframe
    category_count=df.psych_prob.str.count(category_joined).sum()

    print('Category:',category) 
    print('Words to search:',category_joined)
    print('Total words of this category in dataframe:', category_count)
    print('\n')
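A side note on why this pattern miscounts (a minimal illustration, not from the original post): joining with ' |' produces an alternation like "sad |happy" in which every word except the last requires a trailing space, so a match at the end of a string or before punctuation is missed, and without word boundaries substrings also match:

```python
import re

# ' |'.join gives every alternative except the last a trailing space
pattern = ' |'.join(['sad', 'happy'])   # -> "sad |happy"

# 'sad' at the end of the string has no trailing space, so it is missed
print(re.findall(pattern, 'i am sad'))      # []

# 'happy' (the last alternative, no trailing space) matches inside 'unhappy'
print(re.findall(pattern, 'i am unhappy'))  # ['happy']
```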

Edit: here is a solution that uses word boundaries and joins the column values into one long string:

import re

# join outside the loop to improve performance
str_vals = ' '.join(df['psych_prob'])

for category, vals in category_dict.items():
    
    pat = '|'.join(r"\b{}\b".format(x) for x in vals)
    category_count = len(re.findall(pat, str_vals))
    
    print('Category:',category) 
    print('Total words of this category in dataframe:', category_count)
    print('\n')
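As a quick sanity check (my own worked example, with the two sample sentences standing in for the joined column), the word-boundary version counts only whole words:

```python
import re

# the two sample sentences joined, standing in for ' '.join(df['psych_prob'])
str_vals = ('he would be happy about it, but i am sad it does not make sense to her '
            'how i can be happy i am anxious all the time my fathers death is here')

category_dict = {'pronouns': ['he', "he'd", "he's", 'her', 'hers'],
                 'neg_emo': ['sad', 'anxious', 'angry'],
                 'pos_emo': ['happy', 'excited']}

counts = {}
for category, vals in category_dict.items():
    pat = '|'.join(r"\b{}\b".format(x) for x in vals)
    counts[category] = len(re.findall(pat, str_vals))

print(counts)  # {'pronouns': 2, 'neg_emo': 2, 'pos_emo': 2}
```

Note that 'her' no longer matches inside 'fathers' or 'here' because of the \b anchors.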
    
I think you can improve performance with an Aho-Corasick frequency counter, after joining the column values into one long string:

# https://stackoverflow.com/a/51604049/2901002
import ahocorasick

def ac_frequency(needles, haystack):
    frequencies = [0] * len(needles)
    # Make a searcher
    searcher = ahocorasick.Automaton()
    for i, needle in enumerate(needles):
        searcher.add_word(needle, i)
    searcher.make_automaton()
    # Add up all frequencies
    for _, i in searcher.iter(haystack):
        frequencies[i] += 1
    return frequencies

# join outside the loop to improve performance
str_vals = ' '.join(df['psych_prob'])

for category, vals in category_dict.items():
    # count occurrences of each category's words in the joined text
    category_count = ac_frequency(vals, str_vals)
    
    print('Category:',category) 
    print('Total words of this category in dataframe:', category_count)
    print('\n')
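One caveat worth checking against your data (a side observation, not part of the original answer): an automaton built from raw words matches substrings anywhere in the text, not whole words, so for example 'her' is found inside 'fathers' and 'here'. A plain str.count illustration of the same effect:

```python
# plain substring counting, which is what an automaton built from raw words also does
text = 'my fathers death is here'
print(text.count('her'))  # 2: inside 'fathers' and 'here', though the word 'her' never appears
```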


Comments:

Thanks for the suggestion and the written code. However, it does not give me the correct answer. For example, for pronouns ['he', "he'd", "he's", 'her', 'hers'] it gives Category: pronouns, Total words of this category in dataframe: [5, 0, 0, 3, 1], but it should be [1, 0, 0, 2, 1].

Can you be more specific?

I mean it does not match the words exactly. It counts 'he' 5 times although it appears only once, and it counts 'her' 3 times although it appears only twice.

@Noor hmm, one thing: in the real categories, are there words containing spaces (multi-word phrases)? Or are all the words single words without spaces, as in the sample data?

Sorry, you are right, my solution is incorrect because, for example, it finds 'her' inside '...my fathers death...' even though it should not. Thank you for helping me understand my mistake.
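Given the exact-match problem discussed in the comments, one alternative sketch (my own suggestion, not from the thread, assuming the category entries are single tokens, possibly with apostrophes like "he's") is to tokenize the joined text once into a collections.Counter and look each category's words up in it, so the text is scanned only once regardless of the number of categories:

```python
import re
from collections import Counter

def category_counts(text, category_dict):
    # tokenize once; [\w']+ keeps contractions like "he's" as a single token
    tokens = Counter(re.findall(r"[\w']+", text.lower()))
    # missing words simply count as 0 (Counter returns 0 for absent keys)
    return {cat: sum(tokens[w] for w in words)
            for cat, words in category_dict.items()}

str_vals = ('he would be happy about it, but i am sad it does not make sense to her '
            'how i can be happy i am anxious all the time my fathers death is here')

category_dict = {'pronouns': ['he', "he'd", "he's", 'her', 'hers'],
                 'neg_emo': ['sad', 'anxious', 'angry'],
                 'pos_emo': ['happy', 'excited']}

print(category_counts(str_vals, category_dict))
# {'pronouns': 2, 'neg_emo': 2, 'pos_emo': 2}
```

On 100k+ rows this replaces 40+ regex scans of the full text with a single tokenization pass plus dictionary lookups, at the cost of only handling single-token category words.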