Python: creating aggregate columns based on a list of column headers

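I have a data frame containing survey data. It holds several other columns, including demographic data (such as age, department, etc.), as well as columns with ratings. I would like to add some columns to the data frame, computed from the rating columns.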

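The purpose of the new columns is to provide, for each factor:

a) the count of favourable responses
b) the percentage of favourable responses (number of favourable responses / number of items in that factor)
c) the factor-level percentage of favourable responses (NaN if any item belonging to that factor is NaN)

The table below shows how this would apply to the Coaching factor. I would like to generalise this to other factors, such as Diversity, Leadership and Engagement.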

Coach_q1     Coach_q2     Coach_q8      coach_favcount  coach_fav_perc  coach_agg_perc
Favourable   Neutral      Favourable    2               66.6%           66.6%
Favourable   Favourable   NaN           2               100%            NaN
Favourable   Favourable   Unfavourable  2               66.6%           66.6%
NaN          NaN          Unfavourable  0               0%              NaN
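To make the three definitions concrete, here is a minimal plain-Python sketch for a single respondent. The helper name `summarise_row` and the use of `None` for a missing answer are assumptions made for illustration. It also assumes that the percentage in b) is divided by the number of answered items, which is what the 100% in the second row of the table implies.

```python
import math

FAVOURABLE = "Favourable"

def summarise_row(responses):
    """Summarise one respondent's ratings for one factor.

    `responses` is a list of ratings; None marks a missing answer
    (hypothetical convention for this sketch).
    """
    answered = [r for r in responses if r is not None]
    fav_count = sum(r == FAVOURABLE for r in answered)
    # b) share of favourable answers among the answered items
    fav_perc = fav_count / len(answered) if answered else math.nan
    # c) factor-level share, only defined when every item was answered
    complete = len(answered) == len(responses)
    agg_perc = fav_count / len(responses) if complete else math.nan
    return fav_count, fav_perc, agg_perc

rows = [
    ["Favourable", "Neutral", "Favourable"],
    ["Favourable", "Favourable", None],
    ["Favourable", "Favourable", "Unfavourable"],
    [None, None, "Unfavourable"],
]
results = [summarise_row(r) for r in rows]
for r in results:
    print(r)
```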

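I have used the code below and it works; however, I only get the fav_count and fav_perc columns, and only for Coaching. I would like to a) get the _agg_perc column and b) apply this to all the other factors.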

# Get the Coaching columns
coaching_cols = df.loc[:, df.columns.str.contains('Coaching_')]

# Create a column to store the number of favourable responses
df['coaching_fav_count'] = (coaching_cols == 'Favourable').sum(axis=1)

# Create a column to store the percentage of favourable responses
df['coaching_fav_perc'] = df['coaching_fav_count'] / len(coaching_cols.columns)

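I guess the logic behind the for loop is to a) create a list of the rating columns (see the code below), b) create a function that calculates the count and the percentage of favourable responses and checks for NaN at the item level, and c) create a for loop that applies the function to the rating columns.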

#Create a list made up of rating cols
ratingcollist = ['Coaching_','Communication_','Development_','Diversity_','Engagement_']

ratingcols = df.loc[:, df.columns.str.contains('|'.join(ratingcollist))]

Any help would be appreciated. Thank you.

I believe you need to process each value of the list separately:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Coach_q1': ['Favourable', 'Favourable', 'Favourable', 'nan'], 
                   'Coach_q2': ['Neutral', 'Favourable', 'Favourable', 'NaN'], 
                   'Coach_q8': ['Favourable', 'nan', 'Unfavourable', 'Unfavourable']})
    
print (df)
     Coach_q1    Coach_q2      Coach_q8
0  Favourable     Neutral    Favourable
1  Favourable  Favourable           nan
2  Favourable  Favourable  Unfavourable
3         nan         NaN  Unfavourable

#replace nan and NaN strings to missing values
df = df.replace(['nan','NaN'], np.nan)

ratingcollist = ['Coach_','Communication_','Development_','Diversity_','Engagement_']

for rat in ratingcollist:
    #filter columns by substrings
    cols = df.filter(like=rat).columns

    #mask for no missing values
    mask = df[cols].notna().all(axis=1)
    
    #create new columns if match
    if len(cols) > 0:
        df[f'{rat.lower()}fav_count'] = (df[cols] == 'Favourable').sum(axis=1)
        df[f'{rat.lower()}fav_perc'] = df[f'{rat.lower()}fav_count'] / df[cols].count(axis=1)
        df.loc[mask, f'{rat.lower()}agg_perc'] = df.loc[mask, f'{rat.lower()}fav_count'] / len(cols)
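As a quick check, the same loop can be wrapped in a function and run against the sample rows. This is only a sketch: the function name `add_factor_columns` and its parameters are hypothetical, and it assumes real NaN values (not strings) mark missing answers.

```python
import numpy as np
import pandas as pd

def add_factor_columns(df, prefixes, fav='Favourable'):
    """Return a copy of df with fav_count, fav_perc and agg_perc per prefix."""
    out = df.copy()
    for prefix in prefixes:
        # select the rating columns from the original frame only
        cols = df.filter(like=prefix).columns
        if len(cols) == 0:
            continue
        name = prefix.lower()
        # rows where every item of the factor was answered
        complete = df[cols].notna().all(axis=1)
        fav_count = (df[cols] == fav).sum(axis=1)
        out[f'{name}fav_count'] = fav_count
        # count() skips NaN, so this divides by the answered items
        out[f'{name}fav_perc'] = fav_count / df[cols].count(axis=1)
        # keep the factor-level value only for complete rows
        out[f'{name}agg_perc'] = (fav_count / len(cols)).where(complete)
    return out

sample = pd.DataFrame({
    'Coach_q1': ['Favourable', 'Favourable', 'Favourable', np.nan],
    'Coach_q2': ['Neutral', 'Favourable', 'Favourable', np.nan],
    'Coach_q8': ['Favourable', np.nan, 'Unfavourable', 'Unfavourable'],
})
result = add_factor_columns(sample, ['Coach_', 'Diversity_'])
print(result)
```

Prefixes with no matching columns (here 'Diversity_') are skipped, so the same list can be reused across frames that contain only some of the factors.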

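If you replace the NaNs with the word Missing, the output for fav_perc is wrong: the second value should be 1 (2 favourable out of 2 answered items), because count only excludes real missing values and not the string Missing: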

df = pd.DataFrame({'Coach_q1': ['Favourable', 'Favourable', 'Favourable', 'nan'], 
                   'Coach_q2': ['Neutral', 'Favourable', 'Favourable', 'NaN'], 
                   'Coach_q8': ['Favourable', 'nan', 'Unfavourable', 'Unfavourable']})
    
print (df)
     Coach_q1    Coach_q2      Coach_q8
0  Favourable     Neutral    Favourable
1  Favourable  Favourable           nan
2  Favourable  Favourable  Unfavourable
3         nan         NaN  Unfavourable

df = df.replace(['nan','NaN'], 'Missing')
print (df)
     Coach_q1    Coach_q2      Coach_q8
0  Favourable     Neutral    Favourable
1  Favourable  Favourable       Missing
2  Favourable  Favourable  Unfavourable
3     Missing     Missing  Unfavourable
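The difference can be seen on a single row. The sketch below uses an illustrative one-row frame (the variable names are assumptions) and compares the denominator from count with the one obtained by comparing against the string 'Missing'.

```python
import pandas as pd

# one respondent: two favourable answers and one item recoded to 'Missing'
row = pd.DataFrame({'Coach_q1': ['Favourable'],
                    'Coach_q2': ['Favourable'],
                    'Coach_q8': ['Missing']})

fav = row.eq('Favourable').sum(axis=1)

# count() only skips real NaN, so the string 'Missing' is counted as an answer
n_count = row.count(axis=1)

# comparing against 'Missing' counts only the answered items
n_answered = row.ne('Missing').sum(axis=1)

print(fav / n_count)     # 2 / 3
print(fav / n_answered)  # 2 / 2
```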

So if using Missing is necessary, change count to sum over a comparison for not equal to Missing:

#create a list of all the rating columns
ratingcollist = ['Coach_','Diversity', 'Leadership', 'Engagement']


#create a for loop to get all the columns that match the column list keyword
for rat in ratingcollist:
    cols = df.filter(like=rat).columns
    mask = (df[cols] != 'Missing').all(axis=1)
    
    #create 3 new columns for each factor, one for count of Favourable responses,
    #one for percentage of Favourable responses, and one for Factor Level percentage of Favourable responses

    if len(cols) > 0:
        df[f'{rat.lower()}fav_count'] = (df[cols] == 'Favourable').sum(axis=1)
        df[f'{rat.lower()}fav_perc'] = df[f'{rat.lower()}fav_count'] / df[cols].ne('Missing').sum(axis=1)
        df.loc[mask,f'{rat.lower()}agg_perc'] = df.loc[mask, f'{rat.lower()}fav_count'] / len(cols)

We can try it without a loop:

columns_split = df.columns.str.split('_')
count = (df.set_axis(pd.MultiIndex.from_tuples(map(tuple, columns_split)), axis=1)
           .stack()
           .eq('Favourable')
           # sum(level=0) was removed in pandas 2.0; group by the row level instead
           .groupby(level=0)
           .sum())

s = columns_split.str[0].to_series().add('_%Fav')

new_df = (df.join(count.add_suffix('_FavCount'))
           .join(count.add_suffix('_%Fav').div(s.value_counts()))
         )

print(new_df)

Output:

  Coaching_q1 Coaching_q2 Diversity_q1 Diversity_q2  Coaching_FavCount  \
0  Favourable     Neutral   Favourable   Favourable                1.0   
1  Favourable  Favourable   Favourable    Favourble                2.0   
2         NaN  Favourable          NaN          NaN                1.0   

   Diversity_FavCount  Coaching_%Fav  Diversity_%Fav  
0                 2.0            0.5             1.0  
1                 1.0            1.0             0.5  
2                 0.0            0.5             0.0
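The loop-free idea can also be extended to a factor-level percentage that is NaN whenever any item of the factor is missing. The sketch below is one possible way to do it, not the answerer's code: it assumes every column is named `<factor>_<item>`, that missing answers are real NaN values, and the helper `factor_of` and the suffix `_AggPerc` are made up for illustration. It groups the transposed frame by a function of the column name instead of using `sum(level=0)`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Coaching_q1': ['Favourable', 'Favourable', np.nan],
    'Coaching_q2': ['Neutral', 'Favourable', 'Favourable'],
    'Diversity_q1': ['Favourable', 'Favourable', np.nan],
    'Diversity_q2': ['Favourable', 'Favourble', np.nan],  # typo kept from the sample data
})

def factor_of(col):
    # 'Coaching_q1' -> 'Coaching'
    return col.split('_')[0]

is_fav = df.eq('Favourable')
answered = df.notna()

# transpose so the column names become the index, then group them by factor
fav_count = is_fav.T.groupby(factor_of).sum().T
n_answered = answered.T.groupby(factor_of).sum().T
n_items = answered.T.groupby(factor_of).size()

# factor-level percentage, kept only where every item was answered
agg_perc = fav_count.div(n_items).where(n_answered.eq(n_items))

new_df = (df.join(fav_count.add_suffix('_FavCount'))
            .join(agg_perc.add_suffix('_AggPerc')))
print(new_df)
```

Transposing before grouping avoids grouping along the column axis, at the cost of two extra transposes on the boolean frames.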

The problem has been solved by recoding the NaN values in the columns as 'Missing' and applying the mask suggested by @jezrael:

#create a list of all the rating columns
ratingcollist = ['Coaching_','Diversity', 'Leadership', 'Engagement']


#create a for loop to get all the columns that match the column list keyword
for rat in ratingcollist:
    cols = df.filter(like=rat).columns
    mask = (df[cols] != 'Missing').all(axis=1)
    
    #create 3 new columns for each factor, one for count of Favourable responses,
    #one for percentage of Favourable responses, and one for Factor Level percentage of Favourable responses

    if len(cols) > 0:
        df[f'{rat.lower()}fav_count'] = (df[cols] == 'Favourable').sum(axis=1)
        #count() would also count the string 'Missing', so compare against it instead
        df[f'{rat.lower()}fav_perc'] = df[f'{rat.lower()}fav_count'] / df[cols].ne('Missing').sum(axis=1)
        df.loc[mask,f'{rat.lower()}agg_perc'] = df.loc[mask, f'{rat.lower()}fav_count'] / len(cols)

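No complications :) In my opinion it is simple and intuitive, so upvoted.

@jezrael If I want to create an aggregate column with the suffix _agg_perc that calculates the percentage of favourable responses only when there are no missing values (for example, the agg_perc columns of both factors would be zero in row 2), how would I do that?

@jezrael yup, done.

@wjie08 - the answer was edited to work correctly with NaNs and Missing.

I can't seem to get the result. ValueError: cannot join with no overlapping index names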

print (df)
     Coach_q1    Coach_q2      Coach_q8  coach_fav_count  coach_fav_perc  \
0  Favourable     Neutral    Favourable                2        0.666667   
1  Favourable  Favourable       Missing                2        1.000000   
2  Favourable  Favourable  Unfavourable                2        0.666667   
3     Missing     Missing  Unfavourable                0        0.000000   

   coach_agg_perc  
0        0.666667  
1             NaN  
2        0.666667  
3             NaN
