Python: creating aggregate columns based on a list of headers
I have a dataframe containing survey data. It has several columns, including demographics (age, department, etc.) and columns with ratings. I would like to add some columns to the dataframe based on calculations over the rating columns.
The new columns should provide a) the count of favourable responses, b) the percentage of favourable responses (number of favourable responses / number of items in that factor; in the table below, unanswered items are left out of the denominator), and c) the factor-level percentage of favourable responses (NaN if any item belonging to that factor is NaN).
The table below shows an example of how this applies to the Coaching factor. I would like to generalise this to other factors such as Diversity, Leadership and Engagement.
Coach_q1 Coach_q2 Coach_q8 coach_favcount coach_fav_perc coach_agg_perc
Favourable Neutral Favourable 2 66.6% 66.6%
Favourable Favourable NaN 2 100% NaN
Favourable Favourable Unfavourable 2 66.6% 66.6%
NaN NaN Unfavourable 0 0% NaN
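I have used the code below and it works; however, I can only get the fav_count and fav_perc columns for Coaching. I would like to a) get the _agg_perc column and b) apply this to all the other factors.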
# Get the Coaching columns
coaching_agg = df.loc[:, df.columns.str.contains('Coaching_')]
# Create a column to store the number of favourable responses
df['coaching_fav_count'] = (coaching_agg == 'Favourable').sum(axis=1)
# Create a column to store the percentage of favourable responses
df['coaching_fav_perc'] = df['coaching_fav_count'] / len(coaching_agg.columns)
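My guess at the logic behind a for loop is: a) create a list of rating columns (see the code below), b) create a function that computes the count and the percentage of favourable responses and checks for NaN at the item level, and c) create a for loop that applies the function to the rating columns.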
#Create a list made up of rating cols
ratingcollist = ['Coaching_','Communication_','Development_','Diversity_','Engagement_']
ratingcols = df.loc[:, df.columns.str.contains('|'.join(ratingcollist))]
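As a side note on step a), str.contains with a joined pattern matches the substring anywhere in the column name and returns one combined selection. A minimal sketch of grouping columns per factor by prefix instead is shown here; the sample frame and the names prefixes and factor_cols are made up for illustration and are not part of the original question.

```python
import pandas as pd

# Hypothetical frame: one demographic column plus two rating factors.
df = pd.DataFrame({
    "Age": [31, 45],
    "Coaching_q1": ["Favourable", "Neutral"],
    "Coaching_q2": ["Favourable", "Favourable"],
    "Diversity_q1": ["Unfavourable", "Favourable"],
})

prefixes = ["Coaching_", "Communication_", "Development_", "Diversity_", "Engagement_"]

# Map each prefix to the columns that start with it.
factor_cols = {p: [c for c in df.columns if c.startswith(p)] for p in prefixes}
# Drop prefixes that match no column in this frame.
factor_cols = {p: cols for p, cols in factor_cols.items() if cols}

print(factor_cols)
```

Keeping the columns grouped per factor makes it straightforward to compute the three new columns one factor at a time.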
I would appreciate any help I can get, thank you.
I believe you need to process each value of the list separately:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Coach_q1': ['Favourable', 'Favourable', 'Favourable', 'nan'],
                   'Coach_q2': ['Neutral', 'Favourable', 'Favourable', 'NaN'],
                   'Coach_q8': ['Favourable', 'nan', 'Unfavourable', 'Unfavourable']})
print (df)
Coach_q1 Coach_q2 Coach_q8
0 Favourable Neutral Favourable
1 Favourable Favourable nan
2 Favourable Favourable Unfavourable
3 nan NaN Unfavourable
#replace nan and NaN strings to missing values
df = df.replace(['nan','NaN'], np.nan)
ratingcollist = ['Coach_','Communication_','Development_','Diversity_','Engagement_']
for rat in ratingcollist:
    # filter columns by substring
    cols = df.filter(like=rat).columns
    # mask for rows with no missing values in this factor
    mask = df[cols].notna().all(axis=1)
    # create the new columns only if the factor matched some columns
    if len(cols) > 0:
        df[f'{rat.lower()}fav_count'] = (df[cols] == 'Favourable').sum(axis=1)
        df[f'{rat.lower()}fav_perc'] = df[f'{rat.lower()}fav_count'] / df[cols].count(axis=1)
        df.loc[mask, f'{rat.lower()}agg_perc'] = df.loc[mask, f'{rat.lower()}fav_count'] / len(cols)
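The same logic can also be wrapped in a function, which is closer to the function-plus-loop structure the question describes. This is only a sketch: add_factor_columns is a hypothetical helper name, it matches columns by prefix rather than by substring, and it assumes missing answers are real NaN values.

```python
import numpy as np
import pandas as pd

def add_factor_columns(df, prefix, favourable="Favourable"):
    """Hypothetical helper: add <prefix>fav_count, fav_perc and agg_perc columns."""
    cols = [c for c in df.columns if c.startswith(prefix)]
    if not cols:
        return df  # no columns for this factor, nothing to add
    items = df[cols]
    name = prefix.lower()
    fav_count = items.eq(favourable).sum(axis=1)  # favourable answers per row
    answered = items.notna().sum(axis=1)          # non-missing answers per row
    df[f"{name}fav_count"] = fav_count
    df[f"{name}fav_perc"] = fav_count / answered
    # Factor-level percentage: NaN unless every item in the factor was answered.
    df[f"{name}agg_perc"] = (fav_count / len(cols)).where(answered.eq(len(cols)))
    return df

df = pd.DataFrame({
    "Coach_q1": ["Favourable", "Favourable", "Favourable", np.nan],
    "Coach_q2": ["Neutral", "Favourable", "Favourable", np.nan],
    "Coach_q8": ["Favourable", np.nan, "Unfavourable", "Unfavourable"],
})

for prefix in ["Coach_", "Diversity_"]:
    df = add_factor_columns(df, prefix)

print(df)
```

With this sample data only the Coach_ columns exist, so the Diversity_ prefix is skipped without creating empty columns.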
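If the NaNs are replaced with the string Missing, the fav_perc output is wrong: the second value should be 1, because count does not exclude the Missing values: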
df = pd.DataFrame({'Coach_q1': ['Favourable', 'Favourable', 'Favourable', 'nan'],
'Coach_q2': ['Neutral', 'Favourable', 'Favourable', 'NaN'],
'Coach_q8': ['Favourable', 'nan', 'Unfavourable', 'Unfavourable']})
print (df)
Coach_q1 Coach_q2 Coach_q8
0 Favourable Neutral Favourable
1 Favourable Favourable nan
2 Favourable Favourable Unfavourable
3 nan NaN Unfavourable
df = df.replace(['nan','NaN'], 'Missing')
print (df)
Coach_q1 Coach_q2 Coach_q8
0 Favourable Neutral Favourable
1 Favourable Favourable Missing
2 Favourable Favourable Unfavourable
3 Missing Missing Unfavourable
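To make the difference concrete, here is a small sketch using a single row shaped like row 1 above. The variable names are chosen for illustration only.

```python
import pandas as pd

# One row like row 1 above: two favourable answers and one "Missing" placeholder.
row = pd.DataFrame({
    "Coach_q1": ["Favourable"],
    "Coach_q2": ["Favourable"],
    "Coach_q8": ["Missing"],
})

fav = row.eq("Favourable").sum(axis=1)    # favourable answers
non_null = row.count(axis=1)              # every non-null cell, "Missing" included
answered = row.ne("Missing").sum(axis=1)  # only the real answers

print(non_null.iloc[0], answered.iloc[0])  # 3 2
print(fav.iloc[0] / non_null.iloc[0])      # 2/3, the wrong percentage
print(fav.iloc[0] / answered.iloc[0])      # 1.0, the expected percentage
```

Because "Missing" is an ordinary string, count treats it like any other answer, which is why the denominator has to be computed by comparison instead.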
So if using Missing is necessary, change count to sum and compare for not equal to Missing:
# create a list of all the rating column prefixes
# (the trailing underscore keeps the new names readable, e.g. diversity_fav_count)
ratingcollist = ['Coach_', 'Diversity_', 'Leadership_', 'Engagement_']
# loop over the prefixes and pick the columns that match each one
for rat in ratingcollist:
    cols = df.filter(like=rat).columns
    # rows where no item of this factor is 'Missing'
    mask = (df[cols] != 'Missing').all(axis=1)
    # create 3 new columns for each factor: count of Favourable responses,
    # percentage of Favourable responses, and factor-level percentage of Favourable responses
    if len(cols) > 0:
        df[f'{rat.lower()}fav_count'] = (df[cols] == 'Favourable').sum(axis=1)
        df[f'{rat.lower()}fav_perc'] = df[f'{rat.lower()}fav_count'] / df[cols].ne('Missing').sum(axis=1)
        df.loc[mask, f'{rat.lower()}agg_perc'] = df.loc[mask, f'{rat.lower()}fav_count'] / len(cols)
We can try it without a loop:
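# assumes every column is named <Factor>_<item>, e.g. Coaching_q1
columns_split = df.columns.str.split('_')
# sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the replacement
count = (df.set_axis(pd.MultiIndex.from_tuples(map(tuple, columns_split)), axis=1)
           .stack()
           .eq('Favourable')
           .groupby(level=0)
           .sum())
# number of items per factor, labelled to align with the '_%Fav' columns
s = columns_split.str[0].to_series().add('_%Fav')
new_df = (df.join(count.add_suffix('_FavCount'))
            .join(count.add_suffix('_%Fav').div(s.value_counts())))
print(new_df)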
Output:
Coaching_q1 Coaching_q2 Diversity_q1 Diversity_q2 Coaching_FavCount \
0 Favourable Neutral Favourable Favourable 1.0
1 Favourable Favourable Favourable Favourble 2.0
2 NaN Favourable NaN NaN 1.0
Diversity_FavCount Coaching_%Fav Diversity_%Fav
0 2.0 0.5 1.0
1 1.0 1.0 0.5
2 0.0 0.5 0.0
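For readers who prefer to avoid the MultiIndex and stack step, a loop-free variant can be sketched by grouping the transposed boolean frame by factor name. This is a different technique from the answer above, not a restatement of it; the sample frame copies the values in the output shown (including the misspelt "Favourble" cell, which is why that row counts only one favourable Diversity answer), and the names factor, is_fav and n_items are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data copied from the output above (the misspelt cell is intentional).
df = pd.DataFrame({
    "Coaching_q1": ["Favourable", "Favourable", np.nan],
    "Coaching_q2": ["Neutral", "Favourable", "Favourable"],
    "Diversity_q1": ["Favourable", "Favourable", np.nan],
    "Diversity_q2": ["Favourable", "Favourble", np.nan],
})

# Factor name for each column, assuming the <Factor>_<item> naming scheme.
factor = np.asarray(df.columns.str.split("_").str[0])

# True where the answer is exactly 'Favourable'; NaN compares as False.
is_fav = df.eq("Favourable")

# Transpose so columns become rows, sum per factor, transpose back.
count = is_fav.T.groupby(factor).sum().T

# Number of items per factor, used as the denominator.
n_items = pd.Series(factor).value_counts()

new_df = (df.join(count.add_suffix("_FavCount"))
            .join(count.div(n_items).add_suffix("_%Fav")))
print(new_df)
```

As in the answer above, the denominator here is the total number of items per factor, so unanswered items lower the percentage rather than being excluded.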
The problem was solved by recoding the NaN values in the columns as "Missing" and applying the mask suggested by @jezrael:
# create a list of all the rating column prefixes
ratingcollist = ['Coaching_', 'Diversity_', 'Leadership_', 'Engagement_']
# loop over the prefixes and pick the columns that match each one
for rat in ratingcollist:
    cols = df.filter(like=rat).columns
    # rows where no item of this factor is 'Missing'
    mask = (df[cols] != 'Missing').all(axis=1)
    # create 3 new columns for each factor: count of Favourable responses,
    # percentage of Favourable responses, and factor-level percentage of Favourable responses
    if len(cols) > 0:
        df[f'{rat.lower()}fav_count'] = (df[cols] == 'Favourable').sum(axis=1)
        # count(axis=1) would also count the 'Missing' strings, so compare instead
        df[f'{rat.lower()}fav_perc'] = df[f'{rat.lower()}fav_count'] / df[cols].ne('Missing').sum(axis=1)
        df.loc[mask, f'{rat.lower()}agg_perc'] = df.loc[mask, f'{rat.lower()}fav_count'] / len(cols)
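Comments:
No complications :) In my opinion it is simple and intuitive, so upvoted.
@jezrael If I want to create aggregate columns with the suffix _agg_perc that compute the percentage of favourable responses only when there are no missing values (e.g. in row 2 the agg_perc columns for both factors would be zero), how would I do that?
@jezrael yup, done.
@wjie08 - the answer was edited to work correctly with NaNs and Missing.
I can't seem to get the result. ValueError: cannot join with no overlapping index names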
Output of the loop that compares against 'Missing':
print (df)
Coach_q1 Coach_q2 Coach_q8 coach_fav_count coach_fav_perc \
0 Favourable Neutral Favourable 2 0.666667
1 Favourable Favourable Missing 2 1.000000
2 Favourable Favourable Unfavourable 2 0.666667
3 Missing Missing Unfavourable 0 0.000000
coach_agg_perc
0 0.666667
1 NaN
2 0.666667
3 NaN