Python: excluding an item from value_counts across multiple columns
Tags: python, pandas
I have the following DataFrame:
ae264e3637204a6fb9bb56bc8210ddfd ... 2906b810c7d4411798c6938adc9daaa5
1 not received ... not received
3 completed ... not received
5 not received ... viewed
8 not received ... completed
12 not received ... not received
... ... ...
16995 not received ... not received
16996 not received ... not received
16997 not received ... not received
16998 completed ... not received
16999 not received ... not received
I apply the value_counts() method to get the percentage of each value; there are 10 columns in total.
This is how I do it:
overall = profile[relevant_columns].apply(lambda x: round(pd.Series.value_counts(x) / len(x), 4)* 100)
overall
Result:
ae264e3637204a6fb9bb56bc8210ddfd ... 2906b810c7d4411798c6938adc9daaa5
completed 21.22 ... 22.82
not received 62.47 ... 63.04
unresponsive 1.59 ... 9.29
viewed 14.73 ... 4.86
Expected output:
ae264e3637204a6fb9bb56bc8210ddfd ... 2906b810c7d4411798c6938adc9daaa5
completed 56.52 ... 61.82
unresponsive 4.23 ... 25.12
viewed 39.23 ... 13.14
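However, I do not want the 'not received' percentage in the result. I know I could remove that value from each column in a loop and then apply value_counts() to that column, but I would prefer to keep the apply workflow over multiple columns in a single line. Does anyone know how to do this?
IIUC, you can stack the relevant columns, filter out the unwanted rows with loc, group on level=1 (the column names), and call value_counts with normalize=True, which returns proportions; then round them and multiply by 100: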
overall = (profile[relevant_columns].stack().
loc[lambda x: x!='not received'].
groupby(level=1).value_counts(normalize=True).round(4).mul(100).unstack(0))
An example based on your input, with the corresponding output:
print(df,'\n') #df is profile[relevant_columns]
print(df.stack().loc[lambda x: x!='not received']
.groupby(level=1).value_counts(normalize=True).round(4).mul(100).unstack(0))
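To make the stack-based approach reproducible, here is a minimal sketch. It assumes a hypothetical two-column, five-row sample that mirrors the first rows shown in the question; the name `df` and the intermediate variables `stacked` and `kept` are introduced only for illustration.

```python
import pandas as pd

# Hypothetical sample mirroring the first five rows shown in the question.
col_a = "ae264e3637204a6fb9bb56bc8210ddfd"
col_b = "2906b810c7d4411798c6938adc9daaa5"
df = pd.DataFrame(
    {
        col_a: ["not received", "completed", "not received", "not received", "not received"],
        col_b: ["not received", "not received", "viewed", "completed", "not received"],
    },
    index=[1, 3, 5, 8, 12],
)

# stack() yields one long Series indexed by (row label, column name).
stacked = df.stack()

# Drop the value that should not count towards the percentages.
kept = stacked.loc[lambda x: x != "not received"]

# Group by the column-name level, compute proportions, convert to percent,
# and move the column names back to the column axis.
overall = (
    kept.groupby(level=1)
    .value_counts(normalize=True)
    .round(4)
    .mul(100)
    .unstack(0)
)
print(overall)
```

Because the filter runs before `value_counts(normalize=True)`, the denominators are the counts of the remaining values in each column, not the full column length.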
Side note: to preserve the exact column order, use reindex at the end:
overall = (profile[relevant_columns].stack().loc[lambda x: x!='not received'].
groupby(level=1).value_counts(normalize=True)
.round(4).mul(100).unstack(0).reindex(columns=df.columns))
ae264e3637204a6fb9bb56bc8210ddfd 2906b810c7d4411798c6938adc9daaa5
completed 100.0 50.0
viewed NaN 50.0
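The reason reindex is needed can be sketched as follows: grouping on the column-name level usually returns the group keys in sorted order, so the columns of the result may not match the order in the original frame. The sample data and the names `df`, `unordered`, and `ordered` are assumptions made for this illustration.

```python
import pandas as pd

# Hypothetical sample: the original column order is col_a first, then col_b.
col_a = "ae264e3637204a6fb9bb56bc8210ddfd"
col_b = "2906b810c7d4411798c6938adc9daaa5"
df = pd.DataFrame(
    {
        col_a: ["not received", "completed", "not received"],
        col_b: ["viewed", "completed", "not received"],
    }
)

unordered = (
    df.stack()
    .loc[lambda x: x != "not received"]
    .groupby(level=1)
    .value_counts(normalize=True)
    .round(4)
    .mul(100)
    .unstack(0)
)

# Restore the column order of the original frame.
ordered = unordered.reindex(columns=df.columns)

print(list(unordered.columns))
print(list(ordered.columns))
```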
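One approach is to loop over the columns. Yes, this iterates over the columns, but it also avoids the lambda x. After appending each new Series to a list, simply concat the list of Series together: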
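s = []
# Iterate over the same columns that are later passed as keys to pd.concat
for col in relevant_columns:
    # Keep rows other than 'not received', then take the percentage of each remaining value
    s.append(round(profile.loc[profile[col] != 'not received', col]
                   .value_counts(normalize=True) * 100, 4))
df = pd.concat(s, axis=1, keys=relevant_columns)
df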
Out[1]:
ae264e3637204a6fb9bb56bc8210ddfd 2906b810c7d4411798c6938adc9daaa5
completed 100.0 50.0
viewed NaN 50.0
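A self-contained sketch of the loop-based approach is given below. It assumes a hypothetical `profile` frame with one extra, unrelated column (`age`) to show why the loop should run over `relevant_columns` rather than over every column; all data values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: two status columns plus an unrelated 'age' column.
col_a = "ae264e3637204a6fb9bb56bc8210ddfd"
col_b = "2906b810c7d4411798c6938adc9daaa5"
profile = pd.DataFrame(
    {
        "age": [30, 41, 25, 52, 38],
        col_a: ["not received", "completed", "not received", "not received", "not received"],
        col_b: ["not received", "not received", "viewed", "completed", "not received"],
    }
)
relevant_columns = [col_a, col_b]

parts = []
for col in relevant_columns:
    # Select the column as a Series, excluding the unwanted value.
    kept = profile.loc[profile[col] != "not received", col]
    # Percentages are relative to the remaining rows of this column only.
    parts.append((kept.value_counts(normalize=True) * 100).round(4))

# Align the per-column results side by side; missing values become NaN.
result = pd.concat(parts, axis=1, keys=relevant_columns)
print(result)
```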
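Let us mask the 'not received' values in the relevant columns, then apply pd.value_counts with normalize=True to compute the proportion of each unique value per column: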
profile[relevant_columns].mask(lambda x: x.eq('not received'))\
.apply(pd.value_counts, normalize=True).mul(100).round(4)
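The masking idea can be sketched on a small sample as shown below. The data is hypothetical, and the sketch passes `pd.Series.value_counts` to apply instead of the top-level `pd.value_counts`, on the assumption that the top-level function is deprecated in recent pandas releases; the two should behave the same here.

```python
import pandas as pd

# Hypothetical sample mirroring the first five rows shown in the question.
col_a = "ae264e3637204a6fb9bb56bc8210ddfd"
col_b = "2906b810c7d4411798c6938adc9daaa5"
df = pd.DataFrame(
    {
        col_a: ["not received", "completed", "not received", "not received", "not received"],
        col_b: ["not received", "not received", "viewed", "completed", "not received"],
    }
)

# mask() replaces matching cells with NaN; value_counts() skips NaN by default,
# so normalize=True divides by the number of remaining values in each column.
result = (
    df.mask(df.eq("not received"))
    .apply(pd.Series.value_counts, normalize=True)
    .mul(100)
    .round(4)
)
print(result)
```

This keeps the apply workflow the question asked for: the exclusion happens once on the whole frame, and each column is then counted independently.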
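Please provide a minimal, complete DataFrame together with your expected output. Although you have received some answers, your question is incomplete. I think this would help, because you shared the first 5 and last 5 rows of data, but your expected output is based on the entire DataFrame. That is why it is best to make the output reproducible from the input data in the question. I hope this helps.
Thanks, this is my favorite solution, and thanks to everyone for the effort. I learned a lot from this solution.
All of the solutions are good. If you find them useful, please consider the other solutions as well.
I learned something too.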
profile[relevant_columns].mask(lambda x: x.eq('not received'))\
.apply(pd.value_counts, normalize=True).mul(100).round(4)
ae264e3637204a6fb9bb56bc8210ddfd 2906b810c7d4411798c6938adc9daaa5
completed 100.0 50.0
viewed NaN 50.0