Python:groupy by查找df的行和列中出现的百分比
我在下面有一个数据框,在Python:groupy by查找df的行和列中出现的百分比,python,group-by,Python,Group By,我在下面有一个数据框,在entryName列中有多个人。基于此列(假设groupby),我想检查其他行,查看C1、C2等中的人员是否在条目名人员其他条目中再次使用。因此,Player1有两个条目,他在两个条目中都使用了Zach Parise。对于Player1,Zachs的使用率为100%Player2有3个条目,他在3个条目中的1个条目中使用了Patrick Kane。对于Player2,Patricks的使用率为33%。上面我假设这将使用某种groupby,但我不确定它将如何检查其余部分。任
entryName
列中有多个人。基于此列(假设groupby
),我想检查其他行
,查看C1、C2等
中的人员是否在条目名
人员其他条目中再次使用。因此,Player1
有两个条目,他在两个条目中都使用了Zach Parise。对于Player1
,Zachs的使用率为100%Player2
有3个条目,他在3个条目中的1个条目中使用了Patrick Kane。对于Player2
,Patricks的使用率为33%。上面我假设这将使用某种groupby
,但我不确定它将如何检查其余部分。任何提示都会非常有用
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.array([['Player1','Mark Scheifele','Pierre-Luc Dubois','Mats Zuccarello','Oliver Bjorkstrand','Nick Foligno','Ryan Suter','Seth Jones','Devan Dubnyk','Zach Parise'],['Player2','Kyle Connor','Pierre-Luc Dubois','Mats Zuccarello','Oliver Bjorkstrand','Nick Foligno','Ryan Suter','Seth Jones','Devan Dubnyk','Boone Jenner'],['Player2','Patrick Kane','Joseph Walter','Luke Kunin','Oliver Bjorkstrand','Mark Foligno','Ryan Suter','Matt Dumba','Alex Stalock','Eric Staal'],['Player2','Boone Jenner','Pierre-Luc Dubois','Mats Zuccarello','Oliver Bjorkstrand','Nick Foligno','Ryan Suter','Seth Jones','Devan Dubnyk','Kyle Connor'],['Player1','Patrick Kane','David Backes','Zach Parise','Oliver Bjorkstrand','Mark Foligno','Jonas Brodin','Matt Dumba','Alex Stalock','Eric Staal']]), columns=['entryName','C1','C2','W1','W2','W3','D1','D2','G','UTIL'])
df
entryName C1 ... G UTIL
0 Player1 Mark Scheifele ... Devan Dubnyk Zach Parise
1 Player2 Kyle Connor ... Devan Dubnyk Boone Jenner
2 Player2 Patrick Kane ... Alex Stalock Eric Staal
3 Player2 Boone Jenner ... Devan Dubnyk Kyle Connor
4 Player1 Patrick Kane ... Alex Stalock Eric Staal
[5 rows x 10 columns]
df_result = pd.DataFrame(data=np.array([['Player1','Mark Scheifele','50.00','Pierre-Luc Dubois','50.00','Mats Zuccarello','50.00','Oliver Bjorkstrand','100.00','Nick Foligno','50.00','Ryan Suter','50.00','Seth Jones','50.00','Devan Dubnyk','50.00','Zach Parise','100.00'],['Player2','Kyle Connor','66.66','Pierre-Luc Dubois','66.66','Mats Zuccarello','66.66','Oliver Bjorkstrand','100.00','Nick Foligno','66.66','Ryan Suter','100.00','Seth Jones','66.66','Devan Dubnyk','66.66','Boone Jenner','66.66'],['Player2','Patrick Kane','33.33','Joseph Walter','33.33','Luke Kunin','33.33','Oliver Bjorkstrand','100.00','Mark Foligno','33.33','Ryan Suter','100.00','Matt Dumba','33.33','Alex Stalock','33.33','Eric Staal','33.33'],['Player2','Boone Jenner','66.66','Pierre-Luc Dubois','66.66','Mats Zuccarello','66.66','Oliver Bjorkstrand','100.00','Nick Foligno','66.66','Ryan Suter','100.00','Seth Jones','66.66','Devan Dubnyk','66.66','Kyle Connor','66.66'],['Player1','Patrick Kane','50.00','David Backes','50.00','Zach Parise','100.00','Oliver Bjorkstrand','100.00','Mark Foligno','50.00','Jonas Brodin','50.00','Matt Dumba','50.00','Alex Stalock','50.00','Eric Staal','50.00']]), columns=['entryName','C1','C1_Spr','C2','C2_Spr','W1','W1_Spr','W2','W2_Spr','W3','W3_Spr','D1','D1_Spr','D2','D2_Spr','G','G_Spr','UTIL','UTIL_Spr'])
df_result
entryName C1 C1_Spr ... G_Spr UTIL UTIL_Spr
0 Player1 Mark Scheifele 50.00 ... 50.00 Zach Parise 100.00
1 Player2 Kyle Connor 66.66 ... 66.66 Boone Jenner 66.66
2 Player2 Patrick Kane 33.33 ... 33.33 Eric Staal 33.33
3 Player2 Boone Jenner 66.66 ... 66.66 Kyle Connor 66.66
4 Player1 Patrick Kane 50.00 ... 50.00 Eric Staal 50.00
[5 rows x 19 columns]
您可以尝试使用
value\u counts()
我不确定是否有更快的方法来处理所有列,但您可以对它们进行循环
result_df = pd.DataFrame()
for col in name_columns:
result_df['{}_Spr'.format(col)] = df.groupby(['entryName'])[col].value_counts() / df.groupby(['entryName'])[col].count() * 100
您可以使用以下解决方案:
df_melt = df.melt(id_vars='entryName')
df_cross = pd.crosstab(df_melt['entryName'], df_melt['value'])
df_pct = df_cross / df.groupby('entryName').size().values[:, None] * 100
print(df_pct.T.sort_index())
此表很好地总结了所有数据:
entryName Player1 Player2
value
Alex Stalock 50.0 33.333333
Boone Jenner 0.0 66.666667
David Backes 50.0 0.000000
Devan Dubnyk 50.0 66.666667
Eric Staal 50.0 33.333333
Jonas Brodin 50.0 0.000000
Joseph Walter 0.0 33.333333
Kyle Connor 0.0 66.666667
Luke Kunin 0.0 33.333333
Mark Foligno 50.0 33.333333
Mark Scheifele 50.0 0.000000
Mats Zuccarello 50.0 66.666667
Matt Dumba 50.0 33.333333
Nick Foligno 50.0 66.666667
Oliver Bjorkstrand 100.0 100.000000
Patrick Kane 50.0 33.333333
Pierre-Luc Dubois 50.0 66.666667
Ryan Suter 50.0 100.000000
Seth Jones 50.0 66.666667
Zach Parise 100.0 0.000000
要达到预期效果,您可以执行以下操作:
# append percents to main dataframe
df_stack = df_pct.stack().round(2)
for col in df.columns[1:]:
df = df.merge(df_stack.to_frame(f'{col}_Spr'), left_on=['entryName', col], right_index=True)
# sort columns in desired order
cols = [df.columns[0]]
for col in df.columns[1:10]:
cols.append(col)
cols.append(f'{col}_Spr')
df = df.reindex(cols, axis=1).sort_index()
输出:
entryName C1 C1_Spr C2 C2_Spr \
0 Player1 Mark Scheifele 50.00 Pierre-Luc Dubois 50.00
1 Player2 Kyle Connor 66.67 Pierre-Luc Dubois 66.67
2 Player2 Patrick Kane 33.33 Joseph Walter 33.33
3 Player2 Boone Jenner 66.67 Pierre-Luc Dubois 66.67
4 Player1 Patrick Kane 50.00 David Backes 50.00
W1 W1_Spr W2 W2_Spr W3 W3_Spr \
0 Mats Zuccarello 50.00 Oliver Bjorkstrand 100.0 Nick Foligno 50.00
1 Mats Zuccarello 66.67 Oliver Bjorkstrand 100.0 Nick Foligno 66.67
2 Luke Kunin 33.33 Oliver Bjorkstrand 100.0 Mark Foligno 33.33
3 Mats Zuccarello 66.67 Oliver Bjorkstrand 100.0 Nick Foligno 66.67
4 Zach Parise 100.00 Oliver Bjorkstrand 100.0 Mark Foligno 50.00
D1 D1_Spr D2 D2_Spr G G_Spr \
0 Ryan Suter 50.0 Seth Jones 50.00 Devan Dubnyk 50.00
1 Ryan Suter 100.0 Seth Jones 66.67 Devan Dubnyk 66.67
2 Ryan Suter 100.0 Matt Dumba 33.33 Alex Stalock 33.33
3 Ryan Suter 100.0 Seth Jones 66.67 Devan Dubnyk 66.67
4 Jonas Brodin 50.0 Matt Dumba 50.00 Alex Stalock 50.00
UTIL UTIL_Spr
0 Zach Parise 100.00
1 Boone Jenner 66.67
2 Eric Staal 33.33
3 Kyle Connor 66.67
4 Eric Staal 50.00
干得好,Mykola!非常感谢你的帮助!
entryName C1 C1_Spr C2 C2_Spr \
0 Player1 Mark Scheifele 50.00 Pierre-Luc Dubois 50.00
1 Player2 Kyle Connor 66.67 Pierre-Luc Dubois 66.67
2 Player2 Patrick Kane 33.33 Joseph Walter 33.33
3 Player2 Boone Jenner 66.67 Pierre-Luc Dubois 66.67
4 Player1 Patrick Kane 50.00 David Backes 50.00
W1 W1_Spr W2 W2_Spr W3 W3_Spr \
0 Mats Zuccarello 50.00 Oliver Bjorkstrand 100.0 Nick Foligno 50.00
1 Mats Zuccarello 66.67 Oliver Bjorkstrand 100.0 Nick Foligno 66.67
2 Luke Kunin 33.33 Oliver Bjorkstrand 100.0 Mark Foligno 33.33
3 Mats Zuccarello 66.67 Oliver Bjorkstrand 100.0 Nick Foligno 66.67
4 Zach Parise 100.00 Oliver Bjorkstrand 100.0 Mark Foligno 50.00
D1 D1_Spr D2 D2_Spr G G_Spr \
0 Ryan Suter 50.0 Seth Jones 50.00 Devan Dubnyk 50.00
1 Ryan Suter 100.0 Seth Jones 66.67 Devan Dubnyk 66.67
2 Ryan Suter 100.0 Matt Dumba 33.33 Alex Stalock 33.33
3 Ryan Suter 100.0 Seth Jones 66.67 Devan Dubnyk 66.67
4 Jonas Brodin 50.0 Matt Dumba 50.00 Alex Stalock 50.00
UTIL UTIL_Spr
0 Zach Parise 100.00
1 Boone Jenner 66.67
2 Eric Staal 33.33
3 Kyle Connor 66.67
4 Eric Staal 50.00