Pandas: create new columns containing counts of distinct entries
I am learning pandas and have downloaded a dataset of all Olympic medal results up to 2008. It has the following form:
In[138]: medals.head()
Out[138]:
      City  Edition     Sport  Discipline                      Athlete  NOC  \
9792  Rome     1960  Aquatics      Diving           PHELPS, Brian Eric  GBR
9793  Rome     1960  Aquatics      Diving        WEBSTER, Robert David  USA
9794  Rome     1960  Aquatics      Diving         TOBIAN, Gary Milburn  USA
9795  Rome     1960  Aquatics      Diving               KRUTOVA, Ninel  URS
9796  Rome     1960  Aquatics      Diving  KRÄMER-ENGEL-GULBIN, Ingrid  EUA

      Gender         Event  Event_gender   Medal
9792     Men  10m platform             M  Bronze
9793     Men  10m platform             M    Gold
9794     Men  10m platform             M  Silver
9795   Women  10m platform             W  Bronze
9796   Women  10m platform             W    Gold
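What I want to do first is convert this into a DataFrame with the columns Edition, NOC, Bronze, Silver, Gold, where Bronze, Silver and Gold are the total number of medals of each type that the NOC (National Olympic Committee) won at that edition of the Games.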
So far I have:
"""
Analyze historical Olympic performance
"""
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib
matplotlib.style.use('ggplot')

# Load the ISO country codes and delete the columns that are not needed
isocodes = pd.read_csv('countrycodes.csv')
for k in ['official_name_en', 'official_name_fr', 'name',
          'ITU', 'MARC', 'WMO', 'DS', 'Dial', 'FIFA',
          'FIPS', 'GAUL', 'IOC', 'ISO4217-currency_alphabetic_code',
          'ISO4217-currency_country_name', 'ISO4217-currency_minor_unit',
          'ISO4217-currency_name', 'ISO4217-currency_numeric_code',
          'is_independent', 'Capital', 'TLD', 'Languages',
          'geonameid', 'EDGAR']:
    del isocodes[k]

# Current pandas spells this keyword sheet_name (older releases used sheetname)
allmedals = pd.read_excel('medals.xlsx', sheet_name='Medals')
ioccodes = pd.read_excel('medals.xlsx', sheet_name='Codes')
del ioccodes['Country.1']
codes = pd.merge(ioccodes, isocodes, left_on='ISO code',
                 right_on='ISO3166-1-Alpha-2')

# Convert the year of the games to int from str and
# then filter out all records before 1960.
# pd.to_numeric returns a new Series, so the result must be assigned back.
allmedals['Edition'] = pd.to_numeric(allmedals['Edition'])
medals = allmedals[allmedals['Edition'] >= 1960]

# Filter out any duplicates - i.e. for events like the relay
# where each team member is awarded a medal
medals = medals.drop_duplicates(['City', 'Edition', 'Sport',
                                 'Discipline', 'NOC', 'Gender',
                                 'Event', 'Event_gender', 'Medal'])

# Now get the medal counts for each Olympics
grouped = medals.groupby(["Edition", "NOC", "Medal"])["Medal"].\
    count().reset_index(name="count")
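As a side note on the de-duplication step, here is a minimal sketch on made-up rows of what it is meant to achieve. The `toy` frame, its athlete names and the shortened column list are hypothetical and only illustrate the idea; `size()` is used for the count, which should agree with `count()` as long as the Medal column has no missing values.

```python
import pandas as pd

# Hypothetical rows: two relay team members each get a gold medal,
# plus one individual bronze for another country.
toy = pd.DataFrame({
    "Edition": [1960, 1960, 1960],
    "NOC": ["USA", "USA", "GBR"],
    "Event": ["4x100m relay", "4x100m relay", "10m platform"],
    "Athlete": ["Runner A", "Runner B", "Diver C"],
    "Medal": ["Gold", "Gold", "Bronze"],
})

# Leave Athlete out of the subset so the relay gold is counted once per team.
deduped = toy.drop_duplicates(subset=["Edition", "NOC", "Event", "Medal"])

# Count the remaining medals per Edition / NOC / Medal combination.
counts = (
    deduped.groupby(["Edition", "NOC", "Medal"])
    .size()
    .reset_index(name="count")
)
print(counts)
```

Without the `drop_duplicates` call, the USA relay would contribute two golds instead of one.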
I know this must be a fairly standard pandas operation, and I am almost there:
In[139]: grouped.head()
Out[139]:
   Edition  NOC   Medal  count
0     1960  ARG  Bronze      1
1     1960  ARG  Silver      1
2     1960  AUS  Bronze      6
3     1960  AUS    Gold      8
4     1960  AUS  Silver      8
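But I cannot work out how to group/aggregate the grouped DataFrame. Any hints would be much appreciated, as would any other advice (for example, is using del, drop_duplicates(), etc. considered good practice?).

Unstack the Medal column: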
res = grouped.set_index(['Edition', 'NOC', 'Medal']).unstack('Medal', fill_value=0)
res.columns = res.columns.droplevel(0)
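Output (for the grouped.head() rows quoted in the question):
Medal        Bronze  Gold  Silver
Edition NOC
1960    ARG       1     0       1
        AUS       6     8       8

Can you show what the final DataFrame should look like?
If you remove .reset_index(name="count") from the line that builds grouped, this can also be done with one quick extra step.
res has an index made up of (Edition, NOC) tuples. How do I modify res so that it has five columns (Edition, NOC, Bronze, Silver, Gold) instead of a tuple index? (Again, I am sure this must be basic.)
Add one line at the end: res = res.reset_index().

Sample df
import numpy as np
import pandas as pd

ioccodes = ['ABC', 'BCD', 'CDE', 'DEF', 'EFG', 'FGH', 'GHI']
idx = pd.MultiIndex.from_product([np.arange(1960, 2016, 4), ['Gold', 'Silver', 'Bronze']], names=['Edition', 'Medal'])
df = pd.DataFrame({'NOC': np.random.choice(ioccodes, len(idx))}, idx).reset_index()

Solution
# axis is passed by keyword; newer pandas no longer accepts it positionally
df.groupby(['Edition', 'Medal']).NOC.value_counts() \
    .unstack(1).fillna(0).reset_index().rename_axis(None, axis=1)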
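Putting the suggestions from this thread together, the following is a minimal sketch rather than a definitive recipe. It rebuilds the five `grouped.head()` rows from the question, applies the unstack answer plus the `reset_index()` suggestion, and then tries one reading of the comment about skipping `.reset_index(name="count")`. The `raw` frame (one row per medal, reconstructed from the counts) and the explicit column reordering are assumptions added for illustration.

```python
import pandas as pd

# The five rows of grouped.head() quoted in the question.
grouped = pd.DataFrame({
    "Edition": [1960, 1960, 1960, 1960, 1960],
    "NOC": ["ARG", "ARG", "AUS", "AUS", "AUS"],
    "Medal": ["Bronze", "Silver", "Bronze", "Gold", "Silver"],
    "count": [1, 1, 6, 8, 8],
})

# First answer: unstack Medal into columns, then drop the "count" level.
res = grouped.set_index(["Edition", "NOC", "Medal"]).unstack("Medal", fill_value=0)
res.columns = res.columns.droplevel(0)

# Suggested follow-up: turn the (Edition, NOC) index back into columns.
res = res.reset_index()
res.columns.name = None  # remove the leftover "Medal" label on the column axis
res = res[["Edition", "NOC", "Bronze", "Silver", "Gold"]]

# Assumed stand-in for the de-duplicated medals frame: one row per medal.
raw = grouped.loc[
    grouped.index.repeat(grouped["count"].to_numpy()),
    ["Edition", "NOC", "Medal"],
]

# Counting and unstacking in one chain, without the intermediate count column.
direct = (
    raw.groupby(["Edition", "NOC", "Medal"])
    .size()
    .unstack("Medal", fill_value=0)
    .reset_index()
)
direct.columns.name = None
direct = direct[["Edition", "NOC", "Bronze", "Silver", "Gold"]]

print(res)
print(direct)
```

Both frames should hold the same counts. The second form avoids building the long `grouped` table first, at the cost of being a little harder to inspect step by step.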