Python pandas: how to group by while getting value 'yes' from any row of each column
I need to combine rows by the column "name". Some rows in the table have the value "yes" in different columns, as shown below. The following template gives the input and the expected output:
name  department  feature1  feature2  feature3
x1    cs                    yes       yes
x1    cs          yes
x1    ec
x2    cs          yes       yes
x2    ec                    yes
The output I need is:
x1    cs          yes       yes       yes
x1    ec
x2    cs          yes       yes
x2    ec                    yes
Please suggest a solution using Python and pandas.
You can use:
# keep only the exact 'yes' values: True where a cell equals 'yes', False otherwise
cols = df.columns.difference(['name','department'])
df[cols] = df[cols] == 'yes'
print (df)
  name department  feature1  feature2  feature3
0   x1         cs     False      True      True
1   x1         cs      True     False     False
2   x1         ec     False     False     False
3   x2         cs      True      True     False
4   x2         ec     False      True     False
Then group by name and department, aggregate with max, and finally replace the booleans using a dict:
df= df.groupby(['name','department']) \
.max() \
.replace({True:'yes',False:np.nan}) \
.reset_index()
print (df)
  name department feature1 feature2 feature3
0   x1         cs      yes      yes      yes
1   x1         ec      NaN      NaN      NaN
2   x2         cs      yes      yes      NaN
3   x2         ec      NaN      yes      NaN
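The two steps above assume that df already exists. As a minimal, self-contained sketch, the input below is rebuilt from the question's table under the assumption that blank cells are NaN. The names keys, flags and merged are illustrative, and the final step swaps the dict-based replace for a per-column Series.map, which is a substitution made here and not part of the answer.

```python
import numpy as np
import pandas as pd

# Sample input rebuilt from the question's table (assumption: blank cells are NaN).
df = pd.DataFrame({
    "name": ["x1", "x1", "x1", "x2", "x2"],
    "department": ["cs", "cs", "ec", "cs", "ec"],
    "feature1": [np.nan, "yes", np.nan, "yes", np.nan],
    "feature2": ["yes", np.nan, np.nan, "yes", "yes"],
    "feature3": ["yes", np.nan, np.nan, np.nan, np.nan],
})

keys = ["name", "department"]
cols = df.columns.difference(keys)

# Step 1: turn the feature columns into booleans (True only where the cell is 'yes').
flags = df[keys].join(df[cols] == "yes")

# Step 2: the max of booleans per group is True if any row in the group was True.
merged = flags.groupby(keys)[list(cols)].max().reset_index()

# Step 3: map the booleans back to 'yes' / NaN, one column at a time.
for col in cols:
    merged[col] = merged[col].map({True: "yes", False: np.nan})

print(merged)
```

Working on a joined copy (flags) leaves the original df untouched, which makes it easier to compare input and output while experimenting.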
Thanks for the comment. If all values are only yes and NaN, you can also use:
df = df.fillna('').groupby(['name', 'department']).max().reset_index()
print (df)
  name department feature1 feature2 feature3
0   x1         cs      yes      yes      yes
1   x1         ec
2   x2         cs      yes      yes
3   x2         ec               yes
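The fillna('') variant relies on string ordering: the empty string sorts before 'yes', so max returns 'yes' whenever a group contains one. A small sketch of that reasoning follows, using a sample frame rebuilt from the question's table (assumption: blank cells are NaN); the names filled and merged are illustrative.

```python
import numpy as np
import pandas as pd

# Sample input rebuilt from the question's table (assumption: blank cells are NaN).
df = pd.DataFrame({
    "name": ["x1", "x1", "x1", "x2", "x2"],
    "department": ["cs", "cs", "ec", "cs", "ec"],
    "feature1": [np.nan, "yes", np.nan, "yes", np.nan],
    "feature2": ["yes", np.nan, np.nan, "yes", "yes"],
    "feature3": ["yes", np.nan, np.nan, np.nan, np.nan],
})

# The empty string compares lower than any non-empty string,
# so max over {'', 'yes'} is 'yes' and max over {''} is ''.
assert "" < "yes"

# Replace NaN with '' so every feature cell is a comparable string,
# then take the per-group maximum.
filled = df.fillna("")
merged = filled.groupby(["name", "department"]).max().reset_index()

print(merged)
```

This only holds when the feature columns contain nothing but 'yes' and missing values; any other string would take part in the comparison as well.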
EDIT:
You can build a custom dict of aggregate functions with a dict comprehension and pass it to agg (the code appears after the comments below):
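Are you guaranteed that the yes values will not overlap? If so, you can do: df.groupby(['name','department']).sum()
Why not use .any() instead of .max() (first example)? It seems a better fit for the bool type and it short-circuits. Note: .any() will work on the original data without any mapping.
@AChampion - you are right, any can be used as well. Thanks.
Thanks Bonifacio, jezrael, @AChampion for the prompt replies. I tried the suggested options and they work well. I need to add one more comment to the original question: I have more columns, and I need to maintain the values of those columns.
@sri - are there more columns that need to be handled in another way? Can you explain more?
name department feature1 feature2 feature3 count x1 cs yes 10 x1 cs yes x1 ec x2 cs yes x2 ec yes 20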
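The .any() suggestion from the comments can be sketched as follows. This is a minimal example on a sample frame rebuilt from the question's table (assumption: blank cells are NaN). It applies .any() to the boolean flags, not to the raw strings, so the comment's claim about unmapped data is not exercised here. The names flags, by_any, by_max and same are illustrative.

```python
import numpy as np
import pandas as pd

# Sample input rebuilt from the question's table (assumption: blank cells are NaN).
df = pd.DataFrame({
    "name": ["x1", "x1", "x1", "x2", "x2"],
    "department": ["cs", "cs", "ec", "cs", "ec"],
    "feature1": [np.nan, "yes", np.nan, "yes", np.nan],
    "feature2": ["yes", np.nan, np.nan, "yes", "yes"],
    "feature3": ["yes", np.nan, np.nan, np.nan, np.nan],
})

keys = ["name", "department"]
feature_cols = ["feature1", "feature2", "feature3"]

# Boolean flags: True only where the cell is exactly 'yes'.
flags = df[keys].join(df[feature_cols].eq("yes"))

# On boolean columns, any() and max() answer the same question:
# "is at least one row in this group True?"
by_any = flags.groupby(keys)[feature_cols].any()
by_max = flags.groupby(keys)[feature_cols].max()
same = bool((by_any == by_max).all().all())

print(by_any)
print("any() matches max():", same)
```

Using any() states the intent more directly than max(), while producing the same per-group flags on this data.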
import numpy as np
import pandas as pd

# sample data with two extra columns (count, description) whose values must be kept
d = {'feature3': ['yes', np.nan, np.nan, np.nan, np.nan],
     'feature2': ['yes', np.nan, np.nan, 'yes', 'yes'],
     'name': ['x1', 'x1', 'x1', 'x2', 'x2'],
     'count': [10.0, 30.0, np.nan, 20.0, 3.0],
     'feature1': [np.nan, 'yes', np.nan, 'yes', np.nan],
     'department': ['cs', 'cs', 'ec', 'cs', 'ec'],
     'description': ['xsdepartment1', 'xsdepartment2', np.nan, 'department1', 'department3']}
c = ['name','department','feature1','feature2','feature3','count','description']
df = pd.DataFrame(d, columns = c)
print (df)
  name department feature1 feature2 feature3  count    description
0   x1         cs      NaN      yes      yes   10.0  xsdepartment1
1   x1         cs      yes      NaN      NaN   30.0  xsdepartment2
2   x1         ec      NaN      NaN      NaN    NaN            NaN
3   x2         cs      yes      yes      NaN   20.0    department1
4   x2         ec      NaN      yes      NaN    3.0    department3
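# feature columns: everything except the group keys and the extra columns
cols = df.columns.difference(['name','department','count','description'])
# collect all values of a group into a tuple
f = lambda x: tuple(x)
# 'max' for every feature column, the tuple collector for count and description
d = {x:'max' for x in cols}
d['count'] = f
d['description'] = f
print (d)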
{'feature3': 'max',
'feature1': 'max',
'feature2': 'max',
'description': <function <lambda> at 0x000000000F6FC598>,
'count': <function <lambda> at 0x000000000F6FC598>}
df[cols] = df[cols] == 'yes'
print (df)
  name department  feature1  feature2  feature3  count    description
0   x1         cs     False      True      True   10.0  xsdepartment1
1   x1         cs      True     False     False   30.0  xsdepartment2
2   x1         ec     False     False     False    NaN            NaN
3   x2         cs      True      True     False   20.0    department1
4   x2         ec     False      True     False    3.0    department3
df = df.groupby(['name', 'department']).agg(d).reset_index()
df[cols] = df[cols].replace({True:'yes',False:np.nan})
print (df)
  name department feature3 feature1 feature2                     description         count
0   x1         cs      yes      yes      yes  (xsdepartment1, xsdepartment2)  (10.0, 30.0)
1   x1         ec      NaN      NaN      NaN                          (nan,)        (nan,)
2   x2         cs      NaN      yes      yes                  (department1,)       (20.0,)
3   x2         ec      NaN      NaN      yes                  (department3,)        (3.0,)
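In the output above, groups whose extra columns are empty show up as (nan,). If that is not wanted, one possible variation is to drop missing values before building the tuple. The sketch below is an assumption about what may be preferred, not part of the answer. The helper collect and the names agg_spec, flags and merged are hypothetical, and the booleans are mapped back with Series.map in place of the dict-based replace.

```python
import numpy as np
import pandas as pd

# Same sample data as in the EDIT above.
df = pd.DataFrame({
    "name": ["x1", "x1", "x1", "x2", "x2"],
    "department": ["cs", "cs", "ec", "cs", "ec"],
    "feature1": [np.nan, "yes", np.nan, "yes", np.nan],
    "feature2": ["yes", np.nan, np.nan, "yes", "yes"],
    "feature3": ["yes", np.nan, np.nan, np.nan, np.nan],
    "count": [10.0, 30.0, np.nan, 20.0, 3.0],
    "description": ["xsdepartment1", "xsdepartment2", np.nan,
                    "department1", "department3"],
})

keys = ["name", "department"]
feature_cols = ["feature1", "feature2", "feature3"]


def collect(values):
    """Hypothetical helper: gather a group's non-missing values into a tuple."""
    return tuple(values.dropna())


# 'max' for each feature column, the tuple collector for the extra columns.
agg_spec = {col: "max" for col in feature_cols}
agg_spec["count"] = collect
agg_spec["description"] = collect

# Convert the feature columns to booleans on a copy, then aggregate per group.
flags = df.copy()
flags[feature_cols] = flags[feature_cols].eq("yes")
merged = flags.groupby(keys).agg(agg_spec).reset_index()

# Map the booleans back to 'yes' / NaN, one column at a time.
for col in feature_cols:
    merged[col] = merged[col].map({True: "yes", False: np.nan})

print(merged)
```

With this variation a group that has no count or description yields an empty tuple in place of (nan,); whether that is preferable depends on how the result is consumed downstream.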