Python Pandas - Pivot a column into a (conditionally) aggregated string
Suppose I convert the following dataset into a DataFrame:
import datetime
import pandas as pd

data = [
    ['Job 1', datetime.date(2019, 6, 9), 'Jim', 'Tom'],
    ['Job 1', datetime.date(2019, 6, 9), 'Bill', 'Tom'],
    ['Job 1', datetime.date(2019, 6, 9), 'Tom', 'Tom'],
    ['Job 1', datetime.date(2019, 6, 10), 'Bill', None],
    ['Job 2', datetime.date(2019, 6, 10), 'Tom', 'Tom']
]
df = pd.DataFrame(data, columns=['Job', 'Date', 'Employee', 'Manager'])
This produces a DataFrame that looks like:
Job Date Employee Manager
0 Job 1 2019-06-09 Jim Tom
1 Job 1 2019-06-09 Bill Tom
2 Job 1 2019-06-09 Tom Tom
3 Job 1 2019-06-10 Bill None
4 Job 2 2019-06-10 Tom Tom
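What I am trying to produce is a pivot on each unique Job/Date combination, with one column for the Manager and another column holding a comma-separated string of the non-manager employees. A couple of things need to be assumed, including that all employee names are unique. I would like the resulting DataFrame to look like: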
Job Date Manager Employees
0 Job 1 2019-06-09 Tom Jim, Bill
1 Job 1 2019-06-10 None Bill
2 Job 2 2019-06-10 Tom None
Which brings me to my question:
df.groupby(["Job","Date","Manager"]).apply( lambda x: ",".join(x.Employee))
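This finds every unique combination of Job, Date, and Manager and joins the employees of each into a single string separated by ",".

Then fix up the Employee column by removing the manager and setting the value to None where appropriate. Because employees are unique, a set works well here for removing the manager: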
s = df.groupby(['Job', 'Date']).agg({'Manager': 'first', 'Employee': lambda x: set(x)})
s['Employee'] = [', '.join(x.difference({y})) for x,y in zip(s.Employee, s.Manager)]
s['Employee'] = s.Employee.replace({'': None})
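Because sets are unordered, the joined names can come out in any order. As a hedged alternative, the sketch below filters out the manager rows first and then aggregates, which keeps the employees in their original row order. The names `managers`, `non_managers`, `employees`, and `result` are illustrative and not part of the original answer.

```python
import datetime
import pandas as pd

data = [
    ['Job 1', datetime.date(2019, 6, 9), 'Jim', 'Tom'],
    ['Job 1', datetime.date(2019, 6, 9), 'Bill', 'Tom'],
    ['Job 1', datetime.date(2019, 6, 9), 'Tom', 'Tom'],
    ['Job 1', datetime.date(2019, 6, 10), 'Bill', None],
    ['Job 2', datetime.date(2019, 6, 10), 'Tom', 'Tom'],
]
df = pd.DataFrame(data, columns=['Job', 'Date', 'Employee', 'Manager'])

# One manager per Job/Date group; a group with no manager stays missing.
managers = df.groupby(['Job', 'Date'])['Manager'].first()

# Drop the rows where the employee is the manager, then join the rest in row order.
non_managers = df[df['Employee'] != df['Manager']]
employees = non_managers.groupby(['Job', 'Date'])['Employee'].agg(', '.join)

# Reindex so that groups without any non-manager employee are kept (as missing values).
result = pd.DataFrame({
    'Manager': managers,
    'Employees': employees.reindex(managers.index),
}).reset_index()
print(result)
```

Groups that contain only the manager (Job 2 here) end up with a missing value in `Employees` rather than an empty string.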
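The tricky part here is removing the manager from the employee column.
I tend to build a dictionary with the desired result and then rebuild the DataFrame: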
d = {}
# Collect the manager and the set of employees for each (Job, Date) pair
for t in df.itertuples():
    d_ = d.setdefault((t.Job, t.Date), {})
    d_['Manager'] = t.Manager
    d_.setdefault('Employees', set()).add(t.Employee)
# Remove the manager from each set, then join the remaining names
for k, v in d.items():
    v['Employees'] -= {v['Manager']}
    v['Employees'] = ', '.join(v['Employees'])
pd.DataFrame(d.values(), d).rename_axis(['Job', 'Date']).reset_index()
Job Date Employees Manager
0 Job 1 2019-06-09 Bill, Jim Tom
1 Job 1 2019-06-10 Bill None
2 Job 2 2019-06-10 Tom
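The set-based version above can list employees in an arbitrary order ("Bill, Jim" in the output shown). A minimal variation, assuming that preserving the original row order matters, uses lists instead of sets and builds plain records. The names `groups`, `records`, and `result` are hypothetical, chosen only for this sketch.

```python
import datetime
import pandas as pd

data = [
    ['Job 1', datetime.date(2019, 6, 9), 'Jim', 'Tom'],
    ['Job 1', datetime.date(2019, 6, 9), 'Bill', 'Tom'],
    ['Job 1', datetime.date(2019, 6, 9), 'Tom', 'Tom'],
    ['Job 1', datetime.date(2019, 6, 10), 'Bill', None],
    ['Job 2', datetime.date(2019, 6, 10), 'Tom', 'Tom'],
]
df = pd.DataFrame(data, columns=['Job', 'Date', 'Employee', 'Manager'])

# Collect the manager and an ordered list of employees per (Job, Date).
groups = {}
for row in df.itertuples(index=False):
    entry = groups.setdefault((row.Job, row.Date), {'Manager': None, 'Employees': []})
    if pd.notna(row.Manager):
        entry['Manager'] = row.Manager
    entry['Employees'].append(row.Employee)

# Remove the manager from each list and join what is left.
records = []
for (job, date), entry in groups.items():
    others = [e for e in entry['Employees'] if e != entry['Manager']]
    records.append({
        'Job': job,
        'Date': date,
        'Manager': entry['Manager'],
        'Employees': ', '.join(others) if others else None,
    })

result = pd.DataFrame(records)
print(result)
```

Since dictionaries keep insertion order, the groups appear in the order they are first seen in the data.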
In your case, try it without a lambda, using transform + drop_duplicates:
df['Employee']=df['Employee'].mask(df['Employee'].eq(df.Manager)).dropna().groupby([df['Job'], df['Date']]).transform('unique').str.join(',')
df=df.drop_duplicates(['Job','Date'])
df
Out[745]:
Job Date Employee Manager
0 Job 1 2019-06-09 Jim,Bill Tom
3 Job 1 2019-06-10 Bill None
4 Job 2 2019-06-10 NaN Tom
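To see what the `mask(...).dropna()` step in the one-liner above does on its own, here is a small sketch on two standalone Series that mirror the Employee and Manager columns. The variable names are illustrative only.

```python
import pandas as pd

employees = pd.Series(['Jim', 'Bill', 'Tom', 'Bill', 'Tom'], name='Employee')
managers = pd.Series(['Tom', 'Tom', 'Tom', None, 'Tom'], name='Manager')

# Element-wise comparison; a missing manager never equals an employee name.
is_manager = employees.eq(managers)

# mask() replaces the values where the condition is True with NaN ...
masked = employees.mask(is_manager)

# ... and dropna() removes those rows, keeping the original index labels.
kept = masked.dropna()
print(kept)
```

Because `kept` no longer has rows 2 and 4, assigning a result derived from it back to `df['Employee']` aligns on the index and leaves NaN in those rows, which is why Job 2 shows NaN in the output above.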
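Can an employee and a manager have the same name?
No, see point 1 of my assumptions: all employee names are unique (I am actually using employee IDs rather than names). To clarify, a manager is considered an "employee"; manager is just a special role on the job.
Right under "assume": "I would like the resulting DataFrame to look like:"
Yes, sorry... I see it now.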
Employee Manager
Job Date
Job 1 2019-06-09 Jim,Bill Tom
2019-06-10 Bill None
Job 2 2019-06-10 NaN Tom