Python 如何在Pandas中使用Groupby在聚合后取消堆栈
你好,数据科学家和熊猫专家 在应用groupby聚合方法后,我需要一些帮助来找出如何更好地组织数据。我已经尝试取消堆叠到新的数据帧,但它没有产生预期的结果 这是我的数据框:Python 如何在Pandas中使用Groupby在聚合后取消堆栈,python,pandas,dataframe,pandas-groupby,Python,Pandas,Dataframe,Pandas Groupby,你好,数据科学家和熊猫专家 在应用groupby聚合方法后,我需要一些帮助来找出如何更好地组织数据。我已经尝试取消堆叠到新的数据帧,但它没有产生预期的结果 这是我的数据框: df = [{'Store': 's1', 'Date': Timestamp('2020-08-01 00:00:00'), 'Employee': 'a', 'Department': 'd1', 'ID': 's1ad1', 'Level': 2, 'duties': 'O'},\ {'Store': 's1', '
df = [{'Store': 's1', 'Date': Timestamp('2020-08-01 00:00:00'), 'Employee': 'a', 'Department': 'd1', 'ID': 's1ad1', 'Level': 2, 'duties': 'O'},\
{'Store': 's1', 'Date': Timestamp('2020-08-02 00:00:00'), 'Employee': 'a', 'Department': 'd2', 'ID': 's1ad2', 'Level': 2, 'duties': 'C'},\
{'Store': 's1', 'Date': Timestamp('2020-08-03 00:00:00'), 'Employee': 'a', 'Department': 'd1', 'ID': 's1ad1', 'Level': 2, 'duties': 'O'},\
{'Store': 's1', 'Date': Timestamp('2020-08-04 00:00:00'), 'Employee': 'a', 'Department': 'd1', 'ID': 's1ad1', 'Level': 2, 'duties': 'O'},\
{'Store': 's1', 'Date': Timestamp('2020-08-05 00:00:00'), 'Employee': 'a', 'Department': 'd2', 'ID': 's1ad2', 'Level': 2, 'duties': 'C'},\
{'Store': 's2', 'Date': Timestamp('2020-08-08 00:00:00'), 'Employee': 'a', 'Department': 'd1', 'ID': 's2ad1', 'Level': 2, 'duties': 'O'},\
{'Store': 's2', 'Date': Timestamp('2020-08-09 00:00:00'), 'Employee': 'a', 'Department': 'd3', 'ID': 's2ad3', 'Level': 2, 'duties': 'C'},\
{'Store': 's2', 'Date': Timestamp('2020-08-10 00:00:00'), 'Employee': 'a', 'Department': 'd1', 'ID': 's2ad1', 'Level': 2, 'duties': 'O'},\
{'Store': 's2', 'Date': Timestamp('2020-08-11 00:00:00'), 'Employee': 'a', 'Department': 'd1', 'ID': 's2ad1', 'Level': 2, 'duties': 'O'},\
{'Store': 's2', 'Date': Timestamp('2020-08-12 00:00:00'), 'Employee': 'a', 'Department': 'd3', 'ID': 's3ad1', 'Level': 2, 'duties': 'C'},\
{'Store': 's1', 'Date': Timestamp('2020-08-01 00:00:00'), 'Employee': 'b', 'Department': 'd1', 'ID': 's1bd1', 'Level': 1, 'duties': 'O'},\
{'Store': 's1', 'Date': Timestamp('2020-08-02 00:00:00'), 'Employee': 'b', 'Department': 'd2', 'ID': 's1bd2', 'Level': 1, 'duties': 'C'},\
{'Store': 's1', 'Date': Timestamp('2020-08-03 00:00:00'), 'Employee': 'b', 'Department': 'd1', 'ID': 's1bd1', 'Level': 1, 'duties': 'O'},\
{'Store': 's1', 'Date': Timestamp('2020-08-04 00:00:00'), 'Employee': 'b', 'Department': 'd1', 'ID': 's1bd1', 'Level': 1, 'duties': 'O'},\
{'Store': 's1', 'Date': Timestamp('2020-08-05 00:00:00'), 'Employee': 'b', 'Department': 'd2', 'ID': 's1bd2', 'Level': 1, 'duties': 'C'},\
{'Store': 's2', 'Date': Timestamp('2020-08-08 00:00:00'), 'Employee': 'c', 'Department': 'd1', 'ID': 's2ac1', 'Level': 3, 'duties': 'O'},\
{'Store': 's2', 'Date': Timestamp('2020-08-09 00:00:00'), 'Employee': 'c', 'Department': 'd3', 'ID': 's2cd3', 'Level': 3, 'duties': 'C'},\
{'Store': 's2', 'Date': Timestamp('2020-08-10 00:00:00'), 'Employee': 'c', 'Department': 'd1', 'ID': 's2cd1', 'Level': 3, 'duties': 'O'},\
{'Store': 's2', 'Date': Timestamp('2020-08-11 00:00:00'), 'Employee': 'c', 'Department': 'd1', 'ID': 's2cd1', 'Level': 3, 'duties': 'O'},\
{'Store': 's2', 'Date': Timestamp('2020-08-12 00:00:00'), 'Employee': 'c', 'Department': 'd3', 'ID': 's3cd1', 'Level': 3, 'duties': 'C'},\
{'Store': 's3', 'Date': Timestamp('2020-08-08 00:00:00'), 'Employee': 'd', 'Department': 'd1', 'ID': 's3cd1', 'Level': 3, 'duties': 'O'},\
{'Store': 's3', 'Date': Timestamp('2020-08-09 00:00:00'), 'Employee': 'd', 'Department': 'd3', 'ID': 's3dd3', 'Level': 3, 'duties': 'C'},\
{'Store': 's3', 'Date': Timestamp('2020-08-10 00:00:00'), 'Employee': 'd', 'Department': 'd1', 'ID': 's3dd1', 'Level': 3, 'duties': 'O'},\
{'Store': 's3', 'Date': Timestamp('2020-08-11 00:00:00'), 'Employee': 'd','Department': 'd1', 'ID': 's3dd1', 'Level': 3, 'duties': 'O'},\
{'Store': 's3', 'Date': Timestamp('2020-08-12 00:00:00'), 'Employee': 'd', 'Department': 'd3', 'ID': 's3dd1', 'Level': 3, 'duties': 'C'}]
我想组织我的输出,使其在商店-->部门-->员工中整齐地堆叠。如下所示(很抱歉,输出没有很好地排列):
因此,我试着按表达式分组如下:
df = df.groupby(['Employee', 'Store', 'Department'])\
.agg({'Date':['first', 'last', 'size'],
'ID': 'first',
'Level': 'first',
'duties': 'first'})
# Join the Each Column with its operation.
df.columns = df.columns.map('_'.join)
# Reset the Index
df = df.reset_index().set_index('Employee')
# Renaming Columns of Dataframe.
df.rename(columns={'Date_first':'First Shift',
'Date_last':'Last Shift',
'Date_size':'# of shift',
'ID_first':'ID',
'Level_first':'Level',
'duties_first':'duties'},
inplace=True)
这将打印以下结果:
Store Department First Shift Last Shift # of shift ID_first Level duties
Employee
a s1 d1 2020-08-01 2020-08-04 3 s1ad1 2 O
a s1 d2 2020-08-02 2020-08-05 2 s1ad2 2 C
a s2 d1 2020-08-08 2020-08-11 3 s2ad1 2 O
a s2 d3 2020-08-09 2020-08-12 2 s2ad3 2 C
b s1 d1 2020-08-01 2020-08-04 3 s1bd1 1 O
b s1 d2 2020-08-02 2020-08-05 2 s1bd2 1 C
c s2 d1 2020-08-08 2020-08-11 3 s2ac1 3 O
c s2 d3 2020-08-09 2020-08-12 2 s2cd3 3 C
d s3 d1 2020-08-08 2020-08-11 3 s3cd1 3 O
d s3 d3 2020-08-09 2020-08-12 2 s3dd3 3 C
然后我应用了unstack表达式,如下所示:
df = df.groupby(['Employee', 'Store', 'Department', 'First Shift', 'Last Shift', '# of shift', 'ID_first',\
'Level', 'duties'])\
.size()\
.unstack(['Store', 'Department']).fillna(0)
Store s1 s2 s3
Department d1 d2 d1 d3 d1 d3
Employee First Shift Last Shift # of shift ID_first Level duties
a 2020-08-01 2020-08-04 3 s1ad1 2 O 1.0 0.0 0.0 0.0 0.0 0.0
2020-08-02 2020-08-05 2 s1ad2 2 C 0.0 1.0 0.0 0.0 0.0 0.0
2020-08-08 2020-08-11 3 s2ad1 2 O 0.0 0.0 1.0 0.0 0.0 0.0
2020-08-09 2020-08-12 2 s2ad3 2 C 0.0 0.0 0.0 1.0 0.0 0.0
b 2020-08-01 2020-08-04 3 s1bd1 1 O 1.0 0.0 0.0 0.0 0.0 0.0
2020-08-02 2020-08-05 2 s1bd2 1 C 0.0 1.0 0.0 0.0 0.0 0.0
c 2020-08-08 2020-08-11 3 s2ac1 3 O 0.0 0.0 1.0 0.0 0.0 0.0
2020-08-09 2020-08-12 2 s2cd3 3 C 0.0 0.0 0.0 1.0 0.0 0.0
d 2020-08-08 2020-08-11 3 s3cd1 3 O 0.0 0.0 0.0 0.0 1.0 0.0
2020-08-09 2020-08-12 2 s3dd3 3 C 0.0 0.0 0.0 0.0 0.0 1.0
它将打印结果,如下所示:
df = df.groupby(['Employee', 'Store', 'Department', 'First Shift', 'Last Shift', '# of shift', 'ID_first',\
'Level', 'duties'])\
.size()\
.unstack(['Store', 'Department']).fillna(0)
Store s1 s2 s3
Department d1 d2 d1 d3 d1 d3
Employee First Shift Last Shift # of shift ID_first Level duties
a 2020-08-01 2020-08-04 3 s1ad1 2 O 1.0 0.0 0.0 0.0 0.0 0.0
2020-08-02 2020-08-05 2 s1ad2 2 C 0.0 1.0 0.0 0.0 0.0 0.0
2020-08-08 2020-08-11 3 s2ad1 2 O 0.0 0.0 1.0 0.0 0.0 0.0
2020-08-09 2020-08-12 2 s2ad3 2 C 0.0 0.0 0.0 1.0 0.0 0.0
b 2020-08-01 2020-08-04 3 s1bd1 1 O 1.0 0.0 0.0 0.0 0.0 0.0
2020-08-02 2020-08-05 2 s1bd2 1 C 0.0 1.0 0.0 0.0 0.0 0.0
c 2020-08-08 2020-08-11 3 s2ac1 3 O 0.0 0.0 1.0 0.0 0.0 0.0
2020-08-09 2020-08-12 2 s2cd3 3 C 0.0 0.0 0.0 1.0 0.0 0.0
d 2020-08-08 2020-08-11 3 s3cd1 3 O 0.0 0.0 0.0 0.0 1.0 0.0
2020-08-09 2020-08-12 2 s3dd3 3 C 0.0 0.0 0.0 0.0 0.0 1.0
我认为我错误地使用了大小和展开。然而,我似乎不知道如何重新组织数据
我非常感谢专家对如何正确组织我的数据的意见
再次感谢您的帮助和关心
谢谢。问题的关键在于,您需要在使用
.unstack()
之前重新构造数据,因为您需要的格式是一个矩阵,其值是三个重复的列。因此,您需要将数据帧从宽改为长,并在一列中创建一个新列,其中包含这三个值values
,另一列对它们进行分类Shift
# Step 1: Named Groupby Agregation naming columns ins specific format required for `pd.wide_to_long`. Must end with integer.
df = (df.groupby(['Employee', 'Store', 'Department'])
.agg(Shift_1=('Date','first'),
Shift_2=('Date','last'),
Shift_3=('Date','size'),
ID=('ID', 'first'),
Level=('Level', 'first'),
duties=('duties', 'first'))
.reset_index())
# Step 2: In preparation for a matrix The data must be transformed so that the three columns that are values in the matrix must be in long format.
df = pd.wide_to_long(df, stubnames='Shift_', i='ID', j='Shift').reset_index().rename(columns={'Shift_':'Values'})
# Step 3: 1,2,3 were required integer suffixes for wide_to_long but now let's change to what we want the columns to be called.
df['Shift'] = df['Shift'].replace([1,2,3],['First Shift','Last Shift','# of shift'])
# Step 4: Create the matrix be setting index and unstacking to columns
df = (df.sort_values(['Employee', 'Store', 'Department']) #values must be sorted in order for how we want columns to appear in matrix format
.set_index(['Employee', 'ID', 'Level', 'duties', 'Store', 'Department', 'Shift'])
.unstack(['Store', 'Department', 'Shift']).fillna(0))
# Step 5: Cleanup of Multi-index into desred format
df.columns = df.columns.reorder_levels([1,2,3,0]).droplevel(3)
df = df.reset_index()
df
Out[1]:
Store Employee ID Level duties s1 \
Department d1
Shift First Shift
0 a s1ad1 2 O 2020-08-01 00:00:00
1 a s1ad2 2 C 0
2 a s2ad1 2 O 0
3 a s2ad3 2 C 0
4 b s1bd1 1 O 2020-08-01 00:00:00
5 b s1bd2 1 C 0
6 c s2ac1 3 O 0
7 c s2cd3 3 C 0
8 d s3cd1 3 O 0
9 d s3dd3 3 C 0
Store \
Department d2
Shift Last Shift # of shift First Shift
0 2020-08-04 00:00:00 3 0
1 0 0 2020-08-02 00:00:00
2 0 0 0
3 0 0 0
4 2020-08-04 00:00:00 3 0
5 0 0 2020-08-02 00:00:00
6 0 0 0
7 0 0 0
8 0 0 0
9 0 0 0
Store ... s2 \
Department ... d1
Shift Last Shift # of shift ... # of shift
0 0 0 ... 0
1 2020-08-05 00:00:00 2 ... 0
2 0 0 ... 3
3 0 0 ... 0
4 0 0 ... 0
5 2020-08-05 00:00:00 2 ... 0
6 0 0 ... 3
7 0 0 ... 0
8 0 0 ... 0
9 0 0 ... 0
Store \
Department d3
Shift First Shift Last Shift # of shift
0 0 0 0
1 0 0 0
2 0 0 0
3 2020-08-09 00:00:00 2020-08-12 00:00:00 2
4 0 0 0
5 0 0 0
6 0 0 0
7 2020-08-09 00:00:00 2020-08-12 00:00:00 2
8 0 0 0
9 0 0 0
Store s3 \
Department d1
Shift First Shift Last Shift # of shift
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
5 0 0 0
6 0 0 0
7 0 0 0
8 2020-08-08 00:00:00 2020-08-11 00:00:00 3
9 0 0 0
Store
Department d3
Shift First Shift Last Shift # of shift
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
5 0 0 0
6 0 0 0
7 0 0 0
8 0 0 0
9 2020-08-09 00:00:00 2020-08-12 00:00:00 2
[10 rows x 22 columns]
请单击答案旁边的复选标记,接受其中一个答案作为解决方案。非常感谢。