Python 如何每行仅合并一次数据帧
抱歉,我对这一切都很陌生,所以请原谅我可能会问的任何愚蠢的问题 我正在尝试合并salesforce使用pandas的两份报告 我有两份报告: 及 并希望生产: 我尝试过合并、连接和对齐数据帧,但是,因为这两个数据帧共享相同的索引,所以我的工厂报告的第一行重复5次,即: 我的下一个想法是使用.align(),然后使用.duplicated()返回一个显示重复行的布尔序列,并清除重复行的内容。如果使用此方法,如何才能清除行的内容 而且这似乎效率很低,很难做到,有没有更优雅的解决方案Python 如何每行仅合并一次数据帧,python,pandas,data-wrangling,Python,Pandas,Data Wrangling,抱歉,我对这一切都很陌生,所以请原谅我可能会问的任何愚蠢的问题 我正在尝试合并salesforce使用pandas的两份报告 我有两份报告: 及 并希望生产: 我尝试过合并、连接和对齐数据帧,但是,因为这两个数据帧共享相同的索引,所以我的工厂报告的第一行重复5次,即: 我的下一个想法是使用.align(),然后使用.duplicated()返回一个显示重复行的布尔序列,并清除重复行的内容。如果使用此方法,如何才能清除行的内容 而且这似乎效率很低,很难做到,有没有更优雅的解决方案 干杯以下是一
干杯以下是一些示例数据 劳工报告.csv
DJR Number,Date,Foreman,Labour Name,Start,Finish,Hours Worked
1,1/01/2020,Emersen,Tai,6:30,15:00,8:30
1,1/01/2020,Emersen,Rachel,6:30,15:00,8:30
1,1/01/2020,Emersen,Andrew,6:30,15:00,8:30
1,1/01/2020,Emersen,Liam,6:30,15:00,8:30
2,2/01/2020,Scott,Rachel,6:30,15:00,8:30
3,3/01/2020,Michael,Andrew,6:30,15:00,8:30
DJR Number,Date,Foreman,Plant,Plant Hours
1,1/01/2020,Emersen,1,5
2,1/01/2020,Scott,2,5
3,1/01/2020,Michael,2,5
plant-report.csv
DJR Number,Date,Foreman,Labour Name,Start,Finish,Hours Worked
1,1/01/2020,Emersen,Tai,6:30,15:00,8:30
1,1/01/2020,Emersen,Rachel,6:30,15:00,8:30
1,1/01/2020,Emersen,Andrew,6:30,15:00,8:30
1,1/01/2020,Emersen,Liam,6:30,15:00,8:30
2,2/01/2020,Scott,Rachel,6:30,15:00,8:30
3,3/01/2020,Michael,Andrew,6:30,15:00,8:30
DJR Number,Date,Foreman,Plant,Plant Hours
1,1/01/2020,Emersen,1,5
2,1/01/2020,Scott,2,5
3,1/01/2020,Michael,2,5
下面的代码可以使用。逻辑是添加一个额外的列来标记行是否需要联接。然后我们可以在pandas上进行左合并,然后在合并后删除这个额外的列
import pandas as pd
import numpy as np
labour_report = pd.read_csv('labour-report.csv')
plant_report = pd.read_csv('plant-report.csv')
should_join_row_lbr=[]
foremen = {}
for index, labour_report_row in labour_report.iterrows():
foreman = labour_report_row['Foreman']
if foreman in foremen:
should_join_row_lbr.append(0) # Dont join this row
else:
foremen[foreman]=1
# Join this row as this is the first record for the foreman
should_join_row_lbr.append(1)
labour_report['Should Join Row']=should_join_row_lbr
should_join_row_plnt = [1]*plant_report['DJR Number'].count()
plant_report['Should Join Row']=should_join_row_plnt
# Do a left join with Should Join column as well,
# Only the records that have value 1 will be joined from labour_report
combined_report = pd.merge(labour_report,plant_report,on=['Foreman','Should Join Row'],how='left')
combined_report = combined_report.drop(columns=['Should Join Row'])
combined_report = combined_report.replace(np.nan, '', regex=True)
print(combined_report)
combined_report.to_csv('combined_report.csv')
应该产生结果
DJR Number_x Date_x Foreman Labour Name Start Finish Hours Worked DJR Number_y Date_y Plant Plant Hours
0 1 1/01/2020 Emersen Tai 6:30 15:00 8:30 1 1/01/2020 1 5
1 1 1/01/2020 Emersen Rachel 6:30 15:00 8:30
2 1 1/01/2020 Emersen Andrew 6:30 15:00 8:30
3 1 1/01/2020 Emersen Liam 6:30 15:00 8:30
4 2 2/01/2020 Scott Rachel 6:30 15:00 8:30 2 1/01/2020 2 5
5 3 3/01/2020 Michael Andrew 6:30 15:00 8:30 3 1/01/2020 2 5
下面是一些示例数据 劳工报告.csv
DJR Number,Date,Foreman,Labour Name,Start,Finish,Hours Worked
1,1/01/2020,Emersen,Tai,6:30,15:00,8:30
1,1/01/2020,Emersen,Rachel,6:30,15:00,8:30
1,1/01/2020,Emersen,Andrew,6:30,15:00,8:30
1,1/01/2020,Emersen,Liam,6:30,15:00,8:30
2,2/01/2020,Scott,Rachel,6:30,15:00,8:30
3,3/01/2020,Michael,Andrew,6:30,15:00,8:30
DJR Number,Date,Foreman,Plant,Plant Hours
1,1/01/2020,Emersen,1,5
2,1/01/2020,Scott,2,5
3,1/01/2020,Michael,2,5
plant-report.csv
DJR Number,Date,Foreman,Labour Name,Start,Finish,Hours Worked
1,1/01/2020,Emersen,Tai,6:30,15:00,8:30
1,1/01/2020,Emersen,Rachel,6:30,15:00,8:30
1,1/01/2020,Emersen,Andrew,6:30,15:00,8:30
1,1/01/2020,Emersen,Liam,6:30,15:00,8:30
2,2/01/2020,Scott,Rachel,6:30,15:00,8:30
3,3/01/2020,Michael,Andrew,6:30,15:00,8:30
DJR Number,Date,Foreman,Plant,Plant Hours
1,1/01/2020,Emersen,1,5
2,1/01/2020,Scott,2,5
3,1/01/2020,Michael,2,5
下面的代码可以使用。逻辑是添加一个额外的列来标记行是否需要联接。然后我们可以在pandas上进行左合并,然后在合并后删除这个额外的列
import pandas as pd
import numpy as np
labour_report = pd.read_csv('labour-report.csv')
plant_report = pd.read_csv('plant-report.csv')
should_join_row_lbr=[]
foremen = {}
for index, labour_report_row in labour_report.iterrows():
foreman = labour_report_row['Foreman']
if foreman in foremen:
should_join_row_lbr.append(0) # Dont join this row
else:
foremen[foreman]=1
# Join this row as this is the first record for the foreman
should_join_row_lbr.append(1)
labour_report['Should Join Row']=should_join_row_lbr
should_join_row_plnt = [1]*plant_report['DJR Number'].count()
plant_report['Should Join Row']=should_join_row_plnt
# Do a left join with Should Join column as well,
# Only the records that have value 1 will be joined from labour_report
combined_report = pd.merge(labour_report,plant_report,on=['Foreman','Should Join Row'],how='left')
combined_report = combined_report.drop(columns=['Should Join Row'])
combined_report = combined_report.replace(np.nan, '', regex=True)
print(combined_report)
combined_report.to_csv('combined_report.csv')
应该产生结果
DJR Number_x Date_x Foreman Labour Name Start Finish Hours Worked DJR Number_y Date_y Plant Plant Hours
0 1 1/01/2020 Emersen Tai 6:30 15:00 8:30 1 1/01/2020 1 5
1 1 1/01/2020 Emersen Rachel 6:30 15:00 8:30
2 1 1/01/2020 Emersen Andrew 6:30 15:00 8:30
3 1 1/01/2020 Emersen Liam 6:30 15:00 8:30
4 2 2/01/2020 Scott Rachel 6:30 15:00 8:30 2 1/01/2020 2 5
5 3 3/01/2020 Michael Andrew 6:30 15:00 8:30 3 1/01/2020 2 5
假设您不需要在最终结果上重复“DJR Number”列。我们可以根据DJR编号连接这两个数据帧。Duplicated函数将返回一系列关于它们是否重复的布尔值。如果它们是真的,我们给它赋值为空
merged = labour.merge(plant,on='DJR Number')
merged.loc[merged['DJR Number'].duplicated(),['Date_y','Foreman_y','Plant','Plant Hours']] = np.NaN
如果您想要更清晰的结果,可以合并到三列上
merged = labour.merge(plant,on=['DJR Number','Date','Foreman'])
merged.loc[merged['DJR Number'].duplicated(),['Plant','Plant Hours']] = np.NaN
这不会导致任何重复的列
请参阅文档。假设您不需要在最终结果上重复“DJR编号”列。我们可以根据DJR编号连接这两个数据帧。Duplicated函数将返回一系列关于它们是否重复的布尔值。如果它们是真的,我们给它赋值为空
merged = labour.merge(plant,on='DJR Number')
merged.loc[merged['DJR Number'].duplicated(),['Date_y','Foreman_y','Plant','Plant Hours']] = np.NaN
如果您想要更清晰的结果,可以合并到三列上
merged = labour.merge(plant,on=['DJR Number','Date','Foreman'])
merged.loc[merged['DJR Number'].duplicated(),['Plant','Plant Hours']] = np.NaN
这不会导致任何重复的列
请参阅文档。您能提供人们可以使用的复制/可复制数据吗?您能提供人们可以使用的复制/可复制数据吗?谢谢,伙计@EmersenTamura Paki如果这对您有效,请将答案标记为正确。如果你需要改进,请随意添加评论。对不起,出于某种原因,我没有选择正确答案的权利。我觉得我对这个网站太陌生了。@emersentamurapaki没问题,我很高兴问题解决了。谢谢你,伙计@EmersenTamura Paki如果这对您有效,请将答案标记为正确。如果你需要改进,请随意添加评论。对不起,出于某种原因,我没有选择正确答案的权利。我觉得我对这个网站太陌生了。@emersentamurapaki没问题,我很高兴问题解决了。