Pandas: comparing multiple records between two DataFrames
I am trying to compare "Cntr No" in df1 and df2. The value in any one of the df2 columns ["Labour Cost", "Material Cost", "Amount in Estimate Currency"] must match the Total in df1. For example, df1 OOLU 3868088 matches df2 OOLU 3868088, and the df1 Total of 28 matches the df2 "Labour Cost" value of 28.
Expected output:
Cntr No Total Tally_with_df2
0 OOLU 3868088 12.0 Yes
1 OOLU 3868088 28.0 Yes
2 OOLU 3868088 48.0 Yes
3 TRIU 0625840 119.0 Yes
4 TRIU 0625840 82.5 Yes
5 TRIU 0625840 11.0 No
6 TRIU 1234567 18.0 No
Code used: this is the code I tried, but it does not give me the result I need.
cols = ['Labour Cost', 'Material Cost', 'Amount in Estimate Currency']
d = {k: set(v.values()) for k, v in \
df_co.set_index('Cntr No')[cols].to_dict(orient='index').items()}
df['Tally'] = [j in d.get(i, set()) for i, j in zip(df['Cntr No'], df['Total'])]
df['Tally'] = df['Tally'].map({True: 'Yes', False: 'No'})
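One possible reason the attempt above falls short: as far as I know, recent pandas versions raise a ValueError from to_dict(orient='index') when the index is not unique, and 'Cntr No' repeats in this data. Below is a minimal sketch that builds the same dict of sets while tolerating repeated container numbers. The rows in df_co and df are made-up illustrative values, not the real frames.

```python
import pandas as pd

cols = ['Labour Cost', 'Material Cost', 'Amount in Estimate Currency']

# Hypothetical df_co with a repeated 'Cntr No' (illustrative values only)
df_co = pd.DataFrame({
    'Cntr No': ['OOLU 3868088', 'OOLU 3868088', 'TRIU 0625840'],
    'Labour Cost': [12.0, 28.0, 82.5],
    'Material Cost': [0.0, 58.91, 0.0],
    'Amount in Estimate Currency': [12.0, 87.81, 82.5],
})

# Hypothetical frame to be tallied (illustrative values only)
df = pd.DataFrame({
    'Cntr No': ['OOLU 3868088', 'TRIU 0625840', 'TRIU 1234567'],
    'Total': [28.0, 11.0, 18.0],
})

# Reshape to one row per (Cntr No, cost value), then collect the values
# seen for each container into a set. No unique index is required.
long = df_co.melt(id_vars='Cntr No', value_vars=cols, value_name='value')
d = long.groupby('Cntr No')['value'].agg(set).to_dict()

# Same membership test as in the original attempt
df['Tally'] = ['Yes' if total in d.get(cntr, set()) else 'No'
               for cntr, total in zip(df['Cntr No'], df['Total'])]
print(df)
```

A container that is missing from df_co simply falls back to an empty set and is reported as 'No'.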
df1:
df2:
IIUC, we can use groupby on df2 to build, for each unique Cntr No, a set of all its numeric values.
import numpy as np

## this is grouped data: drop the string (object) columns, keep the numeric ones
to_remove = df2.select_dtypes(['object']).columns.values.tolist()
df3 = (df2
       .groupby('Cntr No')
       .apply(lambda df: set(np.concatenate(df.loc[:, df.columns.difference(to_remove)].values))))
## df3 looks like this - a set per container gives fast membership tests
print(df3)
Cntr No
OOLU 3868088 {0.0, 12.0, 48.0, 87.81, 58.91, 28.0}
TRIU 0625840 {0.0, 12.0, 82.5, 54.0, 119.0}
TRIU 1234567 {16.0, 0.0}
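## this function ensures all cases are handled
def get_value(x, data):
    # container number absent from df2 altogether
    if x['Cntr No'] not in data.index:
        return 'Not Found'
    else:
        # check the Total against that container's set of values
        if x['Total'] in data[x['Cntr No']]:
            return 'Yes'
        else:
            return 'No'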
## next we do a simple look-up
df1['Tally_with_df2'] = df1.apply(lambda x: get_value(x, df3), axis=1)
print(df1)
Cntr No Total Tally_with_df2
0 OOLU 3868088 12.0 Yes
1 OOLU 3868088 28.0 Yes
2 OOLU 3868088 48.0 Yes
3 TRIU 0625840 119.0 Yes
4 TRIU 0625840 82.5 Yes
5 TRIU 0625840 11.0 No
6 TRIU 1234567 18.0 No
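One caveat with the look-up above: `in` on a set compares floats exactly, so a Total that was computed (summed, converted, rounded elsewhere) could miss a value that only differs in the last digits. If that turns out to matter for your data, a tolerance-based check is one option. The helper name tally_close and the tolerance of 1e-6 are my own assumptions, not part of the answer above.

```python
import numpy as np

def tally_close(total, candidates, tol=1e-6):
    """Return 'Yes' if any candidate lies within tol of total, else 'No'.

    tally_close and tol are hypothetical; pick a tolerance that suits the data.
    """
    arr = np.fromiter(candidates, dtype=float)
    # An empty candidate set can never match
    if arr.size == 0:
        return 'No'
    # Absolute tolerance only; the relative tolerance is switched off
    return 'Yes' if np.isclose(arr, total, atol=tol, rtol=0).any() else 'No'

# Exact set membership misses a value that differs by rounding error
print((0.1 + 0.2) in {0.3})
print(tally_close(0.1 + 0.2, {0.3}))
```

Inside get_value, the line testing x['Total'] against the set could call this helper instead.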
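Thanks, but the look-up code raises an error: TypeError: 'str' object cannot be interpreted as an integer. KeyError: ('OOLU 6232016', 'occurred at index 1')
@leong I can't reproduce the error on my side. Can you check the dtypes of the values in both DataFrames, or the "OOLU 6232016" value?
df3 sample: TRIU 0783320 {GP, 70.0, 40FL, 48.0, 118.0}. Could it be that my df2 has two more string columns?
@leong Make sure you list every string column in df.columns.difference(['Cntr No', 'add_string_column']), because in the end we need a set made up only of the numeric values.
I have added the actual df1 and df2 columns to my post. Do I have to add all of them? Please see the update to my earlier post.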
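Instead of listing every string column by hand, the numeric columns can be selected directly with select_dtypes(include='number'). A minimal sketch follows. The df2 below is made up to resemble the df3 sample from the comment (string columns 'Sz' and 'Ty' mixed in with the costs), so treat the rows as illustrative only.

```python
import pandas as pd

# Hypothetical df2 with extra string columns (illustrative values only)
df2 = pd.DataFrame({
    'Cntr No': ['TRIU 0783320', 'TRIU 0783320'],
    'Sz': ['40FL', '40FL'],
    'Ty': ['GP', 'GP'],
    'Labour Cost': [70.0, 48.0],
    'Material Cost': [48.0, 0.0],
    'Amount in Estimate Currency': [118.0, 48.0],
})

# Keep only numeric columns, so strings such as 'GP' never reach the set
numeric_cols = df2.select_dtypes(include='number').columns.tolist()

df3 = (df2
       .groupby('Cntr No')[numeric_cols]
       .apply(lambda g: set(g.to_numpy().ravel())))
print(df3)
```

This way new string columns in df2 do not need to be added to any exclusion list.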
Cntr No object
Serviced By object
Location object
WO No object
WASH - CHEMICAL float64
PTI - CHILL float64
WASHING CONTAINER AGENT float64
WASH - CHEMICAL AGENT float64
WASHING CONTAINER -AGENT float64
BUNDLING/UNBUNDLING OF FR float64
PTI - AUTO float64
PTI float64
Struct Repair - Labour float64
Struct Repair - Material float64
Machy Repair - Labour float64
Total float64
Vendor object
Sz object
Ty object
CO object
WO Date object
WO ID object
Cntr No object
Equipment Size/type Group Code object
Labour Cost float64
Material Cost float64
Amount in Estimate Currency float64
Remarks object
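Given the column lists above, another option is to skip the dtype filtering and use only the three cost columns named in the question. The sketch below is a swapped-in technique (melt plus a left merge with indicator=True), not the groupby approach from the answer, and the df1/df2 rows are made-up illustrative values. Note that merging on a float key relies on exact equality.

```python
import pandas as pd

cols = ['Labour Cost', 'Material Cost', 'Amount in Estimate Currency']

# Hypothetical frames (illustrative values only)
df1 = pd.DataFrame({
    'Cntr No': ['OOLU 3868088', 'OOLU 3868088', 'TRIU 0625840', 'TRIU 1234567'],
    'Total': [12.0, 28.0, 11.0, 18.0],
})
df2 = pd.DataFrame({
    'Cntr No': ['OOLU 3868088', 'OOLU 3868088', 'TRIU 0625840', 'TRIU 1234567'],
    'Remarks': ['a', 'b', 'c', 'd'],
    'Labour Cost': [12.0, 28.0, 82.5, 16.0],
    'Material Cost': [0.0, 58.91, 0.0, 0.0],
    'Amount in Estimate Currency': [12.0, 87.81, 82.5, 16.0],
})

# Long format: one row per distinct (Cntr No, cost value) pair
pairs = (df2.melt(id_vars='Cntr No', value_vars=cols, value_name='Total')
            [['Cntr No', 'Total']]
            .drop_duplicates())

# Left merge keeps df1's rows and order; '_merge' flags rows that found a match
merged = df1.merge(pairs, on=['Cntr No', 'Total'], how='left', indicator=True)
df1['Tally_with_df2'] = ((merged['_merge'] == 'both')
                         .map({True: 'Yes', False: 'No'})
                         .to_numpy())
print(df1)
```

Because the pairs are de-duplicated, the merge cannot multiply rows of df1, so the result can be assigned back positionally.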