Pandas: compare multiple records between two DataFrames


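I am trying to compare "Cntr No" between df1 and df2. For a matching "Cntr No", the value in any one of the df2 columns ["Labour Cost", "Material Cost", "Amount in Estimate Currency"] must equal the Total in df1.

For example, df1 "OOLU 3868088" matches df2 "OOLU 3868088", and the df1 Total of 28 matches the df2 "Labour Cost" value of 28.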

df:

Expected output:

    Cntr No        Total    Tally_with_df2
0   OOLU 3868088    12.0    Yes
1   OOLU 3868088    28.0    Yes
2   OOLU 3868088    48.0    Yes
3   TRIU 0625840    119.0   Yes
4   TRIU 0625840    82.5    Yes
5   TRIU 0625840    11.0    No
6   TRIU 1234567    18.0    No

Code used: below is the code I tried, but it does not give the result I need.

cols = ['Labour Cost', 'Material Cost', 'Amount in Estimate Currency']

d = {k: set(v.values()) for k, v in
     df_co.set_index('Cntr No')[cols].to_dict(orient='index').items()}

df['Tally'] = [j in d.get(i, set()) for i, j in zip(df['Cntr No'], df['Total'])]
df['Tally'] = df['Tally'].map({True: 'Yes', False: 'No'})
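One likely reason the attempt above falls short: `to_dict(orient='index')` expects a unique index, while "Cntr No" appears on several rows of df2, so the values of a container are never pooled across all of its rows (recent pandas versions raise a ValueError here, as far as I know). A minimal sketch that builds the same kind of lookup dictionary per container with `groupby` instead. The sample frames are made up for illustration, since the real `df_co` and `df` were not shown in the post.

```python
import pandas as pd

cols = ['Labour Cost', 'Material Cost', 'Amount in Estimate Currency']

# Hypothetical sample data; the real df_co / df were not included in the post.
df_co = pd.DataFrame({
    'Cntr No': ['OOLU 3868088', 'OOLU 3868088', 'TRIU 0625840'],
    'Labour Cost': [12.0, 28.0, 82.5],
    'Material Cost': [0.0, 58.91, 0.0],
    'Amount in Estimate Currency': [12.0, 87.81, 119.0],
})
df = pd.DataFrame({
    'Cntr No': ['OOLU 3868088', 'OOLU 3868088', 'TRIU 0625840', 'TRIU 1234567'],
    'Total': [12.0, 28.0, 11.0, 18.0],
})

# One set per container, pooling the three cost columns over all of its rows.
d = {
    cntr: set(group[cols].to_numpy().ravel())
    for cntr, group in df_co.groupby('Cntr No')
}

# Same membership test as in the original attempt; unknown containers get an empty set.
tally = [total in d.get(cntr, set()) for cntr, total in zip(df['Cntr No'], df['Total'])]
df['Tally'] = pd.Series(tally, index=df.index).map({True: 'Yes', False: 'No'})

print(df)
```

The membership test uses exact float equality, so totals that differ only by rounding would count as a mismatch.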

df1:

df2:

IIUC, we can build grouped data from df2 with groupby, one entry for each unique Cntr No.

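import numpy as np

## this is grouped data: drop the string (object) columns, then pool the
## remaining numeric values of each container into one set
to_remove = df2.select_dtypes(['object']).columns.values.tolist()

df3 = (df2
       .groupby('Cntr No')
       .apply(lambda df: set(np.concatenate(df.loc[:, df.columns.difference(to_remove)].values))))

## df3 looks like this - sets are used for fast membership tests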
print(df3)

Cntr No
OOLU 3868088    {0.0, 12.0, 48.0, 87.81, 58.91, 28.0}
TRIU 0625840           {0.0, 12.0, 82.5, 54.0, 119.0}
TRIU 1234567                              {16.0, 0.0}


## this function ensures all cases are handled
def get_value(x, data):
    # container does not appear in df2 at all
    if x['Cntr No'] not in data.index:
        return 'Not Found'
    # container exists: check whether Total is among its pooled values
    if x['Total'] in data[x['Cntr No']]:
        return 'Yes'
    return 'No'

## next we do a simple look-up
df1['Tally_with_df2'] = df1.apply(lambda x: get_value(x, df3), axis=1)

print(df1)

        Cntr No  Total Tally_with_df2
0  OOLU 3868088   12.0            Yes
1  OOLU 3868088   28.0            Yes
2  OOLU 3868088   48.0            Yes
3  TRIU 0625840  119.0            Yes
4  TRIU 0625840   82.5            Yes
5  TRIU 0625840   11.0             No
6  TRIU 1234567   18.0             No
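For larger frames, a row-wise `apply` can be slow. As an alternative (not the approach used in the answer above), the same check can be sketched with `melt` and `merge`: reshape df2 into (Cntr No, value) pairs and left-merge df1 against them. The sample frames below are made up, and this version only returns Yes/No without a separate 'Not Found' case.

```python
import pandas as pd

cols = ['Labour Cost', 'Material Cost', 'Amount in Estimate Currency']

# Hypothetical sample data, for illustration only.
df1 = pd.DataFrame({
    'Cntr No': ['OOLU 3868088', 'TRIU 0625840', 'TRIU 0625840', 'TRIU 1234567'],
    'Total': [28.0, 119.0, 11.0, 18.0],
})
df2 = pd.DataFrame({
    'Cntr No': ['OOLU 3868088', 'TRIU 0625840', 'TRIU 1234567'],
    'Labour Cost': [28.0, 82.5, 16.0],
    'Material Cost': [58.91, 0.0, 0.0],
    'Amount in Estimate Currency': [87.81, 119.0, 16.0],
})

# Long format: one row per (container, cost value); duplicates dropped so the
# merge below cannot multiply rows of df1.
pairs = (
    df2.melt(id_vars='Cntr No', value_vars=cols, value_name='Total')
       .loc[:, ['Cntr No', 'Total']]
       .drop_duplicates()
)

# A left merge keeps df1's rows in order; '_merge' records whether a pair was found.
merged = df1.merge(pairs, on=['Cntr No', 'Total'], how='left', indicator=True)
df1['Tally_with_df2'] = (
    (merged['_merge'] == 'both').map({True: 'Yes', False: 'No'}).to_numpy()
)

print(df1)
```

Like the set lookup, the merge matches floats exactly, so rounding both sides first may be needed if the amounts come from calculations.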

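Thanks, but there is an error in the lookup code: TypeError: 'str' object cannot be interpreted as an integer. KeyError: ('OOLU 6232016', 'occurred at index 1')

@leong I can't see that error on my side. You could check the dtypes of the values in both dataframes, or the 'OOLU 6232016' value.

df3 sample: TRIU 0783320 {GP, 70.0, 40FL, 48.0, 118.0}. Maybe it is because my df2 has two more string columns?

@leong Make sure you add all the string columns to the list in df.columns.difference(['Cntr No', 'add_string_column']), because in the end we need a set of all the numeric values.

I have added the actual df1 and df2 columns to my post. Do I have to add all of them? Please see the update to my earlier post.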

Cntr No                       object
Serviced By                   object
Location                      object
WO No                         object
WASH - CHEMICAL              float64
PTI - CHILL                  float64
WASHING CONTAINER AGENT      float64
WASH - CHEMICAL AGENT        float64
WASHING CONTAINER -AGENT     float64
BUNDLING/UNBUNDLING OF FR    float64
PTI - AUTO                   float64
PTI                          float64
Struct Repair - Labour       float64
Struct Repair - Material     float64
Machy Repair - Labour        float64
Total                        float64
Vendor                        object
Sz                            object
Ty                            object
CO                            object
WO Date                       object
WO ID                         object

Cntr No                            object
Equipment Size/type Group Code     object
Labour Cost                       float64
Material Cost                     float64
Amount in Estimate Currency       float64
Remarks                            object
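With several object columns like the ones listed above, naming every string column by hand is easy to get wrong, and a missed one puts strings such as 'GP' or '40FL' into the sets. A small sketch that selects the numeric columns directly with `select_dtypes(include='number')`. The column names follow the df2 listing above, but the row values are made up.

```python
import pandas as pd

# Hypothetical df2 with the column layout listed above (values are made up).
df2 = pd.DataFrame({
    'Cntr No': ['TRIU 0783320', 'TRIU 0783320'],
    'Equipment Size/type Group Code': ['40FL', 'GP'],
    'Labour Cost': [48.0, 70.0],
    'Material Cost': [70.0, 48.0],
    'Amount in Estimate Currency': [118.0, 118.0],
    'Remarks': ['', ''],
})

# Keep only numeric columns, so no string column can leak into the sets.
numeric_cols = df2.select_dtypes(include='number').columns.tolist()

# One set of numeric values per container.
value_sets = {
    cntr: set(group[numeric_cols].to_numpy().ravel())
    for cntr, group in df2.groupby('Cntr No')
}

print(numeric_cols)
print(value_sets)
```

If a cost column was read in as object (for example because of stray text in a cell), it would be skipped here, so checking `df2.dtypes` first is still worthwhile.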
