Python 3.x: group by and compare multiple columns and rows

python-3.x pandas


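I have a csv with 600+ columns and thousands of rows. The original file contains more customers and departments, but this sample includes the key parts.

Note: I derived the Site column from the A_Loc1 and B_Loc1 columns to make it easier to compare and group rows, but this is not a requirement. If the groupby can be done without it, I'm open to other approaches.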

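I need to compare dates across different rows and columns based on Cust_ID and Site. For example, confirm that A_Date1 is earlier than B_Date1, but only for rows with the same Cust_ID and Site values.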

So, for Cust_ID 100 and Site CA2.2, A_Date1 is 8/1/2015 and B_Date1 is 6/15/2018:

if A_Date1 > B_Date1:
    df['Result'] = "Fail"
else:
    df['Result'] = ""

In the case above, nothing needs to be done, because A_Date1 is earlier than B_Date1.

However, for Cust_ID 100 and Site CA2.0, A_Date1 is 7/1/2019 and B_Date1 is 12/15/2018, so the Result column should be Fail for the Dep B rows where Site is CA2.0.

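I'm open to any efficient, flexible way of doing this. I also need to run other comparisons on different rows and columns, but this should get me started.

Expected result: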
+----+----------+-----------+-------+-------------+--------+-------------+-------------+-----------+----------+-----------+----------+------------+------------+-----------+------------+----------+-----------+
|    | Result   |   Cust_ID | Dep   |   Order_Num | Site   | Rec_Date1   | Rec_DateX   | A_Date1   | A_Loc1   | A_DateX   | B_Loc1   | B_Date1    | B_Date2    | B_DateX   | C_Date1    | C_Loc1   | C_DateX   |
|----+----------+-----------+-------+-------------+--------+-------------+-------------+-----------+----------+-----------+----------+------------+------------+-----------+------------+----------+-----------|
|  0 |          |       100 | A     |           1 | CA2.2  |             |             | 8/1/2015  | CA2.2    |           |          |            |            |           |            |          |           |
|  1 |          |       100 | A     |           2 | CA2.0  |             |             | 7/1/2019  | CA2.0    | 8/21/2019 |          |            |            |           |            |          |           |
|  2 |          |       100 | B     |           1 | CA2.2  |             |             |           |          |           | CA2.2    | 6/15/2018  | 6/15/2016  | 8/1/2019  |            |          |           |
|  3 | Fail     |       100 | B     |           2 | CA2.0  |             |             |           |          |           | CA2.0    | 12/15/2018 | 12/15/2016 |           |            |          |           |
|  4 | Fail     |       100 | B     |           3 | CA2.0  |             |             |           |          |           | CA2.0    | 12/15/2018 | 12/15/2016 | 8/21/2019 |            |          |           |
|  5 |          |       100 | C     |           1 | CA2.2  |             |             |           |          |           |          |            |            |           | 6/15/2016  | CA2.2    |           |
|  6 |          |       100 | C     |           2 | CA2.0  |             |             |           |          |           |          |            |            |           | 12/15/2017 | CA2.0    | 8/21/2019 |
|  7 |          |       100 | Rec   |             |        | 6/12/2019   | 8/1/2019    |           |          |           |          |            |            |           |            |          |           |
|  8 |          |       200 | A     |           1 | CA2.2  |             |             | 8/1/2015  | CA2.2    |           |          |            |            |           |            |          |           |
|  9 |          |       200 | A     |           2 | CA2.0  |             |             | 7/1/2015  | CA2.0    | 8/21/2019 |          |            |            |           |            |          |           |
| 10 |          |       200 | B     |           1 | CA2.2  |             |             |           |          |           | CA2.2    | 6/15/2018  | 6/15/2016  | 8/1/2019  |            |          |           |
| 11 |          |       200 | B     |           2 | CA2.0  |             |             |           |          |           | CA2.0    | 12/15/2018 | 12/15/2016 |           |            |          |           |
| 12 |          |       200 | B     |           3 | CA2.0  |             |             |           |          |           | CA2.0    | 12/15/2018 | 12/15/2016 | 8/21/2019 |            |          |           |
| 13 |          |       200 | C     |           1 | CA2.2  |             |             |           |          |           |          |            |            |           | 6/15/2016  | CA2.2    |           |
| 14 |          |       200 | C     |           2 | CA2.0  |             |             |           |          |           |          |            |            |           | 12/15/2017 | CA2.0    | 8/21/2019 |
| 15 |          |       200 | Rec   |             |        | 6/12/2019   | 8/1/2019    |           |          |           |          |            |            |           |            |          |           |
+----+----------+-----------+-------+-------------+--------+-------------+-------------+-----------+----------+-----------+----------+------------+------------+-----------+------------+----------+-----------+

What I've tried:
# Returns: ValueError: Length of values does not match length of index
df['Result'] = df.loc[df.A_Date1 < df.B_Date1].groupby(['Cust_ID','Site'],as_index=False)

# Returns: ValueError: Length of values does not match length of index
df["Result"] = df.loc[(((df["A_Date1"] != "N/A") 
               & (df["B_Date1"] != "N/A"))
               & (df.A_Date1 < df.B_Date1))].groupby([
               'Cust_ID','Site'],as_index=False)

# Returns: ValueError: unknown type str224
conditions = "(x['A_Date1'].notna()) & (x['B_Date1'].notna()) & (x['A_Date1'] < x['B_Date1'])"
df["Result"] = df.groupby(['Cust_ID','Site']).apply(lambda x: pd.eval(conditions))

# TypeError: incompatible index of inserted column with frame index
df = df[df.Dep != 'Rec']
df['Result'] = df.groupby(['Cust_ID','Site'],as_index = False).apply(lambda x: (x['A_Date1'].notna()) & (x['B_Date1'].notna()) & (x['A_Date1'] < x['B_Date1']))

# This produces FALSE for all rows
grouped_df = df.groupby(['Cust_ID','Site']).apply(lambda x: (x['A_Date1'].notna()) & (x['B_Date1'].notna()) & (x['A_Date1'] < x['B_Date1']))
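
One possible reading of why these attempts fail, offered as a sketch rather than a verified diagnosis: the first two assign a GroupBy object (not a column of values) to df['Result'], and the row-wise A_Date1 < B_Date1 comparisons never see both dates at once, because A_Date1 and B_Date1 are filled in on different rows. groupby(...).transform returns values aligned to the original index, so it can broadcast each group's A_Date1 onto every row of that group before comparing. The toy frame below is a hypothetical four-row cut of the sample data.

```python
import pandas as pd

# Hypothetical trimmed sample: A dates and B dates sit on different rows.
toy = pd.DataFrame({
    "Cust_ID": [100, 100, 100, 100],
    "Site": ["CA2.2", "CA2.0", "CA2.2", "CA2.0"],
    "A_Date1": pd.to_datetime(["8/1/2015", "7/1/2019", None, None]),
    "B_Date1": pd.to_datetime([None, None, "6/15/2018", "12/15/2018"]),
})

# transform("first") picks the first non-null A_Date1 per (Cust_ID, Site)
# and returns it with the same index as toy, so it can be compared row by row.
group_a = toy.groupby(["Cust_ID", "Site"])["A_Date1"].transform("first")

# Flag only rows that carry a B_Date1 earlier than their group's A_Date1.
toy["Result"] = ""
toy.loc[toy["B_Date1"].notna() & (group_a > toy["B_Date1"]), "Result"] = "Fail"

print(toy[["Cust_ID", "Site", "B_Date1", "Result"]])
```

Only the CA2.0 row that holds B_Date1 is flagged here, which matches the expected result for Cust_ID 100.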

I'm posting this solution in the hope of finding a more elegant, more scalable implementation:
import pandas as pd
import numpy as np
import os

data = [[100,'A','1','','','8/1/2015','CA2.2','','','','','','','',''],
        [100,'A','2','','','7/1/2019','CA2.0','8/21/2019','','','','','','',''],
        [100,'B','1','','','','','','CA2.2','6/15/2018','6/15/2016','8/1/2019','','',''],
        [100,'B','2','','','','','','CA2.0','12/15/2018','12/15/2016','','','',''],       
        [100,'B','3','','','','','','CA2.0','12/15/2018','12/15/2016','8/21/2019','','',''],
        [100,'C','1','','','','','','','','','','6/15/2016','CA2.2',''],
        [100,'C','2','','','','','','','','','','12/15/2017','CA2.0','8/21/2019'],
        [100,'Rec','','6/12/2019','8/1/2019','','','','','','','','','',''],
        [200,'A','1','','','8/1/2015','CA2.2','','','','','','','',''],
        [200,'A','2','','','7/1/2015','CA2.0','8/21/2019','','','','','','',''],
        [200,'B','1','','','','','','CA2.2','6/15/2018','6/15/2016','8/1/2019','','',''],
        [200,'B','2','','','','','','CA2.0','12/15/2018','12/15/2016','','','',''],       
        [200,'B','3','','','','','','CA2.0','12/15/2018','12/15/2016','8/21/2019','','',''],
        [200,'C','1','','','','','','','','','','6/15/2016','CA2.2',''],
        [200,'C','2','','','','','','','','','','12/15/2017','CA2.0','8/21/2019'],
        [200,'Rec','','6/12/2019','8/1/2019','','','','','','','','','','']]

df = pd.DataFrame(data,columns=['Cust_ID','Dep','Order_Num','Rec_Date1',
                                'Rec_DateX','A_Date1','A_Loc1','A_DateX',
                                'B_Loc1','B_Date1','B_Date2','B_DateX',
                                'C_Date1','C_Loc1','C_DateX'])

# replace blanks with np.NaN
df.replace(r"^\s*$", np.nan, regex=True, inplace=True)

## Convert all date columns to datetime, replace with NaN if error
df['A_Date1'] = pd.to_datetime(df['A_Date1'], errors ="coerce")
df['B_Date1'] = pd.to_datetime(df['B_Date1'], errors ="coerce")


# Add Site and Result column
df.insert(loc=4, column="Site", value=np.nan)
df.insert(loc=0, column="Result", value=np.nan)

# Populate Site column based on related column
df.loc[df["A_Loc1"].notna(), 
       "Site"] = df["A_Loc1"]

df.loc[df["B_Loc1"].notna(), 
       "Site"] = df["B_Loc1"]

df.loc[df["C_Loc1"].notna(), 
       "Site"] = df["C_Loc1"]

# groupby Cust_ID and Site, and fill A_Date1 forward and back
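# (transform keeps the result aligned to df's index, so the assignment works across pandas versions)
df['A_Date1'] = df.groupby(['Cust_ID','Site'], sort=False)['A_Date1'].transform(lambda x: x.ffill().bfill())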

# Perform comparison
df.loc[(((df["A_Date1"].notna()) & (df["B_Date1"].notna()))
        & ((df["A_Date1"]) > (df["B_Date1"]))), 
       "Result"] = "Fail"

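Since other comparisons are needed as well, the same idea could be wrapped in a small helper that takes a list of column pairs and leaves the original date columns untouched. This is only a sketch under assumptions: flag_pairs and its parameters are hypothetical names that do not appear in the question, and it assumes every comparison has the form "the group's first non-null value of one date column must not be later than another date column on the same row". Site is derived here with bfill(axis=1) across the Loc columns instead of three separate .loc assignments.

```python
import pandas as pd

def flag_pairs(df, pairs, keys=("Cust_ID", "Site"), label="Fail"):
    """Return a Series holding `label` wherever, for any (first_col, second_col)
    pair, the group's first non-null first_col is later than the row's second_col."""
    result = pd.Series("", index=df.index)
    grouped = df.groupby(list(keys))
    for first_col, second_col in pairs:
        # Broadcast the group-level date onto every row of the group.
        first_in_group = grouped[first_col].transform("first")
        bad = df[second_col].notna() & (first_in_group > df[second_col])
        result[bad] = label
    return result

# Hypothetical cut of the sample data (Site CA2.0 only, both customers).
sample = pd.DataFrame({
    "Cust_ID": [100, 100, 200, 200],
    "Dep": ["A", "B", "A", "B"],
    "A_Loc1": ["CA2.0", None, "CA2.0", None],
    "B_Loc1": [None, "CA2.0", None, "CA2.0"],
    "A_Date1": ["7/1/2019", None, "7/1/2015", None],
    "B_Date1": [None, "12/15/2018", None, "12/15/2018"],
})

# Parse the date columns; unparseable values become NaT.
for col in ["A_Date1", "B_Date1"]:
    sample[col] = pd.to_datetime(sample[col], errors="coerce")

# Site is the first non-null location found across the Loc columns.
sample["Site"] = sample[["A_Loc1", "B_Loc1"]].bfill(axis=1).iloc[:, 0]

sample["Result"] = flag_pairs(sample, [("A_Date1", "B_Date1")])
print(sample[["Cust_ID", "Dep", "Site", "Result"]])
```

Further checks would be added by extending the pairs list, for example with ("A_Date1", "B_Date2"), provided those columns have been converted to datetime first.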
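For Cust_ID 100 and Site CA2.2, how is B_Date1 7/1/2019? Isn't that the date for Site CA2.0, or am I misreading?

@Dan, thanks. That was a typo. Just corrected it.

What do you mean by 50 columns?

@Ben.T, as I mentioned, I also need to make other comparisons. Rather than writing out all the columns, I'm looking for an efficient method.

@m8_u got that part, but do you mean that the other comparisons are still between dates, and that the column names follow a pattern?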