Python 3.x: group by and compare multiple columns and rows
I have a csv with more than 600 columns and thousands of rows. The original file contains more customers and departments, but this sample covers the key parts.

Note: I derived the Site column from the A_Loc1 and B_Loc1 columns to make it easier to compare and group rows, but it is not a requirement. If the groupby can be done without it, I am open to other approaches.

I need to compare dates across different rows and columns, based on Cust_ID and Site. For example, confirm that A_Date1 is earlier than B_Date1, but only for rows with the same Cust_ID and Site values.

So, for Cust_ID 100 and Site CA2.2, A_Date1 is 8/1/2015 and B_Date1 is 6/15/2018:
if A_Date1 > B_Date1:
    df['Result'] = "Fail"
else:
    result = ""
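In the case above nothing needs to be done, because A_Date1 is earlier than B_Date1.

However, for Cust_ID 100 and Site CA2.0, A_Date1 is 7/1/2019 and B_Date1 is 12/15/2018, so the Result column of the Dep B rows with Site CA2.0 should be Fail.

I am open to any efficient, flexible way of doing this. I also need to run other comparisons across different rows and columns, but this should get me started.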
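The rule described above can be sketched on a four-row frame. This is only an illustration under assumptions: the frame contains just the Cust_ID 100 values quoted in the question, the helper column name A_ref is made up for the example, and a merge is one of several ways to bring the A row's date next to the B row's date.

```python
import pandas as pd

# Tiny illustrative frame (values from the Cust_ID 100 example in the question).
df = pd.DataFrame({
    "Cust_ID": [100, 100, 100, 100],
    "Dep": ["A", "A", "B", "B"],
    "Site": ["CA2.2", "CA2.0", "CA2.2", "CA2.0"],
    "A_Date1": ["8/1/2015", "7/1/2019", None, None],
    "B_Date1": [None, None, "6/15/2018", "12/15/2018"],
})
for col in ["A_Date1", "B_Date1"]:
    df[col] = pd.to_datetime(df[col], format="%m/%d/%Y")

# One reference A_Date1 per (Cust_ID, Site), taken from the rows that carry it.
# "A_ref" is a hypothetical helper name, not a column of the real file.
a_dates = (
    df.dropna(subset=["A_Date1"])
      .groupby(["Cust_ID", "Site"], as_index=False)["A_Date1"]
      .min()
      .rename(columns={"A_Date1": "A_ref"})
)

# Attach the reference date to every row of the same group.
# A left merge keeps the row order of the left frame.
merged = df.merge(a_dates, on=["Cust_ID", "Site"], how="left")

# Comparisons against NaT evaluate to False, so rows without B_Date1 are skipped.
fail = merged["A_ref"] > merged["B_Date1"]
df["Result"] = fail.map({True: "Fail", False: ""}).to_numpy()

print(df[["Cust_ID", "Dep", "Site", "Result"]])
```

Only the Dep B row for Site CA2.0 ends up flagged, which matches the rule stated in the question.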
Expected result:
+----+----------+-----------+-------+-------------+--------+-------------+-------------+-----------+----------+-----------+----------+------------+------------+-----------+------------+----------+-----------+
| | Result | Cust_ID | Dep | Order_Num | Site | Rec_Date1 | Rec_DateX | A_Date1 | A_Loc1 | A_DateX | B_Loc1 | B_Date1 | B_Date2 | B_DateX | C_Date1 | C_Loc1 | C_DateX |
|----+----------+-----------+-------+-------------+--------+-------------+-------------+-----------+----------+-----------+----------+------------+------------+-----------+------------+----------+-----------|
| 0 | | 100 | A | 1 | CA2.2 | | | 8/1/2015 | CA2.2 | | | | | | | | |
| 1 | | 100 | A | 2 | CA2.0 | | | 7/1/2019 | CA2.0 | 8/21/2019 | | | | | | | |
| 2 | | 100 | B | 1 | CA2.2 | | | | | | CA2.2 | 6/15/2018 | 6/15/2016 | 8/1/2019 | | | |
| 3 | Fail | 100 | B | 2 | CA2.0 | | | | | | CA2.0 | 12/15/2018 | 12/15/2016 | | | | |
| 4 | Fail | 100 | B | 3 | CA2.0 | | | | | | CA2.0 | 12/15/2018 | 12/15/2016 | 8/21/2019 | | | |
| 5 | | 100 | C | 1 | CA2.2 | | | | | | | | | | 6/15/2016 | CA2.2 | |
| 6 | | 100 | C | 2 | CA2.0 | | | | | | | | | | 12/15/2017 | CA2.0 | 8/21/2019 |
| 7 | | 100 | Rec | | | 6/12/2019 | 8/1/2019 | | | | | | | | | | |
| 8 | | 200 | A | 1 | CA2.2 | | | 8/1/2015 | CA2.2 | | | | | | | | |
| 9 | | 200 | A | 2 | CA2.0 | | | 7/1/2015 | CA2.0 | 8/21/2019 | | | | | | | |
| 10 | | 200 | B | 1 | CA2.2 | | | | | | CA2.2 | 6/15/2018 | 6/15/2016 | 8/1/2019 | | | |
| 11 | | 200 | B | 2 | CA2.0 | | | | | | CA2.0 | 12/15/2018 | 12/15/2016 | | | | |
| 12 | | 200 | B | 3 | CA2.0 | | | | | | CA2.0 | 12/15/2018 | 12/15/2016 | 8/21/2019 | | | |
| 13 | | 200 | C | 1 | CA2.2 | | | | | | | | | | 6/15/2016 | CA2.2 | |
| 14 | | 200 | C | 2 | CA2.0 | | | | | | | | | | 12/15/2017 | CA2.0 | 8/21/2019 |
| 15 | | 200 | Rec | | | 6/12/2019 | 8/1/2019 | | | | | | | | | | |
+----+----------+-----------+-------+-------------+--------+-------------+-------------+-----------+----------+-----------+----------+------------+------------+-----------+------------+----------+-----------+
What I have tried:
# Returns: ValueError: Length of values does not match length of index
df['Result'] = df.loc[df.A_Date1 < df.B_Date1].groupby(['Cust_ID','Site'],as_index=False)
# Returns: ValueError: Length of values does not match length of index
df["Result"] = df.loc[(((df["A_Date1"] != "N/A")
                        & (df["B_Date1"] != "N/A"))
                       & (df.A_Date1 < df.B_Date1))].groupby(
    ['Cust_ID','Site'],as_index=False)
# Returns: ValueError: unknown type str224
conditions = "(x['A_Date1'].notna()) & (x['B_Date1'].notna()) & (x['A_Date1'] < x['B_Date1'])"
df["Result"] = df.groupby(['Cust_ID','Site']).apply(lambda x: pd.eval(conditions))
# TypeError: incompatible index of inserted column with frame index
df = df[df.Dep != 'Rec']
df['Result'] = df.groupby(['Cust_ID','Site'],as_index = False).apply(lambda x: (x['A_Date1'].notna()) & (x['B_Date1'].notna()) & (x['A_Date1'] < x['B_Date1']))
# This produces FALSE for all rows
grouped_df = df.groupby(['Cust_ID','Site']).apply(lambda x: (x['A_Date1'].notna()) & (x['B_Date1'].notna()) & (x['A_Date1'] < x['B_Date1']))
I am posting this solution in the hope of finding a more elegant and more scalable implementation.
import pandas as pd
import numpy as np
import os
data = [[100,'A','1','','','8/1/2015','CA2.2','','','','','','','',''],
        [100,'A','2','','','7/1/2019','CA2.0','8/21/2019','','','','','','',''],
        [100,'B','1','','','','','','CA2.2','6/15/2018','6/15/2016','8/1/2019','','',''],
        [100,'B','2','','','','','','CA2.0','12/15/2018','12/15/2016','','','',''],
        [100,'B','3','','','','','','CA2.0','12/15/2018','12/15/2016','8/21/2019','','',''],
        [100,'C','1','','','','','','','','','','6/15/2016','CA2.2',''],
        [100,'C','2','','','','','','','','','','12/15/2017','CA2.0','8/21/2019'],
        [100,'Rec','','6/12/2019','8/1/2019','','','','','','','','','',''],
        [200,'A','1','','','8/1/2015','CA2.2','','','','','','','',''],
        [200,'A','2','','','7/1/2015','CA2.0','8/21/2019','','','','','','',''],
        [200,'B','1','','','','','','CA2.2','6/15/2018','6/15/2016','8/1/2019','','',''],
        [200,'B','2','','','','','','CA2.0','12/15/2018','12/15/2016','','','',''],
        [200,'B','3','','','','','','CA2.0','12/15/2018','12/15/2016','8/21/2019','','',''],
        [200,'C','1','','','','','','','','','','6/15/2016','CA2.2',''],
        [200,'C','2','','','','','','','','','','12/15/2017','CA2.0','8/21/2019'],
        [200,'Rec','','6/12/2019','8/1/2019','','','','','','','','','','']]
df = pd.DataFrame(data,columns=['Cust_ID','Dep','Order_Num','Rec_Date1',
                                'Rec_DateX','A_Date1','A_Loc1','A_DateX',
                                'B_Loc1','B_Date1','B_Date2','B_DateX',
                                'C_Date1','C_Loc1','C_DateX'])
# Replace blank (empty or whitespace-only) strings with NaN
df.replace(r"^\s*$", np.nan, regex=True, inplace=True)
# Convert the two date columns being compared to datetime; unparseable values become NaT
df['A_Date1'] = pd.to_datetime(df['A_Date1'], errors="coerce")
df['B_Date1'] = pd.to_datetime(df['B_Date1'], errors="coerce")
# Add Site and Result columns (object dtype, because they will hold strings)
df.insert(loc=4, column="Site", value=pd.Series(index=df.index, dtype="object"))
df.insert(loc=0, column="Result", value=pd.Series(index=df.index, dtype="object"))
# Populate Site from whichever location column is filled in on the row
df.loc[df["A_Loc1"].notna(), "Site"] = df["A_Loc1"]
df.loc[df["B_Loc1"].notna(), "Site"] = df["B_Loc1"]
df.loc[df["C_Loc1"].notna(), "Site"] = df["C_Loc1"]
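# Group by Cust_ID and Site, then fill A_Date1 forward and backward within each group.
# group_keys=False keeps the original row index so the result can be assigned back to df.
df['A_Date1'] = df.groupby(['Cust_ID','Site'], sort=False, group_keys=False)['A_Date1'].apply(lambda x: x.ffill().bfill())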
# Perform the comparison: flag rows where both dates exist and A_Date1 is later than B_Date1
df.loc[df["A_Date1"].notna() & df["B_Date1"].notna()
       & (df["A_Date1"] > df["B_Date1"]),
       "Result"] = "Fail"
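One side effect of the solution above is that the forward/backward fill writes A_Date1 into the B and C rows, whereas the expected result leaves those cells blank. A possible variation, sketched here under assumptions, keeps the group's date in a temporary Series instead. The function name flag_later_than and the small demo frame are hypothetical, and the demo deliberately has no missing Site values.

```python
import numpy as np
import pandas as pd

def flag_later_than(df, keys, earlier_col, later_col, label="Fail"):
    """Return an object Series holding `label` on rows where the group's
    earlier_col value is later than the row's later_col, NaN elsewhere."""
    # "first" picks the first non-null value of each group; transform
    # broadcasts it onto every row of that group without touching df.
    ref = df.groupby(keys)[earlier_col].transform("first")
    bad = ref.notna() & df[later_col].notna() & (ref > df[later_col])
    out = pd.Series(np.nan, index=df.index, dtype="object")
    out.loc[bad] = label
    return out

# Hypothetical demo frame with the same column names as the question.
demo = pd.DataFrame({
    "Cust_ID": [100, 100, 100, 100, 100, 200, 200],
    "Site": ["CA2.2", "CA2.0", "CA2.2", "CA2.0", "CA2.0", "CA2.0", "CA2.0"],
    "A_Date1": pd.to_datetime(
        ["2015-08-01", "2019-07-01", None, None, None, "2015-07-01", None]),
    "B_Date1": pd.to_datetime(
        [None, None, "2018-06-15", "2018-12-15", "2018-12-15", None, "2018-12-15"]),
})
demo["Result"] = flag_later_than(demo, ["Cust_ID", "Site"], "A_Date1", "B_Date1")
print(demo)
```

Because the column pair is passed as arguments, other comparisons could reuse the same call with different column names, and A_Date1 itself stays unchanged.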
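For Cust_ID 100 and Site CA2.2, how is B_Date1 set to 7/1/2019? Isn't that the date for Site CA2.0, or am I misreading?
@Dan, thanks. That was a typo. Just corrected it.
What do you mean by 50 columns?
@Ben.T, as I mentioned, I also need to make other comparisons. Rather than writing out all the columns, I am looking for an efficient approach.
@m8_u got that part, but do you mean that the other comparisons are still between dates, and that the column names follow a pattern?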
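If the column names do follow a pattern, the date conversion would not have to be written out once per column. The sketch below assumes a naming rule that the thread does not confirm (every column whose name contains "_Date" holds a date in month/day/year form) and uses a made-up two-row frame.

```python
import pandas as pd

# Made-up sample using the question's column naming style.
raw = pd.DataFrame({
    "Cust_ID": [100, 100],
    "A_Date1": ["8/1/2015", ""],
    "B_Date1": ["", "6/15/2018"],
    "B_Loc1": ["", "CA2.2"],
})

# Assumed rule: any column whose name contains "_Date" is a date column.
date_cols = [c for c in raw.columns if "_Date" in c]

# errors="coerce" turns blanks and unparseable strings into NaT.
for col in date_cols:
    raw[col] = pd.to_datetime(raw[col], format="%m/%d/%Y", errors="coerce")

print(raw.dtypes)
```

With 600+ columns, selecting by name like this keeps the conversion step the same length no matter how many date columns the file has.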