Python: find customer ids that did not exist before


I want to filter for customer_id values that did not exist in the earlier data, i.e. all new_customer_id values that are new on 2020-01-10 and did not exist on 2020-01-01.

Main df

date          customer_id   amount_spent 
2020-01-01    24            123
2020-01-10    24            145
2020-01-01    58             89
2020-01-10    58             67
2020-01-01    98             34
2020-01-10    99             86
2020-01-10    67            140
2020-01-10    32            321
2020-01-10    75             76

Output

new_customer_id  amount_spent 
32           321
75            76
67           140

I tried using the shift function in pandas, but it did not work for me.

Edit

import pandas as pd

df = pd.DataFrame([["2020-01-01",24,123],
["2020-01-10",24,145],
["2020-01-01",58,89],
["2020-01-10",58,67],
["2020-01-01",98,34],
["2020-01-10",98,86],
["2020-01-10",67,140],
["2020-01-10",32,321],
["2020-01-10",75,76]],columns = ["date","customer_id","amount_spent" ])

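IIUC, you can take the customer_id values from 2020-01-01 and then filter them out. Note that the output below was produced from the table as posted, where row 5 has customer 99; with 98 in that row, as in the Edit, row 5 is filtered out as well: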

s = df.loc[df["date"]=="2020-01-01", "customer_id"]

print (df[~df["customer_id"].isin(s)])

         date  customer_id  amount_spent
5  2020-01-10           99            86
6  2020-01-10           67           140
7  2020-01-10           32           321
8  2020-01-10           75            76
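
For reference, here is a minimal sketch of the same isin idea that also reproduces the two-column layout asked for in the question. It assumes the Edit version of the data (customer 98 on both dates) and restricts the result to the 2020-01-10 rows; the names `seen`, `new_rows` and `result` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [["2020-01-01", 24, 123], ["2020-01-10", 24, 145],
     ["2020-01-01", 58, 89], ["2020-01-10", 58, 67],
     ["2020-01-01", 98, 34], ["2020-01-10", 98, 86],
     ["2020-01-10", 67, 140], ["2020-01-10", 32, 321],
     ["2020-01-10", 75, 76]],
    columns=["date", "customer_id", "amount_spent"],
)

# customer_ids that already appear on the earlier date
seen = df.loc[df["date"] == "2020-01-01", "customer_id"]

# rows on the later date whose customer_id was not seen earlier
new_rows = df[(df["date"] == "2020-01-10") & ~df["customer_id"].isin(seen)]

# rename the id column to match the layout requested in the question
result = (new_rows[["customer_id", "amount_spent"]]
          .rename(columns={"customer_id": "new_customer_id"})
          .reset_index(drop=True))
print(result)
```

The explicit date condition is redundant when there are only two dates, but it keeps the filter correct if later dates are added to the frame.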

Here is another solution

import numpy as np

# True where the customer_id occurs exactly once in the whole frame
mask = df.groupby('customer_id').transform(np.size).eq(1)

    date  amount_spent
0  False         False
1  False         False
2  False         False
3  False         False
4   True          True
5   True          True
6   True          True
7   True          True
8   True          True

df[mask['date'] & df.date.eq('2020-01-10')]


Assuming there is a typo in your example (99 should be 98), you can do the following:

df = pd.DataFrame([["2020-01-01",24,123],
["2020-01-10",24,145],
["2020-01-01",58,89],
["2020-01-10",58,67],
["2020-01-01",98,34],
["2020-01-10",98,86],
["2020-01-10",67,140],
["2020-01-10",32,321],
["2020-01-10",75,76]],columns = ["date","customer_id","amount_spent" ])

df["order"] = df.groupby("customer_id").cumcount()

df[(df["date"] == "2020-01-10") & (df["order"]==0)]

Output:

    date        customer_id amount_spent    order
6   2020-01-10  67          140             0
7   2020-01-10  32          321             0
8   2020-01-10  75          76              0

Depending on how complex your df is, this may need adjusting.
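
One adjustment that may be needed: cumcount numbers rows in the order they appear in the frame, so the approach above relies on each customer's earlier rows coming first. A small sketch, assuming a deliberately shuffled subset of the sample rows, that sorts by date before counting (the names `ordered` and `first_seen_late` are illustrative):

```python
import pandas as pd

# A shuffled subset of the sample rows: later dates appear before earlier ones
df = pd.DataFrame(
    [["2020-01-10", 24, 145], ["2020-01-01", 24, 123],
     ["2020-01-10", 67, 140], ["2020-01-10", 98, 86],
     ["2020-01-01", 98, 34]],
    columns=["date", "customer_id", "amount_spent"],
)
df["date"] = pd.to_datetime(df["date"])

# Sort chronologically so that cumcount() == 0 marks each customer's first appearance
ordered = df.sort_values("date", kind="stable")
ordered["order"] = ordered.groupby("customer_id").cumcount()

# Customers whose first appearance falls on 2020-01-10
first_seen_late = ordered[(ordered["date"] == pd.Timestamp("2020-01-10"))
                          & (ordered["order"] == 0)]
print(first_seen_late)
```

Without the sort, customer 24 would get order 0 on its 2020-01-10 row here and be reported as new.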

This does what you asked for. I am not sure your sample data and output are what you intended, so I changed customer 99 on 2020-01-10 to 98.

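  • Create a mask that marks the rows before/after the required date
  • Select the rows on or after the switch date, minus the customers that existed before the switch date, using isin()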
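
The two steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the switch date is taken to be 2020-01-10, the data is the version with 99 changed to 98, and the names `cutoff`, `before` and `existing` are made up for the example.

```python
import pandas as pd

df = pd.DataFrame(
    [["2020-01-01", 24, 123], ["2020-01-10", 24, 145],
     ["2020-01-01", 58, 89], ["2020-01-10", 58, 67],
     ["2020-01-01", 98, 34], ["2020-01-10", 98, 86],
     ["2020-01-10", 67, 140], ["2020-01-10", 32, 321],
     ["2020-01-10", 75, 76]],
    columns=["date", "customer_id", "amount_spent"],
)
df["date"] = pd.to_datetime(df["date"])

cutoff = pd.Timestamp("2020-01-10")  # assumed switch date

# Step 1: mask of rows strictly before the switch date
before = df["date"] < cutoff

# Step 2: rows on or after the switch date, minus customers seen before it
existing = df.loc[before, "customer_id"]
new_customers = df[~before & ~df["customer_id"].isin(existing)]
print(new_customers)
```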

If you are looking for a general approach, this could be a solution:

    df = pd.DataFrame({
        'date':['2020-01-01','2020-01-10','2020-01-01','2020-01-10','2020-01-01','2020-01-10','2020-01-10','2020-01-10','2020-01-10'],
        'customer_id':[24,24,58,58,98,99,67,32,75],
        'amount_spent':[123,145,89,67,34,86,140,321,76]
    })
    print(df)
             date  customer_id  amount_spent
    0  2020-01-01           24           123
    1  2020-01-10           24           145
    2  2020-01-01           58            89
    3  2020-01-10           58            67
    4  2020-01-01           98            34
    5  2020-01-10           99            86
    6  2020-01-10           67           140
    7  2020-01-10           32           321
    8  2020-01-10           75            76
    
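You need the last two dates, because your dataset may look different and you may not know in advance which dates to look for. So first, find the last two dates: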

    df=df.sort_values(by='date')
    take_last_dates = df.drop_duplicates(subset='date').sort_values(by='date')
    take_last_dates = take_last_dates.date.tolist()[-2:]  # keep only the two most recent dates
    print(take_last_dates)
    ['2020-01-01', '2020-01-10']
    
Now create two DataFrames, one per date, to compare the customers:

    df_prev = df[
        df.date==take_last_dates[0]
    ]
    print(df_prev)
             date  customer_id  amount_spent
    0  2020-01-01           24           123
    2  2020-01-01           58            89
    4  2020-01-01           98            34
    df_current = df[
        df.date==take_last_dates[1]
    ]
    
    print(df_current)
             date  customer_id  amount_spent
    1  2020-01-10           24           145
    3  2020-01-10           58            67
    5  2020-01-10           99            86
    6  2020-01-10           67           140
    7  2020-01-10           32           321
    8  2020-01-10           75            76
    
Finally, you can get the result from these two DataFrames:

    new_customers = df_current[
        ~df_current.customer_id.isin(df_prev.customer_id.tolist())
    ]
    
    print(new_customers)
             date  customer_id  amount_spent
    5  2020-01-10           99            86
    6  2020-01-10           67           140
    7  2020-01-10           32           321
    8  2020-01-10           75            76
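
As a variant of the general approach, one could compute each customer's first-seen date and keep the rows where it equals the latest date. This is a sketch, not a drop-in replacement: unlike the code above it compares against all earlier dates rather than only the previous one, and the names `latest` and `first_seen` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2020-01-01', '2020-01-10', '2020-01-01', '2020-01-10', '2020-01-01',
             '2020-01-10', '2020-01-10', '2020-01-10', '2020-01-10'],
    'customer_id': [24, 24, 58, 58, 98, 99, 67, 32, 75],
    'amount_spent': [123, 145, 89, 67, 34, 86, 140, 321, 76],
})
df['date'] = pd.to_datetime(df['date'])

# Most recent date in the data
latest = df['date'].max()

# Earliest date on which each row's customer appears
first_seen = df.groupby('customer_id')['date'].transform('min')

# Customers whose first appearance is on the latest date
new_customers = df[(df['date'] == latest) & (first_seen == latest)]
print(new_customers)
```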
    

Does this answer your question?

No, because I only need to filter the rows that did not exist on the previous date.