SQL: flag a person if they previously bought the same product, or previously bought any product


Situation:

create table #table1 (email varchar(20), productname varchar(20), datepurchased date)
insert into #table1 values
('abc@gmail.com','cucumber','2019-02-01'),
('abc@gmail.com','orange','2019-02-04'),
('abc@gmail.com','grapefruit','2019-02-15'),
('cde@gmail.com','blackberry','2019-02-06'),
('cde@gmail.com','lime','2019-02-15'),
('cde@gmail.com','lime','2019-02-20'),
('zzz@gmail.com','apple','2019-02-02'),
('zzz@gmail.com','apple','2019-02-18'),
('zzz@gmail.com','orange','2019-02-19'),
('zzz@gmail.com','apple','2019-02-28')
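
I need to add two flag columns that identify the following:

  • Whether the person bought the same product before the purchase date
  • Whether the person bought any product before the purchase date

The output should have 5 columns:

  • Email
  • ProductName
  • DatePurchased
  • SameProduct (0 = no, 1 = yes)
  • AnyProduct (0 = no, 1 = yes)

The raw data looks like this: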

    abc@gmail.com   cucumber    01-02-2019
    abc@gmail.com   orange      04-02-2019
    abc@gmail.com   grapefruit  15-02-2019
    cde@gmail.com   blackberry  06-02-2019
    cde@gmail.com   lime        15-02-2019
    cde@gmail.com   lime        20-02-2019
    zzz@gmail.com   apple       02-02-2019
    zzz@gmail.com   apple       18-02-2019
    zzz@gmail.com   orange      19-02-2019
    zzz@gmail.com   apple       28-02-2019
    
Goal: my output should look like this:

    Email           ProductName DatePurchased   SameProduct     AnyProduct
    abc@gmail.com   cucumber    01-02-2019      0               0
    abc@gmail.com   orange      04-02-2019      0               1
    abc@gmail.com   grapefruit  15-02-2019      0               1
    cde@gmail.com   blackberry  06-02-2019      0               0
    cde@gmail.com   lime        15-02-2019      0               1
    cde@gmail.com   lime        20-02-2019      1               1
    zzz@gmail.com   apple       02-02-2019      0               0
    zzz@gmail.com   apple       18-02-2019      1               1
    zzz@gmail.com   orange      19-02-2019      0               1
    zzz@gmail.com   apple       28-02-2019      1               1

What I tried: I tried joining the table to itself twice and using CASE statements, but that approach feels extremely inefficient.


Note: my actual data has more than 1 million rows. I'm not sure what type of query would process the data as fast as possible.

One approach is to use the count window function or row_number:

    -- count: a running count greater than 1 means an earlier row exists
    select t.*
           ,case when count(*) over(partition by email, productname order by datepurchased) > 1 then 1 else 0 end as same_prev
           ,case when count(*) over(partition by email order by datepurchased) > 1 then 1 else 0 end as any_prev
    from #table1 t;

    -- row_number: any row after the first in its partition has an earlier row
    select t.*
           ,case when row_number() over(partition by email, productname order by datepurchased) > 1 then 1 else 0 end as same_prev
           ,case when row_number() over(partition by email order by datepurchased) > 1 then 1 else 0 end as any_prev
    from #table1 t;
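
One caveat worth checking before relying on the count(*) version: when a window has an ORDER BY but no explicit frame, SQL Server uses RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which also counts rows tied on datepurchased. Two purchases on the same day would then both be flagged. The sketch below uses a hypothetical table variable @demo (not part of the original question) to compare the default frame with an explicit ROWS frame, which behaves like the row_number version.

```sql
-- Hypothetical demo data: the same product bought twice on the same day.
DECLARE @demo TABLE (email varchar(20), productname varchar(20), datepurchased date);
INSERT INTO @demo VALUES
('abc@gmail.com', 'orange', '2019-02-04'),
('abc@gmail.com', 'orange', '2019-02-04');

SELECT d.*,
       -- Default frame (RANGE): tied rows count each other, so both rows get 1.
       CASE WHEN COUNT(*) OVER (PARTITION BY email, productname
                                ORDER BY datepurchased) > 1
            THEN 1 ELSE 0 END AS same_prev_range,
       -- Explicit ROWS frame: only rows up to the current one are counted,
       -- so one of the tied rows gets 0 and the other gets 1.
       CASE WHEN COUNT(*) OVER (PARTITION BY email, productname
                                ORDER BY datepurchased
                                ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) > 1
            THEN 1 ELSE 0 END AS same_prev_rows
FROM @demo AS d;
```

Whether a same-day purchase should count as "previous" depends on the requirement. If it matters, a tiebreaker column in the ORDER BY would make the result deterministic.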
    

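My solution uses LAG() and ROW_NUMBER().

LAG() looks back at an earlier purchase by the same person. ROW_NUMBER() is only used to flag the first purchase (ROW_NUMBER = 1).

Of course, the PARTITION BY and ORDER BY clauses are important for getting the records in the right order.

I also checked Vamsi Prabhala's solution, but IIF seems to perform much faster than CASE WHEN. Note, though, that SQL Server documents IIF as shorthand for a CASE expression, so a real difference between the two would be surprising.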

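    SELECT email
          ,productname
          ,datepurchased
          -- Partition by productname as well: LAG() over email alone only sees the
          -- immediately preceding purchase and misses apple -> orange -> apple.
          ,IIF(LAG(productname) OVER (PARTITION BY email, productname ORDER BY email, datepurchased) = productname, 1,0) AS SameProduct
          ,IIF(ROW_NUMBER() OVER (PARTITION BY email ORDER BY email, datepurchased) = 1, 0, 1) AS AnyProduct
      FROM #table1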
    

Here is another option to get the result.

I use ROW_NUMBER() - 1 so that the first occurrence gets a value of zero. Then I use SIGN() to convert any positive value to 1.

    SELECT *,
        SameProduct = SIGN(ROW_NUMBER() OVER(PARTITION BY email, productname ORDER BY datepurchased)-1),
        AnyProduct  = SIGN(ROW_NUMBER() OVER(PARTITION BY email ORDER BY datepurchased)-1)
    FROM #table1
    ORDER BY email, datepurchased;
    
If you prefer, you can cast to bit instead. This gives the same result as SIGN(), but only because all values in this case are zero or positive.

    SELECT *,
        SameProduct = CAST(ROW_NUMBER() OVER(PARTITION BY email, productname ORDER BY datepurchased)-1 AS bit),
        AnyProduct  = CAST(ROW_NUMBER() OVER(PARTITION BY email ORDER BY datepurchased)-1 AS bit)
    FROM #table1
    ORDER BY email, datepurchased;
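
To illustrate what the cast does compared with SIGN(), the sketch below runs a few literal values (made up for illustration) through both. Converting any nonzero number to bit yields 1, while SIGN() returns -1 for negative input, so the two only agree when the values are zero or positive, as ROW_NUMBER() - 1 always is.

```sql
-- Illustrative values only; v and x are hypothetical names.
SELECT v,
       SIGN(v)        AS sign_result,  -- -1, 0 or 1 depending on the sign of v
       CAST(v AS bit) AS bit_result    -- 0 for zero, 1 for any nonzero value
FROM (VALUES (0), (1), (3), (-2)) AS x(v);
```

The two columns match for 0, 1 and 3. For -2, SIGN() gives -1 while the cast gives 1. The cast also returns the bit type, which may suit a flag column better than an integer.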
    

I would use row_number().

Note that the only difference is in row_number().

You can also do this without a comparison:

    select t.*,
           coalesce(max(1) over (partition by email, productname order by datepurchased rows between unbounded preceding and 1 preceding), 0) as same_product,
           coalesce(max(1) over (partition by email order by datepurchased rows between unbounded preceding and 1 preceding), 0) as any_product
    from table1 t
    order by email, datepurchased;
    

Here is a dbfiddle.

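Won't these window functions be very slow for millions of rows? There is only one index, on the email column.

You can try creating a new index that includes the datepurchased column and compare the different execution plans.

If you partition by email, there is no need to order by email. The email will always be the same within each partition.

@LuisCazares You are right. It's just a habit of mine. As for the execution plan, there should be no difference, should there?

Can you explain what you mean by casting?

I edited the answer to give an example of how to use CAST instead of SIGN.

But what does it do? What is the purpose?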
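
Following up on the index suggestion in the comments, a minimal sketch might look like the following. The index names are hypothetical. Whether the optimizer actually uses these indexes to avoid a sort should be confirmed by comparing execution plans on the real data.

```sql
-- Hypothetical index names; #table1 is the temp table from the question.
-- Key order matches PARTITION BY email, productname ORDER BY datepurchased.
CREATE NONCLUSTERED INDEX ix_table1_email_product_date
    ON #table1 (email, productname, datepurchased);

-- Key order matches PARTITION BY email ORDER BY datepurchased;
-- productname is included so the index covers all three columns.
CREATE NONCLUSTERED INDEX ix_table1_email_date
    ON #table1 (email, datepurchased)
    INCLUDE (productname);
```

Each index is aimed at one of the two window definitions. Maintaining both adds write cost, so keep only the ones that improve the measured plan.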