SQL: flag a person if they previously purchased the same product, or if they purchased any product
Tags: sql, sql-server
Scenario:
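I need to add two flag columns that identify:
- whether the person purchased the same product before the purchase date
- whether the person purchased any other product before the purchase date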
Email          ProductName  DatePurchased
abc@gmail.com  cucumber     01-02-2019
abc@gmail.com  orange       04-02-2019
abc@gmail.com  grapefruit   15-02-2019
cde@gmail.com  blackberry   06-02-2019
cde@gmail.com  lime         15-02-2019
cde@gmail.com  lime         20-02-2019
zzz@gmail.com  apple        02-02-2019
zzz@gmail.com  apple        18-02-2019
zzz@gmail.com  orange       19-02-2019
zzz@gmail.com  apple        28-02-2019
Goal:
My output would look like this:
Email          ProductName  DatePurchased  SameProduct  AnyProduct
abc@gmail.com  cucumber     01-02-2019     0            0
abc@gmail.com  orange       04-02-2019     0            1
abc@gmail.com  grapefruit   15-02-2019     0            1
cde@gmail.com  blackberry   06-02-2019     0            0
cde@gmail.com  lime         15-02-2019     0            1
cde@gmail.com  lime         20-02-2019     1            1
zzz@gmail.com  apple        02-02-2019     0            0
zzz@gmail.com  apple        18-02-2019     1            1
zzz@gmail.com  orange       19-02-2019     0            1
zzz@gmail.com  apple        28-02-2019     1            1
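What I tried:
I tried joining the table to itself twice and using CASE statements, but I feel this approach is extremely inefficient.
Dummy data: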
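The attempted query itself was not posted. As a rough sketch only, assuming the #table1 dummy table from this question, a double self-join of the kind described might look like the following. The aliases s and a and the GROUP BY collapse are illustrative choices, not the asker's actual code.

```sql
-- Sketch of the self-join approach (assumed shape, not the asker's query).
-- s finds earlier purchases of the same product; a finds any earlier purchase.
SELECT t.email,
       t.productname,
       t.datepurchased,
       MAX(CASE WHEN s.email IS NOT NULL THEN 1 ELSE 0 END) AS SameProduct,
       MAX(CASE WHEN a.email IS NOT NULL THEN 1 ELSE 0 END) AS AnyProduct
FROM #table1 AS t
LEFT JOIN #table1 AS s
       ON s.email = t.email
      AND s.productname = t.productname
      AND s.datepurchased < t.datepurchased
LEFT JOIN #table1 AS a
       ON a.email = t.email
      AND a.datepurchased < t.datepurchased
-- The joins fan out to one row per (s, a) pair, so collapse back to one row per purchase.
GROUP BY t.email, t.productname, t.datepurchased
ORDER BY t.email, t.datepurchased;
```

Each purchase is matched against every earlier purchase by the same email, so the intermediate row count grows quickly for customers with many purchases, which is consistent with the inefficiency noted above.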
create table #table1 (email varchar(20), productname varchar(20), datepurchased date)
insert into #table1 values
('abc@gmail.com','cucumber','2019-02-01'),
('abc@gmail.com','orange','2019-02-04'),
('abc@gmail.com','grapefruit','2019-02-15'),
('cde@gmail.com','blackberry','2019-02-06'),
('cde@gmail.com','lime','2019-02-15'),
('cde@gmail.com','lime','2019-02-20'),
('zzz@gmail.com','apple','2019-02-02'),
('zzz@gmail.com','apple','2019-02-18'),
('zzz@gmail.com','orange','2019-02-19'),
('zzz@gmail.com','apple','2019-02-28')
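Note: my actual data has more than 1 million rows. I'm not sure what type of query would process the data as fast as possible.

One approach is to use the COUNT window function or ROW_NUMBER: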
--count: running count per partition; a value above 1 means an earlier row exists
select t.*
,case when count(*) over(partition by email,productname order by datepurchased) > 1 then 1 else 0 end as same_prev
,case when count(*) over(partition by email order by datepurchased) > 1 then 1 else 0 end as any_prev
from #table1 t;
--row_number: the first row in each partition is numbered 1
select t.*
,case when row_number() over(partition by email,productname order by datepurchased) > 1 then 1 else 0 end as same_prev
,case when row_number() over(partition by email order by datepurchased) > 1 then 1 else 0 end as any_prev
from #table1 t;
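My solution uses LAG() and ROW_NUMBER(). ROW_NUMBER() is only used to flag the first purchase (ROW_NUMBER = 1). Of course, the PARTITION BY and ORDER BY clauses are important for getting the records in the correct order.
I also checked Vamsi Prabhala's solution, but IIF seems to perform much faster than CASE WHEN.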
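SELECT email
,productname
,datepurchased
-- LAG within (email, productname) is NULL only for the first purchase of that product;
-- comparing with just the immediately preceding row would miss a sequence like apple, orange, apple
,IIF(LAG(datepurchased) OVER (PARTITION BY email, productname ORDER BY datepurchased) IS NULL, 0, 1) AS SameProduct
,IIF(ROW_NUMBER() OVER (PARTITION BY email ORDER BY email, datepurchased) = 1, 0, 1) AS AnyProduct
FROM #table1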
Here is one more option to get the result.
I use ROW_NUMBER() - 1 so that the first occurrence gets a value of zero. Then I use SIGN() to convert any positive value to 1.
SELECT *,
SameProduct = SIGN(ROW_NUMBER() OVER(PARTITION BY email, productname ORDER BY datepurchased)-1),
AnyProduct = SIGN(ROW_NUMBER() OVER(PARTITION BY email ORDER BY datepurchased)-1)
FROM #table1
ORDER BY email, datepurchased;
If you prefer, you can CAST to bit instead. This gives the same result as SIGN(), but only because none of the values here are negative.
SELECT *,
SameProduct = CAST(ROW_NUMBER() OVER(PARTITION BY email, productname ORDER BY datepurchased)-1 AS bit),
AnyProduct = CAST(ROW_NUMBER() OVER(PARTITION BY email ORDER BY datepurchased)-1 AS bit)
FROM #table1
ORDER BY email, datepurchased;
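To make the difference between SIGN() and a cast to bit concrete, here is a small standalone sketch. The inline values and the column names n, sign_value and bit_value are made up for illustration.

```sql
-- SIGN returns -1, 0, or 1; CAST(... AS bit) returns 0 for zero and 1 for any non-zero value.
SELECT v.n,
       SIGN(v.n)        AS sign_value,
       CAST(v.n AS bit) AS bit_value
FROM (VALUES (-3), (0), (1), (5)) AS v(n);
-- n = -3: sign_value = -1, bit_value = 1
-- n =  0: sign_value =  0, bit_value = 0
-- n =  1: sign_value =  1, bit_value = 1
-- n =  5: sign_value =  1, bit_value = 1
```

Because ROW_NUMBER() - 1 is never negative, the two expressions agree for this problem. They diverge only for negative inputs.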
I would use row_number():
Note that the only difference is the row_number()
You can also do this without a comparison:
select t.*,
coalesce(max(1) over (partition by email, productname order by datepurchased rows between unbounded preceding and 1 preceding), 0) as same_product,
coalesce(max(1) over (partition by email order by datepurchased rows between unbounded preceding and 1 preceding), 0) as any_product
from table1 t
order by email, datepurchased;
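Here is a dbfiddle.

Comments:
- Wouldn't these window functions be very slow for millions of rows? There is only one index, on the email column.
- You could try creating a new index that includes the datepurchased column and compare the execution plans.
- If you partition by email, there is no need to order by email. The email will always be the same within each partition.
- @LuisCazares You are right. It is just a habit of mine. As for the execution plan, there should be no difference, should there?
- Can you explain what you mean by the cast?
- I edited the answer to give an example of how to use CAST instead of SIGN.
- But what does it do? What is the purpose?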
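As a sketch of the index suggestion from the comments, assuming the #table1 sample table from the question: the index names below are hypothetical, and whether either index actually helps depends on the real table and its existing indexes, so compare the execution plans before and after.

```sql
-- Hypothetical index names; #table1 is the sample table from the question.
-- Matches PARTITION BY email, productname ORDER BY datepurchased,
-- so rows can be read already ordered for that window.
CREATE NONCLUSTERED INDEX IX_table1_email_product_date
    ON #table1 (email, productname, datepurchased);

-- Matches PARTITION BY email ORDER BY datepurchased;
-- productname is included so the index covers all three columns.
CREATE NONCLUSTERED INDEX IX_table1_email_date
    ON #table1 (email, datepurchased) INCLUDE (productname);

-- Report I/O and timing for the queries run afterwards, to compare plans.
SET STATISTICS IO, TIME ON;
```

When an index already delivers rows in partition and order sequence, the plan can often avoid a separate Sort operator, which tends to be the expensive part of a window function over a large table.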