SQL: counting values in a column using conditions and partitions
Tags: sql, datetime, count, teradata, window-functions
I have the following dataset, which contains an animal and its vaccine dates. Using SQL, I am trying to calculate, for each record, how many vaccines the pet received in the previous 90, 180, and 365 days. I can calculate this in Excel: paste the table below into cell A1 and put the formula =COUNTIFS(C:C,">="&C2-90,C:C,"<"&C2,A:A,A2) in cell D2. You can change 90 to 180 and 365 for the other two columns.
Animal Visit_ID Vaccine_Date Count_90 Count_180 Count_365
Cat 1 7/22/2017 0 0 0
Cat 2 8/1/2017 1 1 1
Cat 3 8/14/2017 2 2 2
Cat 4 8/23/2017 3 3 3
Cat 5 9/11/2017 4 4 4
Cat 6 9/30/2017 5 5 5
Cat 7 10/11/2017 6 6 6
Cat 8 10/23/2017 6 7 7
Cat 9 10/31/2017 6 8 8
Cat 10 11/6/2017 7 9 9
Cat 11 11/17/2017 7 10 10
Cat 12 11/29/2017 7 11 11
Cat 13 12/11/2017 7 12 12
Cat 14 12/25/2017 8 13 13
Cat 15 1/2/2018 8 14 14
Cat 16 1/29/2018 7 13 15
Cat 17 2/22/2018 5 12 16
Cat 18 3/9/2018 5 13 17
Cat 19 3/21/2018 5 13 18
Cat 20 4/13/2018 4 12 19
Cat 21 5/21/2018 4 9 20
Cat 22 8/27/2018 0 4 17
Cat 23 9/18/2018 1 3 17
Cat 24 10/3/2018 2 4 17
Cat 25 12/19/2018 1 3 11
Cat 26 12/22/2018 2 4 12
Cat 27 1/6/2019 2 5 11
Cat 28 1/30/2019 3 6 11
Cat 29 3/10/2019 4 6 10
Cat 30 3/26/2019 3 6 10
Cat 31 4/17/2019 3 6 10
Cat 32 5/13/2019 3 7 11
Cat 33 5/18/2019 4 8 12
Cat 34 5/25/2019 5 9 12
Cat 35 6/17/2019 5 10 13
Cat 36 7/2/2019 5 9 14
Cat 37 7/12/2019 6 9 15
Cat 38 8/2/2019 6 9 16
Cat 39 8/15/2019 6 10 17
Cat 40 8/27/2019 5 11 18
Cat 41 9/9/2019 6 11 18
Cat 42 9/17/2019 6 12 19
Cat 43 9/26/2019 7 12 19
Cat 44 10/9/2019 7 13 19
Cat 45 10/19/2019 7 13 20
Cat 46 11/12/2019 7 13 21
Cat 47 11/15/2019 7 13 22
Cat 48 11/26/2019 7 13 23
Cat 49 12/20/2019 6 13 23
Cat 50 12/31/2019 6 13 23
Cat 51 2/14/2020 3 11 22
Cat 52 3/8/2020 3 10 23
Cat 53 4/6/2020 2 9 22
Cat 54 5/5/2020 3 8 22
Cat 55 5/23/2020 3 7 21
Cat 56 6/18/2020 3 6 20
Cat 57 6/30/2020 4 6 21
Cat 58 7/16/2020 4 7 20
Cat 59 7/22/2020 5 8 21
Dog 1 3/8/2018 0:00 0 0 0
Dog 2 4/18/2019 0:00 0 0 0
Dog 3 7/1/2019 0:00 1 1 1
Dog 4 12/12/2019 0:00 0 1 2
Dog 5 12/23/2019 0:00 1 2 3
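The rule behind the expected columns can be written down as a small reference check. This is a minimal Python sketch, not part of the original question: the function name `count_prior` and the list `dog_visits` are made up for illustration, and the window is assumed to be half-open (the lower bound `date - N` is included, the current date is excluded), which is what the sample values imply.

```python
from datetime import date, timedelta

def count_prior(visits, days):
    """For each (animal, visit_date) row, count the same animal's earlier
    visits that fall in the half-open range [visit_date - days, visit_date)."""
    counts = []
    for animal, visit_date in visits:
        lower = visit_date - timedelta(days=days)
        counts.append(sum(
            1 for other_animal, other_date in visits
            if other_animal == animal and lower <= other_date < visit_date
        ))
    return counts

# The five Dog rows from the table above (time part dropped).
dog_visits = [
    ("Dog", date(2018, 3, 8)),
    ("Dog", date(2019, 4, 18)),
    ("Dog", date(2019, 7, 1)),
    ("Dog", date(2019, 12, 12)),
    ("Dog", date(2019, 12, 23)),
]

for days in (90, 180, 365):
    print(days, count_prior(dog_visits, days))
# 90 [0, 0, 1, 0, 1]
# 180 [0, 0, 1, 1, 2]
# 365 [0, 0, 1, 2, 3]
```

The printed lists match the Count_90, Count_180, and Count_365 columns of the Dog rows, so any SQL solution can be checked against the same rule.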
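However, when I try to do this in SQL with the code below, it only looks at the previous row and adds to it. It does not seem to count backwards from each vaccine date the way the Excel formula above does, and I don't know how to work in a window function that counts past vaccine dates based on the 90, 180, and 365 day intervals while also partitioning by animal.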
select
qry3.Animal
,qry3.Vaccine_Date
,(case when qry3.Count_90 = 0 then 0 else row_number() over (partition by qry3.Animal, qry3.Count_90_2 order by qry3.animal_rank) - 1 end) as Admit_90
,(case when qry3.Count_180 = 0 then 0 else row_number() over (partition by qry3.Animal, qry3.Count_180_2 order by qry3.animal_rank) - 1 end) as Admit_180
,(case when qry3.Count_365 = 0 then 0 else row_number() over (partition by qry3.Animal, qry3.Count_365_2 order by qry3.animal_rank) - 1 end) as Admit_365
from
(
select
qry2.Animal
,qry2.Vaccine_Date
,qry2.animal_rank
,qry2.Count_90
,qry2.Count_180
,qry2.Count_365
,sum(case when qry2.Count_90 = 0 then 1 else 0 end) over(partition by qry2.Animal order by qry2.animal_rank rows between unbounded preceding and current row) as Count_90_2
,sum(case when qry2.Count_180 = 0 then 1 else 0 end) over(partition by qry2.Animal order by qry2.animal_rank rows between unbounded preceding and current row) as Count_180_2
,sum(case when qry2.Count_365 = 0 then 1 else 0 end) over(partition by qry2.Animal order by qry2.animal_rank rows between unbounded preceding and current row) as Count_365_2
from
(
select
qry1.Animal
,qry1.Vaccine_Date
,qry1.animal_Rank
,case when qry1.Vaccine_Date-qry1.Previous_Vaccine_Date < 90 then 1 else 0 end as Count_90
,case when qry1.Vaccine_Date-qry1.Previous_Vaccine_Date < 180 then 1 else 0 end as Count_180
,case when qry1.Vaccine_Date-qry1.Previous_Vaccine_Date < 365 then 1 else 0 end as Count_365
from
(
select
a.Animal
,a.Vaccine_Date
,b.Vaccine_Date as Previous_Vaccine_Date
,row_number() over (partition by null order by A.Animal,a.Vaccine_Date) as animal_Rank
from Animal_Vaccine a
left join Animal_Vaccine b on a.Visit_ID = b.Visit_ID - 1
) as qry1
) as qry2
) as qry3
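I was able to solve this by joining the table back to itself on animal, with the joined vaccine date falling between the current date minus the interval and the current date, and then counting the distinct visit IDs. I had completely overcomplicated it, but I am still open to a window-function solution because I like them.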
select
a.Animal
,a.Vaccine_Date
,a.visit_id
,count(distinct b.visit_id) - 1 as Count_90 -- minus 1 so the row does not count itself
,count(distinct c.visit_id) - 1 as Count_180 -- minus 1 so the row does not count itself
,count(distinct d.visit_id) - 1 as Count_365 -- minus 1 so the row does not count itself
from Animal_Vaccine a
left join Animal_Vaccine b on a.Animal = b.Animal and b.Vaccine_date between a.Vaccine_Date - 90 and a.Vaccine_Date
left join Animal_Vaccine c on a.Animal = c.Animal and c.Vaccine_date between a.Vaccine_Date - 180 and a.Vaccine_Date
left join Animal_Vaccine d on a.Animal = d.Animal and d.Vaccine_date between a.Vaccine_Date - 365 and a.Vaccine_Date
group by 1,2,3
;
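Your approach is fine as long as the number of rows per animal is low. If it grows, CPU usage will explode and your DBA will call you :-) Your query can be simplified to a single product join instead of three, provided the combination of animal and visit_id is unique: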
select
a.Animal
,a.Vaccine_Date
,a.visit_id
,count(case when b.Vaccine_date between a.Vaccine_Date - 90 and a.Vaccine_Date then b.visit_id end) as Count_90
,count(case when b.Vaccine_date between a.Vaccine_Date - 180 and a.Vaccine_Date then b.visit_id end) as Count_180
,count(b.visit_id) as Count_365
from vt a
left join vt b
on a.Animal = b.Animal
and b.Vaccine_date between a.Vaccine_Date - 365 and a.Vaccine_Date - 1 -- don't include current row
group by 1,2,3
Performance can be improved if Animal is the Primary Index; otherwise you could create a volatile table with that index.
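To try the single-join, conditional-aggregation idea without a Teradata system, here is a minimal sketch that runs the same shape of query in SQLite through Python's `sqlite3` module. Several things here are assumptions for illustration only: SQLite stands in for Teradata, the table `vt` is rebuilt with an integer column `day` (the date's ordinal) so that `day - 90` behaves like Teradata's `DATE - 90`, and only the five Dog rows are loaded.

```python
import sqlite3
from datetime import date

# The five Dog rows from the question: (animal, visit_id, vaccine date).
rows = [
    ("Dog", 1, date(2018, 3, 8)),
    ("Dog", 2, date(2019, 4, 18)),
    ("Dog", 3, date(2019, 7, 1)),
    ("Dog", 4, date(2019, 12, 12)),
    ("Dog", 5, date(2019, 12, 23)),
]

con = sqlite3.connect(":memory:")
con.execute("create table vt (animal text, visit_id integer, day integer)")
con.executemany(
    "insert into vt values (?, ?, ?)",
    [(animal, visit_id, d.toordinal()) for animal, visit_id, d in rows],
)

# One self-join limited to the widest window (365 days, earlier rows only);
# the narrower windows are counted with CASE inside COUNT.
query = """
select a.animal, a.visit_id,
       count(case when b.day >= a.day - 90  then b.visit_id end) as count_90,
       count(case when b.day >= a.day - 180 then b.visit_id end) as count_180,
       count(b.visit_id) as count_365
from vt a
left join vt b
  on a.animal = b.animal
 and b.day between a.day - 365 and a.day - 1
group by a.animal, a.visit_id
order by a.animal, a.visit_id
"""
result = con.execute(query).fetchall()
for row in result:
    print(row)
# ('Dog', 1, 0, 0, 0)
# ('Dog', 2, 0, 0, 0)
# ('Dog', 3, 1, 1, 1)
# ('Dog', 4, 0, 1, 2)
# ('Dog', 5, 1, 2, 3)
```

Because the join already excludes the current row, no `- 1` correction and no `DISTINCT` are needed, which is the simplification the answer describes.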
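Depending on the number of rows per animal and the range of dates, it might be better to create the missing dates with EXPAND ON and then use ROWS instead of RANGE: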
select animal, Visit_ID, Vaccine_date
-- EXPAND ON returns one row per day with repeated data; the CASE effectively NULLs the added rows
,count(case when valid_from = Vaccine_date then 1 end) over (partition by animal order by Vaccine_date rows between 90 preceding and 1 preceding)
,count(case when valid_from = Vaccine_date then 1 end) over (partition by animal order by Vaccine_date rows between 180 preceding and 1 preceding)
,count(case when valid_from = Vaccine_date then 1 end) over (partition by animal order by Vaccine_date rows between 365 preceding and 1 preceding)
from
( -- create the missing dates to get 1 row per animal/day
select animal, Visit_ID, begin(pd) as Vaccine_date, valid_from
from
(
select animal, Visit_ID
,cast(Vaccine_date as date) as valid_from -- seems to be a Timestamp
-- get the next row's Vaccine_date for EXPAND ON in following step
/* -- LAG/LEAD not supported in TD 15.10
,lead(Vaccine_date,1,Vaccine_date+1)
over (partition by animal
order by Vaccine_date) as valid_to
*/ -- workaround for LEAD
,coalesce(min(Vaccine_date)
over (partition by animal
order by Vaccine_date
rows between 1 following and 1 following)
,Vaccine_date +1) as valid_to -- replace the last row's NULL with a valid end date
from vt
) as prepare_data
expand on period(valid_from, valid_to) as pd
) as expand_data
-- now remove the added dates again
qualify valid_from = Vaccine_date
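The idea behind filling in the missing dates is that once there is exactly one row per calendar day, "the previous 90 days" and "the previous 90 rows" mean the same thing, so a ROWS frame can stand in for a RANGE frame. A minimal Python sketch of that idea for a single animal follows; the function name `counts_by_dense_days` is hypothetical, and the input is assumed to be sorted dates with no duplicates.

```python
from datetime import date

def counts_by_dense_days(visit_dates, days):
    """visit_dates: sorted, unique dates for one animal.
    Builds one slot per calendar day (1 = real visit, 0 = filler day) and
    counts the 1s in the `days` slots immediately before each visit."""
    first = visit_dates[0]
    span = (visit_dates[-1] - first).days + 1
    flags = [0] * span
    for d in visit_dates:
        flags[(d - first).days] = 1
    counts = []
    for d in visit_dates:
        i = (d - first).days
        # Same span as ROWS BETWEEN <days> PRECEDING AND 1 PRECEDING.
        counts.append(sum(flags[max(0, i - days):i]))
    return counts

dog_dates = [
    date(2018, 3, 8),
    date(2019, 4, 18),
    date(2019, 7, 1),
    date(2019, 12, 12),
    date(2019, 12, 23),
]
print(counts_by_dense_days(dog_dates, 90))   # [0, 0, 1, 0, 1]
print(counts_by_dense_days(dog_dates, 365))  # [0, 0, 1, 2, 3]
```

The cost is proportional to the number of calendar days covered rather than the number of visits, which is why the answer ties this approach to the row count and the date range.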
The EXPAND ON query assumes there is only one row per animal per day; otherwise you will get an error message during the expand step. In that case you have to aggregate the data in the prepare step first.
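Unfortunately, Teradata does not support RANGE window frames, so you cannot do what you want with window functions. The solution suggested by dnoeth, using one left join and conditional aggregation, is a valid one. However, I wonder whether correlated subqueries would perform better, since they need no outer aggregation at all: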
select
v.*,
(select count(*) from vt v1 where v1.animal = v.animal and v1.vaccine_date >= v.vaccine_date - 90 and v1.vaccine_date < v.vaccine_date) count_90,
(select count(*) from vt v1 where v1.animal = v.animal and v1.vaccine_date >= v.vaccine_date - 180 and v1.vaccine_date < v.vaccine_date) count_180,
(select count(*) from vt v1 where v1.animal = v.animal and v1.vaccine_date >= v.vaccine_date - 365 and v1.vaccine_date < v.vaccine_date) count_365
from vt v
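Each scalar subquery above counts the rows of one animal that fall in the half-open range [date - N, date). When an animal's dates are available in sorted order, that count is just the difference of two binary-search positions, which is the access pattern an ordered index on (animal, vaccine_date) would serve. A minimal Python sketch using the standard `bisect` module; the function name and data are illustrative, and this says nothing about how Teradata actually executes the query.

```python
from bisect import bisect_left
from datetime import date, timedelta

def count_in_window(sorted_dates, current, days):
    """Count dates d with current - days <= d < current via two binary searches."""
    lower = current - timedelta(days=days)
    return bisect_left(sorted_dates, current) - bisect_left(sorted_dates, lower)

dog_dates = [
    date(2018, 3, 8),
    date(2019, 4, 18),
    date(2019, 7, 1),
    date(2019, 12, 12),
    date(2019, 12, 23),
]
counts_180 = [count_in_window(dog_dates, d, 180) for d in dog_dates]
print(counts_180)  # [0, 0, 1, 1, 2]
```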
For this correlated-subquery version, make sure you have an index on (animal, vaccine_date).
What is your Teradata version?
It looks like I am on 15.10.
The number of rows per animal seems to be low; otherwise this triple product join would consume a lot of CPU :-)
It should be better than those three joins plus DISTINCT, but because Teradata is a massively parallel system, scalar correlated subqueries with non-equality conditions are also hard to optimize. The index won't help unless it is a materialized view, and he is probably not allowed to create an index on that table.