Hive: multiple counts (with and without DISTINCT) produce wrong output
I tried this Hive query:
Select id,count(distinct CASE WHEN unix_timestamp(m_date) BETWEEN unix_timestamp(cast(date_sub(cast('2017-02-01' as date),60) as date)) AND unix_timestamp(cast('2017-02-01' as date)) THEN m_date ELSE 0 END)
,count(CASE WHEN unix_timestamp(m_date) BETWEEN unix_timestamp(cast(date_sub(cast('2017-02-01' as date),60) as date)) AND unix_timestamp(cast('2017-02-01' as date)) THEN m_date ELSE 0 END)
From DB.TABLE2 GROUP BY id limit 10;
It gives me this output:
111007001007633 1 1
111007001029793 1 1
111007001000521 1 11
111007001000794 1 1
111007001000273 3 13
111007001001032 1 1
111007001025874 1 4
111007001001792 1 7
111007001029181 1 1
111007001000141 16 96
But when I add more counts:
Select id,count(distinct CASE WHEN unix_timestamp(m_date) BETWEEN unix_timestamp(cast(date_sub(cast('2017-02-01' as date),60) as date)) AND unix_timestamp(cast('2017-02-01' as date)) THEN m_date ELSE 0 END)
,count(CASE WHEN unix_timestamp(m_date) BETWEEN unix_timestamp(cast(date_sub(cast('2017-02-01' as date),60) as date)) AND unix_timestamp(cast('2017-02-01' as date)) THEN m_date ELSE 0 END)
,count(distinct CASE WHEN unix_timestamp(m_date) BETWEEN unix_timestamp(cast(date_sub(cast('2017-02-01' as date),15) as date)) AND unix_timestamp(cast('2017-02-01' as date)) THEN m_date ELSE 0 END)
,count(CASE WHEN unix_timestamp(m_date) BETWEEN unix_timestamp(cast(date_sub(cast('2017-02-01' as date),15) as date)) AND unix_timestamp(cast('2017-02-01' as date)) THEN m_date ELSE 0 END)
From DB.TABLE2 GROUP BY id limit 10;
it returns wrong results like these:
111007001010439 0 0 1 0
111007001026963 0 0 1 0
111007001028001 0 0 1 0
111007001032987 0 0 1 0
111007001048710 0 0 1 0
111007001052415 0 0 1 0
111007002008374 0 0 1 0
111007003000644 0 0 1 0
111007003002210 0 0 1 0
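I am working on a Hadoop cluster, and I don't know whether this is caused by MapReduce.
Thanks
[Edit]
As I said in my reply to @pashaz's comment, the first problem is the result of two otherwise identical counts (with and without DISTINCT), which give 1 for the distinct count and 0 for the non-distinct one.
The second problem is the relationship between the two DISTINCT counts and between the two non-DISTINCT counts. If you check the timestamps, you'll see that the first two counts include the second two: the first two count the events between '2017-02-01' and 60 days earlier, while the second two count the events between '2017-02-01' and 15 days earlier.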
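The expectations described above can be made easier to check with a small variation of the query. This is only a sketch: the window conditions are copied unchanged from the question, the only change is that the `ELSE 0` branch is removed, and the column aliases are invented here for readability. `COUNT(expr)` and `COUNT(DISTINCT expr)` skip NULLs, so without an `ELSE` branch rows outside the window are not counted, whereas `ELSE 0` turns them into a non-NULL value that is counted. This does not by itself explain the `0 0 1 0` rows, but it removes one source of confusion when comparing the columns.

```sql
-- Sketch only: same table and window conditions as in the question.
-- Aliases (distinct_60, events_60, ...) are hypothetical names.
-- With no ELSE branch, CASE yields NULL outside the window,
-- and COUNT / COUNT(DISTINCT) ignore NULLs.
SELECT id,
       COUNT(DISTINCT CASE WHEN unix_timestamp(m_date)
                 BETWEEN unix_timestamp(cast(date_sub(cast('2017-02-01' as date), 60) as date))
                     AND unix_timestamp(cast('2017-02-01' as date))
            THEN m_date END) AS distinct_60,
       COUNT(CASE WHEN unix_timestamp(m_date)
                 BETWEEN unix_timestamp(cast(date_sub(cast('2017-02-01' as date), 60) as date))
                     AND unix_timestamp(cast('2017-02-01' as date))
            THEN m_date END) AS events_60,
       COUNT(DISTINCT CASE WHEN unix_timestamp(m_date)
                 BETWEEN unix_timestamp(cast(date_sub(cast('2017-02-01' as date), 15) as date))
                     AND unix_timestamp(cast('2017-02-01' as date))
            THEN m_date END) AS distinct_15,
       COUNT(CASE WHEN unix_timestamp(m_date)
                 BETWEEN unix_timestamp(cast(date_sub(cast('2017-02-01' as date), 15) as date))
                     AND unix_timestamp(cast('2017-02-01' as date))
            THEN m_date END) AS events_15
FROM DB.TABLE2
GROUP BY id
LIMIT 10;
```

For every id one would then expect distinct_60 <= events_60, distinct_15 <= events_15, distinct_15 <= distinct_60 and events_15 <= events_60, since the 15-day window lies inside the 60-day window.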
[Update]
If I add a WHERE clause, it works:
WHERE id="111007001007633" OR id="271011604404359" OR id="122213250512607" OR id="111007001033217"
111007001033217 0 0 0 0 0 0
122213250512607 1 3 8 14 0 0
271011604404359 12 21 26 42 5 9
111007001007633 14 19 24 34 5 5
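The LIMIT clause seems to be the issue.
There is nothing wrong with the results you posted. Both queries say "limit 10", and nothing guarantees that the same ids will be returned.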
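A minimal sketch of the point made in the comment above: without an ORDER BY, `LIMIT 10` may return any ten groups, so two runs (or two different queries) need not show the same ids. Adding `ORDER BY id` before the limit makes the two result sets comparable. The query below reuses the table and the 60-day condition from the question; the aliases are hypothetical.

```sql
-- Sketch only: ORDER BY makes LIMIT pick the same ten ids every time,
-- so the output of different queries can be compared row by row.
SELECT id,
       COUNT(DISTINCT CASE WHEN unix_timestamp(m_date)
                 BETWEEN unix_timestamp(cast(date_sub(cast('2017-02-01' as date), 60) as date))
                     AND unix_timestamp(cast('2017-02-01' as date))
            THEN m_date ELSE 0 END) AS distinct_60,
       COUNT(CASE WHEN unix_timestamp(m_date)
                 BETWEEN unix_timestamp(cast(date_sub(cast('2017-02-01' as date), 60) as date))
                     AND unix_timestamp(cast('2017-02-01' as date))
            THEN m_date ELSE 0 END) AS events_60
FROM DB.TABLE2
GROUP BY id
ORDER BY id
LIMIT 10;
```

Restricting both queries to a fixed list of ids with a WHERE clause, as done in the update, achieves the same thing for a handful of known ids.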
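In the first query the result shows "111007001007633", while the second query does not show it.
Does the second query return these results (0, 0, 1, 0) for every row? What happens if you run the second query against one of the rows for which the first query returned a "valid" result, for example 111007001000141?
@Andrew I don't know, I'll check and give you the result as soon as possible.
Check the query! Each time, the first count is the distinct one and the second is the same expression without DISTINCT, so getting 1 and then 0 is not correct...
I'm not sure what is wrong.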