SQL: Collapsing rows in Hive while keeping the non-null values
I have a table in Hive where the athr_name and post_date fields are 90% null (shown in Hive as "?"). I want to query the table and group by athr_name, post_date, page_nm and visit_date to get the number of visits and visitors. However, I also want the rows with null values to be merged into the rows where athr_name and post_date are not null, with the nulls replaced by those values (page_nm contains unique values, so for a given page there can only be the correct athr_name or null). In other words, I have:
athr_name post_date page_nm visit_date visit visitors
1 Steve 9/1/2019 /page1/content/ 20191014 45 11
2 Steve 9/1/2019 /page1/content/ 20191015 62 38
3 Steve 9/1/2019 /page1/content/ 20191016 28 49
4 Steve 9/1/2019 /page1/content/ 20191207 54 70
5 Steve 9/1/2019 /page1/content/ 20191208 39 26
6 ? ? /page1/content/ 20191014 28 24
7 ? ? /page1/content/ 20191015 17 63
8 ? ? /page1/content/ 20191016 48 40
9 ? ? /page1/content/ 20191017 47 14
10 ? ? /page1/content/ 20191018 33 1
I would like to collapse this data into a result like this:
athr_name post_date page_nm visit_date visit visitors
1 Steve 9/1/2019 /page1/content/ 20191014 73 35
2 Steve 9/1/2019 /page1/content/ 20191015 79 101
3 Steve 9/1/2019 /page1/content/ 20191016 76 89
4 Steve 9/1/2019 /page1/content/ 20191017 47 14
5 Steve 9/1/2019 /page1/content/ 20191018 33 1
6 Steve 9/1/2019 /page1/content/ 20191207 54 70
7 Steve 9/1/2019 /page1/content/ 20191208 39 26
If these were columns rather than rows, I could handle it with the COALESCE function. Any help is greatly appreciated.
Is this what you want?
select max(athr_name), max(post_date), page_nm,
visit_date, sum(visit), sum(visitors)
from t
group by page_nm, visit_date;
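One thing to watch with the sample data: 20191017 and 20191018 only have rows where the author is null, so a max() taken per (page_nm, visit_date) group has no non-null value to pick up on those dates. A minimal sketch of a variant that fills the author and post date per page first and then aggregates is below. It assumes the table is called t as in the query above, that "?" is a real NULL rather than a literal string, and that each page_nm has at most one distinct non-null athr_name and post_date, as the question states. The alias filled is just an illustrative name.

```sql
-- Sketch: fill athr_name / post_date per page, then aggregate per day.
-- Assumes real NULLs (not the string '?') and one author/date per page_nm.
SELECT athr_name,
       post_date,
       page_nm,
       visit_date,
       sum(visit)    AS visit,
       sum(visitors) AS visitors
FROM (
    SELECT max(athr_name) OVER (PARTITION BY page_nm) AS athr_name,
           max(post_date) OVER (PARTITION BY page_nm) AS post_date,
           page_nm,
           visit_date,
           visit,
           visitors
    FROM t
) filled
GROUP BY athr_name, post_date, page_nm, visit_date;
```

Because max() ignores NULLs, every row of a page gets that page's single non-null value, and pages with no author at all simply stay NULL.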
First, you need to fill in the null values using LAST_VALUE. Your query could look like this:
SELECT athr_name,
       post_date,
       page_nm,
       visit_date,
       sum(visit),
       sum(visitors)
FROM (
    SELECT nvl(athr_name, LAST_VALUE(athr_name, TRUE)
               OVER (ORDER BY page_nm, athr_name NULLS LAST
                     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)) AS athr_name,
           nvl(post_date, LAST_VALUE(post_date, TRUE)
               OVER (ORDER BY page_nm, post_date NULLS LAST
                     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)) AS post_date,
           page_nm,
           visit_date,
           visit,
           visitors
    FROM your_table) AS tmp_view
GROUP BY athr_name, post_date, page_nm, visit_date;
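Update:
If some pages might have no corresponding author name or post date, it is better to use this query, which preserves that information: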
SELECT athr_name, post_date, page_nm, visit_date, sum(visit), sum(visitors)
FROM (
    SELECT name_view.athr_name AS athr_name,
           date_view.post_date AS post_date,
           main.page_nm,
           main.visit_date,
           main.visit,
           main.visitors
    FROM your_table main
    LEFT JOIN (SELECT athr_name, page_nm, row_number() OVER (PARTITION BY page_nm) AS rn
               FROM your_table
               WHERE athr_name IS NOT NULL) name_view
        ON main.page_nm = name_view.page_nm AND name_view.rn = 1
    LEFT JOIN (SELECT post_date, page_nm, row_number() OVER (PARTITION BY page_nm) AS rn
               FROM your_table
               WHERE post_date IS NOT NULL) date_view
        ON main.page_nm = date_view.page_nm AND date_view.rn = 1) AS tmp_view
GROUP BY athr_name, post_date, page_nm, visit_date;
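No. I believe this would work in SQL, but it doesn't seem to work in Hive. I get the same results with and without the max() calls.
@P5C768 If you are able to remove the max()s, then your GROUP BY is not the same as the one in this answer.
Thanks @Lyashko, I found that I have some underlying data issues, but I believe this solution will work.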