Hadoop: How do I find the nearest neighbor in Hive? Is there a window function for this?
Tags: hadoop, mapreduce, hive, hiveql
Given a table, the expected result is:
ID0, ID1
1,2
4,5
6,7
8,7
For each ID above with Flag=0, we want to find another ID with Flag=1 that has the same "State" and "City" and the closest price.
I have two naive ideas:
Method 1
Use a left outer join with the table itself on
(a.State=b.State and a.City=b.city and a.Flag=0 and b.Flag=1),
where a.Flag=0 and b.Flag=1,
and then use RANK() over (partitioned by a.State,a.City order by a.Price - b.Price) as rank
where rank=1
Method 2
Use a left outer join with the table itself,
on
(a.State=b.State and a.City=b.city and a.Flag=0 and b.Flag=1),
where a.Flag=0 and b.Flag=1,
and then Use Distribute by a.State,a.City Sort by Price_Diff ASC limit 1
What is the best way to find the nearest neighbor in Hive?
Any useful hints would be greatly appreciated.
select a.id, b.id , min(abs(b.price-a.price)) as delta
from data as a
inner join data as b
on a.country=b.country and
a.flag=0 and b.flag=1 and
a.city=b.city
group by a.id, b.id
order by delta asc;
This returns:
1 2 1 <---
8 7 2 <---
6 7 3 <---
4 5 4 <---
8 9 10
6 9 15
1 3 100
Among (6,7), (6,9), (8,7), and (8,9), the smallest price difference is in (8,7). (The join is ambiguous.)
I think you will like this video on the topic.
select a.id as id0, b.id as id1, abs(b.price-a.price) as delta,
rank() over ( partition by a.country, a.city order by abs(b.price-a.price) )
from data as a
inner join data as b
on a.country=b.country and
a.flag=0 and b.flag=1 and
a.city=b.city;
This returns:
id0 id1 prc rank
1 2 1 1 <---
1 3 100 2
4 5 4 1 <---
8 7 2 1 <---
6 7 3 2
8 9 10 3
6 9 15 4
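Note that in the output above, (6,7) gets rank 2 because the window is partitioned by country and city, so filtering on rank = 1 would drop ID 6 even though the expected result lists 6,7. One possible variation, offered as a sketch rather than a definitive answer, is to partition by a.id so that every Flag=0 row is ranked against its own candidates. It assumes the same `data` table and columns as the queries above; the aliases `delta`, `rn`, and `ranked` are names I made up, and ties are broken by the smaller b.id, which is my own choice.

```sql
-- Sketch: nearest Flag=1 neighbor per Flag=0 row, same country and city.
-- Assumes table data(id, country, city, price, flag) as used above.
select id0, id1, delta
from (
  select a.id as id0,
         b.id as id1,
         abs(b.price - a.price) as delta,
         -- rank candidates separately for each Flag=0 id;
         -- b.id is only a tie-breaker (an assumption, not from the question)
         row_number() over (
           partition by a.id
           order by abs(b.price - a.price), b.id
         ) as rn
  from data a
  join data b
    on a.country = b.country
   and a.city = b.city
  where a.flag = 0
    and b.flag = 1
) ranked
where rn = 1;
```

For IDs 6 and 8 this should keep (6,7) with a difference of 3 and (8,7) with a difference of 2, matching the expected result. Using rank() instead of row_number() would return every tied candidate rather than exactly one.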
Rows of data for IDs 6 to 9 (columns: id, country, city, price, flag):
6,NY,C,24,0
7,NY,C,27,1
8,NY,C,29,0
9,NY,C,39,1
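To try the queries on just these four rows, a minimal setup might look like the sketch below. The column types are my assumption (the thread never states them), and INSERT ... VALUES needs Hive 0.14 or later; on older versions the rows would have to be loaded from a file instead.

```sql
-- Sketch: recreate the four rows shown above in a table named data.
-- Column types are assumed, not taken from the thread.
create table data (
  id      int,
  country string,
  city    string,
  price   int,
  flag    int
);

insert into table data values
  (6, 'NY', 'C', 24, 0),
  (7, 'NY', 'C', 27, 1),
  (8, 'NY', 'C', 29, 0),
  (9, 'NY', 'C', 39, 1);
```

With only these rows, the self-join yields the four pairs (6,7), (6,9), (8,7), and (8,9), with price differences 3, 15, 2, and 10.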