Hadoop query to compare a row value against group values, with a condition
I would like to port some R code to Hadoop so that it can run on Impala or Hive using SQL-like queries. The problem my code solves is this: for each row, find the number of rows in subgroup 1 that have the same id and a lower price. Suppose I have the following data:
CREATE TABLE project
(
id int,
price int,
subgroup int
);
INSERT INTO project(id,price,subgroup)
VALUES
(1, 10, 1),
(1, 10, 1),
(1, 12, 1),
(1, 15, 1),
(1, 8, 2),
(1, 11, 2),
(2, 9, 1),
(2, 12, 1),
(2, 14, 2),
(2, 18, 2);
Now, for the rows in subgroup 1, the following query works fine in Impala:
select *, rank() over (partition by id order by price asc) - 1 as cheaper
from project
where subgroup = 1
But I also need to handle the rows in subgroup 2.
So the result I would like to get is:
id  price  subgroup  cheaper
1   10     1         0   (because no row is cheaper in id 1 subgroup 1)
1   10     1         0   (because no row is cheaper in id 1 subgroup 1)
1   12     1         2   (rows 1 and 2 are cheaper)
1   15     1         3
1   8      2         0   (nobody is cheaper in id 1 and subgroup 1)
1   11     2         2
2   9      1         0
2   12     1         1
2   14     2         2
2   18     2         2
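To pin down the semantics of the `cheaper` column before moving to Hive or Impala, here is a minimal plain-Java sketch that recomputes it for the sample rows. The class and method names (`CheaperReference`, `cheaperCounts`, `sampleRows`) are made up for illustration, and the quadratic loop is only meant as a reference for small data, not as a suggestion for how to run this on a cluster.

```java
import java.util.Arrays;

// Plain-Java reference for the expected "cheaper" column (no Hadoop involved).
// All names here are hypothetical and exist only for this illustration.
class CheaperReference {
    // For each row {id, price, subgroup}, count the subgroup-1 rows that have
    // the same id and a strictly lower price.
    static int[] cheaperCounts(int[][] rows) {
        int[] result = new int[rows.length];
        for (int i = 0; i < rows.length; i++) {
            int count = 0;
            for (int[] other : rows) {
                boolean sameId = other[0] == rows[i][0];
                boolean inSubgroup1 = other[2] == 1;
                boolean cheaper = other[1] < rows[i][1];
                if (sameId && inSubgroup1 && cheaper) {
                    count++;
                }
            }
            result[i] = count;
        }
        return result;
    }

    // The sample rows from the INSERT statement, as {id, price, subgroup}.
    static int[][] sampleRows() {
        return new int[][] {
            {1, 10, 1}, {1, 10, 1}, {1, 12, 1}, {1, 15, 1}, {1, 8, 2},
            {1, 11, 2}, {2, 9, 1}, {2, 12, 1}, {2, 14, 2}, {2, 18, 2}
        };
    }

    public static void main(String[] args) {
        // prints [0, 0, 2, 3, 0, 2, 0, 1, 2, 2]
        System.out.println(Arrays.toString(cheaperCounts(sampleRows())));
    }
}
```

The printed values match the `cheaper` column of the expected result, row by row. Note that a subgroup-2 row is still compared only against subgroup-1 prices.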
We can try the following query:
select * from
(
select *, rank() over (partition by id order by price asc) - 1 as cheaper
from project
where subgroup = 1 union
select *, rank() over (partition by id order by price asc) - 1 as cheaper
from project
where subgroup = 2) result
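It may help to check what this query yields for a subgroup-2 row. Because each branch filters to one subgroup before ranking, `rank() - 1` counts the cheaper rows within the row's own subgroup. The sketch below mimics that in plain Java; the class and method names are hypothetical, and it models only the counting, not the `union` itself.

```java
// Plain-Java model of "rank() over (partition by id order by price) - 1"
// when the input has first been filtered to the row's own subgroup.
// Names are hypothetical; rows are {id, price, subgroup}.
class PerSubgroupRank {
    // rank() - 1 equals the number of rows in the same partition with a
    // strictly lower price; here the partition is (id, own subgroup).
    static int rankMinusOneWithinOwnSubgroup(int[][] rows, int i) {
        int count = 0;
        for (int[] other : rows) {
            boolean sameId = other[0] == rows[i][0];
            boolean sameSubgroup = other[2] == rows[i][2];
            if (sameId && sameSubgroup && other[1] < rows[i][1]) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int[][] rows = {
            {1, 10, 1}, {1, 10, 1}, {1, 12, 1}, {1, 15, 1}, {1, 8, 2},
            {1, 11, 2}, {2, 9, 1}, {2, 12, 1}, {2, 14, 2}, {2, 18, 2}
        };
        // Row index 5 is (id 1, price 11, subgroup 2).
        System.out.println(rankMinusOneWithinOwnSubgroup(rows, 5)); // prints 1
    }
}
```

For the row (1, 11, 2) this gives 1, because only the price 8 in subgroup 2 is lower, whereas the expected result in the question lists 2 for that row (the two subgroup-1 prices of 10).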
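I ran into exactly the same problem a while ago. It is as if you need a window function that accepts a where clause inside it. To work around this, I collected price into an array (restricted to subgroup = 1) and self-joined the table. Then I wrote a UDF that filters the array by a given predicate.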
UDF:
package somepkg;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
import java.util.ArrayList;

public class FilterArrayUDF extends UDF {
    // Returns the elements of arr that are strictly less than p.
    public ArrayList<Integer> evaluate(ArrayList<Text> arr, int p) {
        ArrayList<Integer> newList = new ArrayList<Integer>();
        for (int i = 0; i < arr.size(); i++) {
            int elem = Integer.parseInt((arr.get(i)).toString());
            if (elem < p)
                newList.add(elem);
        }
        return newList;
    }
}
Query:
add jar /path/to/jars/hive-udfs.jar;
create temporary function filter_arr as 'somepkg.FilterArrayUDF';
select B.id, price, subgroup, price_arr
, filter_arr(price_arr, price) cheaper_arr
, size(filter_arr(price_arr, price)) cheaper
from db.tbl B
join (
select id, collect_list(price) price_arr
from db.tbl
where subgroup = 1
group by id ) A
on B.id = A.id
Output:
1 10 1 [10,10,12,15] [] 0
1 10 1 [10,10,12,15] [] 0
1 12 1 [10,10,12,15] [10,10] 2
1 15 1 [10,10,12,15] [10,10,12] 3
1 8 2 [10,10,12,15] [] 0
1 11 2 [10,10,12,15] [10,10] 2
2 9 1 [9,12] [] 0
2 12 1 [9,12] [9] 1
2 14 2 [9,12] [9,12] 2
2 18 2 [9,12] [9,12] 2
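The filtering step can be tried without a Hive installation. The following is a rough standalone sketch of the same predicate on plain Java lists; `FilterArraySketch` and `filterBelow` are hypothetical names, and it takes `List<Integer>` directly rather than the `Text` elements the UDF above receives.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hive-free sketch of the UDF's filtering logic; names are hypothetical.
class FilterArraySketch {
    // Keep only the elements strictly below the threshold p, preserving order.
    static List<Integer> filterBelow(List<Integer> arr, int p) {
        List<Integer> kept = new ArrayList<>();
        for (int elem : arr) {
            if (elem < p) {
                kept.add(elem);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Subgroup-1 prices collected for id 1, as in the price_arr column.
        List<Integer> priceArr = Arrays.asList(10, 10, 12, 15);
        List<Integer> cheaperArr = filterBelow(priceArr, 12);
        // prints [10, 10] 2
        System.out.println(cheaperArr + " " + cheaperArr.size());
    }
}
```

This reproduces the third output row: for price 12 the filtered array is [10, 10] and its size, the `cheaper` value, is 2.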
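Thanks, but this is not what I want: I want to compare the prices in group 2 with the prices in group 1. So every row (whichever subgroup it is in) should be compared against all the prices in subgroup 1.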