Hadoop query to compare row values with group values, with a condition

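I want to port some R code to Hadoop so that it can run on Impala or Hive with SQL-like queries. My code is based on the following problem:

For each row, the goal is to find the number of rows in subgroup 1 that have the same id and a lower price.

Suppose I have the following data: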

CREATE TABLE project
(
    id int,
    price int, 
    subgroup int
);

INSERT INTO project(id,price,subgroup) 
VALUES
    (1, 10, 1), 
    (1, 10, 1), 
    (1, 12, 1),
    (1, 15, 1),
    (1,  8, 2),
    (1, 11, 2),
    (2,  9, 1),
    (2, 12, 1),
    (2, 14, 2),
    (2, 18, 2);

Now, for the rows in subgroup 1, the following query works fine in Impala:

select *, rank() over (partition by id order by price asc) - 1 as cheaper
from project
where subgroup = 1

But I also need to handle the rows in subgroup 2.

So the result I want is:

id  price   subgroup   cheaper
1   10      1          0 ( because no row is cheaper in id 1 subgroup 1)
1   10      1          0 ( because no row is cheaper in id 1 subgroup 1)
1   12      1          2 ( rows 1 and 2 are cheaper)
1   15      1          3
1    8      2          0 (nobody is cheaper in id 1 and subgroup 1)
1   11      2          2
2    9      1          0
2   12      1          1
2   14      2          2
2   18      2          2

We can try the following query:

select * from 
    (
    select *, rank() over (partition by id order by price asc) - 1 as cheaper
    from project
    where subgroup = 1 union
    select *, rank() over (partition by id order by price asc) - 1 as cheaper
    from project
    where subgroup = 2) result

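I ran into exactly the same problem a while ago. It is as if you need a window function that accepts a where clause. To work around it, I collected price into an array (for subgroup = 1) and self-joined the table. Then I wrote a UDF that filters the array by a given predicate.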

UDF：

package somepkg;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
import java.util.ArrayList;

public class FilterArrayUDF extends UDF {
    public ArrayList<Integer> evaluate(ArrayList<Text> arr, int p) {
        ArrayList<Integer> newList = new ArrayList<Integer>();

        // Keep only the elements that are strictly lower than p.
        for (int i = 0; i < arr.size(); i++) {
            int elem = Integer.parseInt((arr.get(i)).toString());
            if (elem < p)
                newList.add(elem);
        }
        return newList;
    }
}

add jar /path/to/jars/hive-udfs.jar;
create temporary function filter_arr as 'somepkg.FilterArrayUDF';

select B.id, price, subgroup, price_arr
  , filter_arr(price_arr, price) cheaper_arr
  , size(filter_arr(price_arr, price)) cheaper
from db.tbl B
join (
  select id, collect_list(price) price_arr
  from db.tbl
  where subgroup = 1
  group by id ) A
on B.id = A.id

Output:

1    10    1    [10,10,12,15]    []               0
1    10    1    [10,10,12,15]    []               0
1    12    1    [10,10,12,15]    [10,10]          2
1    15    1    [10,10,12,15]    [10,10,12]       3
1    8     2    [10,10,12,15]    []               0
1    11    2    [10,10,12,15]    [10,10]          2
2    9     1    [9,12]           []               0
2    12    1    [9,12]           [9]              1
2    14    2    [9,12]           [9,12]           2
2    18    2    [9,12]           [9,12]           2
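
The same counts can probably also be obtained without a UDF, using only window functions. The query below is a sketch under two assumptions: it runs against the project table from the question, and the engine honours the explicit RANGE frame written out here. For each row it counts the subgroup-1 rows with the same id and a price less than or equal to the row's price, then subtracts the subgroup-1 rows with exactly the same price, which leaves only the strictly cheaper ones.

```sql
-- Sketch only: assumes the `project` table from the question.
select id, price, subgroup,
       -- subgroup-1 rows with the same id and price <= this row's price
       sum(case when subgroup = 1 then 1 else 0 end)
         over (partition by id order by price
               range between unbounded preceding and current row)
       -- minus subgroup-1 rows with the same id and exactly this price
     - sum(case when subgroup = 1 then 1 else 0 end)
         over (partition by id, price) as cheaper
from project
order by id, subgroup, price;
```

Taking the row (1, 11, 2) as a check: the first sum sees the two subgroup-1 rows priced 10, the second sum sees no subgroup-1 row priced 11, so cheaper is 2, which agrees with the expected result in the question. Tied prices such as the two (1, 10, 1) rows cancel out and give 0.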

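Thanks, but that is not what I want: I want to compare the prices in group 2 with the prices in group 1. So every row (no matter which subgroup it is in) should be compared against all the prices in subgroup 1.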