Hive: looping over a range of values in each row of a Hive table (HiveQL)

Suppose I have a Hive table named table, as follows:

| lower | upper |
|-------|-------|
| 1     | 10    |
| 2     | 3     |
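Assume that each value in the lower column is strictly less than the corresponding value in the upper column. I want a third column whose value is an aggregate over the integers from lower to upper. To be concrete, suppose I want the third column to be the sum of all integers from lower to upper, inclusive, i.e. the table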

| lower | upper | sum |
|-------|-------|-----|
| 1     | 10    | 55  |
| 2     | 3     | 5   |
The query in Hive would look something like

SELECT lower, upper, SUM(...) AS sum FROM table;
but I cannot figure out what should go in place of the "..." inside SUM. I think a suitable modification of

SELECT a, AVG(b) OVER (PARTITION BY c ORDER BY d ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
FROM T;
should work, but I do not know how to modify it.

Here is how I did it.

I wrote a small Python script, sumhive.py:

sumhive.py-->

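import sys

try:
    # Hive TRANSFORM sends each row on stdin as tab-separated fields.
    for line in sys.stdin:
        line = line.strip()
        nums = line.split('\t')
        num1 = int(nums[0])
        num2 = int(nums[1])
        # Sum every integer from num1 to num2, inclusive.
        # The accumulator is named total so it does not shadow the built-in sum().
        total = 0
        for i in range(num1, num2 + 1):
            total = total + i
        sys.stdout.write('\t'.join([str(num1), str(num2), str(total)]) + '\n')
except Exception:
    # Report failures on stderr so they do not mix with the rows returned to Hive.
    sys.stderr.write(str(sys.exc_info()) + '\n')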
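Before registering the script in Hive, the per-row logic can be checked locally. The following is only a minimal sketch: `transform_line` and `run` are hypothetical helper names that do not appear in sumhive.py, and the in-memory streams merely stand in for the stdin/stdout pipes that TRANSFORM uses.

```python
import io


def transform_line(line):
    """Turn one 'lower<TAB>upper' row into 'lower<TAB>upper<TAB>sum'."""
    fields = line.strip().split('\t')
    lower, upper = int(fields[0]), int(fields[1])
    total = sum(range(lower, upper + 1))  # both ends are included
    return '\t'.join([str(lower), str(upper), str(total)])


def run(stream_in, stream_out):
    """Process every row of stream_in and write one result row per input row."""
    for line in stream_in:
        stream_out.write(transform_line(line) + '\n')


# Two sample rows in the tab-separated layout that TRANSFORM sends to the script.
fake_stdin = io.StringIO('1\t10\n2\t5\n')
fake_stdout = io.StringIO()
run(fake_stdin, fake_stdout)
print(fake_stdout.getvalue(), end='')
```

Keeping the row logic in its own function makes it testable without a cluster; the real script still has to read `sys.stdin` and write `sys.stdout`.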
Make the Python file executable:

python]$ chmod +x sumhive.py
Now add this Python UDF script to Hive:

hive> add FILE /home/xxx/user/vikrant/python/sumhive.py;
Added resources: [/home/xxx/user/vikrant/python/sumhive.py]
Here is the table in Hive:

hive> select * from db.yourhivetable;
OK
1       10
2       5
Run the query below to transform the rows with the Python UDF:

select TRANSFORM (lower,upper) USING 'python sumhive.py' As (num1,num2,sum) FROM db.yourhivetable;
Result:

MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 2.56 sec   HDFS Read: 5136 HDFS Write: 15 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 560 msec
OK
1       10      55
2       5       14
Time taken: 14.635 seconds, Fetched: 2 row(s)
Update: using a SQL query in Hive

I used posexplode to get the desired result.

hive> select * from db.yourhivetable;
OK
1       10
2       5
Here is the query:

select t.start_range,t.end_range,sum(t.start_range+pe.i) as seq from
(select lower as start_range,upper as end_range from db.yourhivetable) t
lateral view posexplode(split(space(end_range - start_range),' ')) pe as i,s
group by t.start_range,t.end_range
;
This gives:

    VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 4.04 s
--------------------------------------------------------------------------------
OK
1       10      55
2       5       14
Time taken: 5.148 seconds, Fetched: 2 row(s)

Thanks for the answer. The problem above is a prototype of a larger computation routine that I originally wrote in Python, but I want to use MapReduce to make it faster. Perhaps writing a UDF in Java would be a good option.

@abhishekParab - I have just updated the answer with a SQL version as well. It took some time to work out. :-) It runs perfectly and is faster than the Python UDF.

Thanks for the edit, it works! I am new to Hive and did not know about LATERAL VIEW POSEXPLODE, but I will try to understand your code. Thanks again!
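
For readers new to LATERAL VIEW posexplode, the query above can be traced step by step: space(n) builds a string of n spaces, split on ' ' turns it into n + 1 empty strings, and posexplode emits one (position, value) pair per element, so the positions run from 0 to end_range - start_range. The following Python sketch only emulates that behaviour; `posexplode_offsets` and `range_sum` are hypothetical names used for illustration and are not Hive functions.

```python
def posexplode_offsets(start_range, end_range):
    """Emulate posexplode(split(space(end_range - start_range), ' '))."""
    spaces = ' ' * (end_range - start_range)  # space(n): a string of n spaces
    pieces = spaces.split(' ')                # n separators give n + 1 empty strings
    return list(enumerate(pieces))            # one (pos, value) pair per element


def range_sum(start_range, end_range):
    """Emulate sum(start_range + pe.i) for one (start_range, end_range) group."""
    return sum(start_range + pos
               for pos, _ in posexplode_offsets(start_range, end_range))


for start, end in [(1, 10), (2, 5)]:
    print(start, end, range_sum(start, end))
```

Each position is an offset from start_range, so adding start_range to it regenerates every integer in the range before the GROUP BY sums them.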
CREATE EXTERNAL TABLE sum_n(
l int , u int 
);


insert into sum_n values (1,10),(2,3),(2,5),(3,7);
select l,u,case
when u-l=1 then u+l
when u-l>1 and l=1 then (u*u+u)/2
when u-l > 1 and l <> 1 then (u * u + u)/2 - (l * l - l)/2
else null
end as summ
from sum_n;

l       u       summ
1       10      55.0
2       3       5.0
2       5       14.0
3       7       25.0
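
The non-null CASE branches above all appear to be instances of one identity: the sum of the integers from l to u equals (u*u + u)/2 - (l*l - l)/2. With u = l + 1 this reduces to u + l, and with l = 1 the subtracted term is 0. The sketch below checks that identity against a brute-force sum for the rows inserted into sum_n; `closed_form_sum` is a hypothetical name, and the sketch uses integer division, whereas the `/` operator in the Hive query yields doubles, which would explain the 55.0 style output.

```python
def closed_form_sum(l, u):
    """Sum of the integers l..u (inclusive) as u(u+1)/2 - (l-1)l/2."""
    # Both products are always even, so integer division is exact here.
    return (u * u + u) // 2 - (l * l - l) // 2


# Same rows as the INSERT INTO sum_n statement.
for l, u in [(1, 10), (2, 3), (2, 5), (3, 7)]:
    assert closed_form_sum(l, u) == sum(range(l, u + 1))
    print(l, u, closed_form_sum(l, u))
```

Because the formula needs no row expansion, it avoids both the external script and the lateral view, at the cost of only working for aggregates that have a closed form.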