Hive: looping over the values in each row of a Hive table (HiveQL)
Suppose I have a Hive table named table, as follows:
| lower | upper |
|-------|-------|
| 1 | 10 |
| 2 | 3 |
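Assume that each value in the lower column is strictly less than the corresponding value in the upper column. I want a third column whose value is an aggregate over the integers from lower to upper. To be concrete, suppose I want the third column to be the sum of all integers from lower to upper, inclusive, i.e. the table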
| lower | upper | sum |
|-------|-------|-----|
| 1 | 10 | 55 |
| 2 | 3 | 5 |
The query in Hive would look something like
SELECT lower, upper, SUM(...) AS sum FROM table;
but I cannot work out what the ... should be. I think a suitable modification of
SELECT a, AVG(b) OVER (PARTITION BY c ORDER BY d ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
FROM T;
should work, but I don't know how to modify it.
Here is how I did it.
I wrote a small Python script, sumhive.py:
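import sys

try:
    for line in sys.stdin:
        line = line.strip()
        # Hive passes each row as one tab-separated line.
        nums = line.split('\t')
        num1 = int(nums[0])
        num2 = int(nums[1])
        # Sum the integers from num1 to num2, inclusive.
        total = 0
        for i in range(num1, num2 + 1):
            total = total + i
        sys.stdout.write('\t'.join([str(num1), str(num2), str(total)]) + '\n')
except Exception:
    # Report errors on stderr so they do not mix with the output rows.
    sys.stderr.write(repr(sys.exc_info()) + '\n')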
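Before registering the script with Hive, it can help to check the per-row logic locally. The following is only a sketch, not part of the original answer: `transform_line` is a hypothetical helper name, and it assumes each row arrives as a single tab-separated line holding two integers, which is the format the script above parses.

```python
def transform_line(line):
    """Return 'lower<TAB>upper<TAB>sum' for one tab-separated input row."""
    lower, upper = (int(field) for field in line.strip().split('\t'))
    total = sum(range(lower, upper + 1))  # inclusive of both endpoints
    return '\t'.join(str(value) for value in (lower, upper, total))


if __name__ == '__main__':
    # Sample rows standing in for what Hive would write to the script's stdin.
    for row in ['1\t10\n', '2\t5\n']:
        print(transform_line(row))
```

Keeping the row logic in a plain function makes it easy to test without a cluster; the stdin loop then only has to call it once per line.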
Make the Python file executable:
python]$ chmod +x sumhive.py
Now add this Python UDF script to Hive:
hive> add FILE /home/xxx/user/vikrant/python/sumhive.py;
Added resources: [/home/xxx/user/vikrant/python/sumhive.py]
Here is your table in Hive:
hive> select * from db.yourhivetable;
OK
1 10
2 5
Run the query below to transform the rows with the Python script:
select TRANSFORM (lower,upper) USING 'python sumhive.py' As (num1,num2,sum) FROM db.yourhivetable;
Result:
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 2.56 sec HDFS Read: 5136 HDFS Write: 15 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 560 msec
OK
1 10 55
2 5 14
Time taken: 14.635 seconds, Fetched: 2 row(s)
Update: using a SQL query in Hive
I used posexplode to get the desired result.
hive> select * from db.yourhivetable;
OK
1 10
2 5
Here is the query:
select t.start_range,t.end_range,sum(t.start_range+pe.i) as seq from
(select lower as start_range,upper as end_range from db.yourhivetable) t
lateral view posexplode(split(space(end_range - start_range),' ')) pe as i,s
group by t.start_range,t.end_range
;
This gives:
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 1 1 0 0 0 0
Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 4.04 s
--------------------------------------------------------------------------------
OK
1 10 55
2 5 14
Time taken: 5.148 seconds, Fetched: 2 row(s)
Thanks for your answer. The problem above is a prototype of a larger computation routine that I originally wrote in Python, but I want to use MapReduce to make it faster. Perhaps writing a UDF in Java would be a good option.
@abhishekParab I have just updated the answer with a SQL version as well. It took some time to get it working. :-)
It works perfectly and is faster than the Python UDF. Thanks for the edit, it works! I am new to Hive and did not know about LATERAL VIEW POSEXPLODE, but I will try to understand your code. Thanks again!
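To see why the posexplode query produces the inclusive sum, the trick can be mimicked in plain Python. This is an illustrative sketch only: `posexplode_sum` is a hypothetical name, and it assumes, as the output above suggests, that splitting a string of n spaces on a single space yields n + 1 empty pieces, so the positions run from 0 to n.

```python
def posexplode_sum(lower, upper):
    """Mimic sum(start_range + pe.i) over posexplode(split(space(upper - lower), ' '))."""
    padding = ' ' * (upper - lower)   # space(n): a string of n spaces
    pieces = padding.split(' ')       # n separators yield n + 1 empty strings
    # posexplode emits (position, value) pairs; enumerate plays the same role here.
    return sum(lower + pos for pos, _ in enumerate(pieces))


if __name__ == '__main__':
    for lower, upper in [(1, 10), (2, 5)]:
        print(lower, upper, posexplode_sum(lower, upper))
```

Each row is expanded into upper - lower + 1 rows, one per integer in the range, and the GROUP BY collapses them back into a single row per (lower, upper) pair.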
Another option is to compute the sum in closed form, without generating a row per integer:
CREATE EXTERNAL TABLE sum_n(
l int , u int
);
insert into sum_n values (1,10),(2,3),(2,5),(3,7);
select l,u,case
when u-l=1 then u+l
when u-l>1 and l=1 then (u*u+u)/2
when u-l > 1 and l <> 1 then (u * u + u)/2 - (l * l - l)/2
else null
end as summ
from sum_n;
l u summ
1 10 55.0
2 3 5.0
2 5 14.0
3 7 25.0
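The three branches of the CASE expression above are all instances of one arithmetic-series identity: the sum of lower..upper equals (lower + upper) * (upper - lower + 1) / 2. The sketch below, with the hypothetical name `range_sum`, checks that single formula against a brute-force sum for the four sample rows. It is written in Python rather than HiveQL purely for illustration.

```python
def range_sum(lower, upper):
    """Sum of the integers lower..upper inclusive, via the arithmetic-series formula."""
    # (lower + upper) * count is always even, so integer division is exact.
    return (lower + upper) * (upper - lower + 1) // 2


if __name__ == '__main__':
    for lower, upper in [(1, 10), (2, 3), (2, 5), (3, 7)]:
        brute_force = sum(range(lower, upper + 1))
        print(lower, upper, range_sum(lower, upper), brute_force)
```

The results in the Hive output show as 55.0, 5.0 and so on because the `/` operator in Hive returns a double. The `DIV` operator could be used instead if an integer result is wanted.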