Hadoop: Hive substr with GROUP BY causes poor performance


hadoop mapreduce hive


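I have a table with about 300,000 records in each partition.

When I run the following Hive query, which uses substr in the inner select, it hangs at the step map = 0%: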

select t0.stat_date,t0.plat,t0.soh,t0.page_name,t0.component_name ,count(*) as num,user_id,t0.other_info
from (
    select substr(stat_time,1,8) as stat_date,user_id,city_id,soh,dept_id,dph,plat,page_name,component_name,cookie_id,other_info
    from db.table
    where ds=20170221 and plat='abc'
)t0
group by stat_date,plat,soh,page_name,component_name,other_info,t0.user_id

However, if I replace select substr(stat_time,1,8) as stat_date in the inner query with select stat_time as stat_date, it runs normally.

stat_time is in YYYYMMDDHHmm format, for example 201702210900.
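For reference, Hive's substr is 1-indexed, so the call in the inner query keeps the first eight characters of stat_time, which is the date part. A quick check with the sample value, using a literal so no table is needed:

```sql
-- Take 8 characters starting at position 1 (Hive positions start at 1)
select substr('201702210900', 1, 8);  -- returns '20170221'
```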

select t0.stat_date,t0.plat,t0.soh,t0.page_name,t0.component_name ,count(*) as num,user_id,t0.other_info
from (
    select stat_time as stat_date,user_id,city_id,soh,dept_id,dph,plat,page_name,component_name,cookie_id,other_info
    from db.table
    where ds=20170221 and plat='abc'
)t0
group by stat_date,plat,soh,page_name,component_name,other_info,t0.user_id

So why does substr cause the performance drop?
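One way to narrow this down, offered as a sketch rather than a diagnosis: Hive's EXPLAIN prints the query plan without launching the job, so the plans of the two variants can be compared side by side. The statement below is the substr variant from the question, unchanged apart from the EXPLAIN keyword; whether the plans actually differ on this table is not something the post establishes.

```sql
-- Print the plan for the variant that hangs (substr in the inner query)
EXPLAIN
select t0.stat_date,t0.plat,t0.soh,t0.page_name,t0.component_name ,count(*) as num,user_id,t0.other_info
from (
    select substr(stat_time,1,8) as stat_date,user_id,city_id,soh,dept_id,dph,plat,page_name,component_name,cookie_id,other_info
    from db.table
    where ds=20170221 and plat='abc'
)t0
group by stat_date,plat,soh,page_name,component_name,other_info,t0.user_id;
```

Running the same EXPLAIN with stat_time as stat_date in place of the substr call gives the second plan to compare against.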

Edit:

I changed mapred.child.java.opts in mapred-site.xml and HADOOP_HEAPSIZE in hadoop-env.sh to 4G. That worked.
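If editing mapred-site.xml is not convenient, the task JVM heap can usually also be raised for a single Hive session with SET. This is a sketch under assumptions: the cluster must not mark these properties as final, the 4096 MB figure simply mirrors the 4G above, and the mapreduce.* names apply to Hadoop 2 (YARN), where mapred.child.java.opts is the older name. The 5120 MB container size is an illustrative value, chosen only to leave headroom above the heap. HADOOP_HEAPSIZE is an environment variable for the client-side JVM and cannot be set this way.

```sql
-- Older property name, as used in the edit above
SET mapred.child.java.opts=-Xmx4096m;

-- Hadoop 2 / YARN names (assumed values): container size for map tasks
SET mapreduce.map.memory.mb=5120;
-- and the JVM heap inside that container
SET mapreduce.map.java.opts=-Xmx4096m;
```

Comments are kept on their own lines so that they cannot end up as part of a property value.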

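My guess is that storing or computing all the substr results takes a lot of heap.

If anyone knows why using the plain field value takes less memory than substr, please leave a comment or an answer.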

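Slow performance or no performance? The query does not seem to have started running yet; I doubt it has anything to do with using substr.

Why not use a date function instead of substr?

@mbaxi stat_time has no seconds, which means the data would need preprocessing before a date function could be used. It makes no sense to me that all the preprocessing steps plus a built-in date function would be faster than substr.

@DuduMarkovitz I think you are right, the job got stuck at the very start. Actually, I changed mapred.child.java.opts in mapred-site.xml and HADOOP_HEAPSIZE in hadoop-env.sh to 4G. It finally worked.

Great :-)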
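For comparison, the date-function route discussed in the comments could look like the sketch below. Because stat_time has no seconds, the parse needs an explicit pattern. unix_timestamp, from_unixtime and to_date are Hive built-ins; the literal stands in for the stat_time column.

```sql
-- Parse the yyyyMMddHHmm string, then format it back as a date-only string
select from_unixtime(unix_timestamp('201702210900', 'yyyyMMddHHmm'), 'yyyyMMdd');

-- Or produce a yyyy-MM-dd date via to_date
select to_date(from_unixtime(unix_timestamp('201702210900', 'yyyyMMddHHmm')));
```

Either form parses and re-formats the value for every row, whereas substr only slices the string, which is the point made in the reply to @mbaxi.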