Hadoop: Out of memory due to hash maps used in map-side aggregation

Tags: hadoop, hive, amazon-emr, hiveql

My Hive query is throwing this exception:

Hadoop job information for Stage-1: number of mappers: 6; number of reducers: 1
2013-05-22 12:08:32,634 Stage-1 map = 0%,  reduce = 0%
2013-05-22 12:09:19,984 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201305221200_0001 with errors
Error during job, obtaining debugging information...
Examining task ID: task_201305221200_0001_m_000007 (and more) from job job_201305221200_0001
Examining task ID: task_201305221200_0001_m_000003 (and more) from job job_201305221200_0001
Examining task ID: task_201305221200_0001_m_000001 (and more) from job job_201305221200_0001

Task with the most failures(4): 
-----
Task ID:
  task_201305221200_0001_m_000001

URL:
  http://ip-10-134-7-119.ap-southeast-1.compute.internal:9100/taskdetails.jsp?jobid=job_201305221200_0001&tipid=task_201305221200_0001_m_000001

Possible error:
  Out of memory due to hash maps used in map-side aggregation.

Solution:
  Currently hive.map.aggr.hash.percentmemory is set to 0.5. Try setting it to a lower value. i.e 'set hive.map.aggr.hash.percentmemory = 0.25;'
-----

Counters:
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask


    select 
        uri, 
        count(*) as hits 
    from
        iislog
    where 
        substr(cs_cookie,instr(cs_Cookie,'cwc'),30) like '%CWC%'
    and uri like '%.aspx%' 
    and logdate = '2013-02-07' 
    group by uri 
    order by hits Desc;
I tried this on 8 EMR core instances plus 1 large master instance, with 8 GB of data. First I tried an external table (the data location is an S3 path). After that, I downloaded the data from S3 to EMR and used a native Hive table. But I got the same error in both cases.

FYI, I am using a regex SerDe to parse the IIS logs:

ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
               WITH SERDEPROPERTIES (
               "input.regex" ="([0-9-]+) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\".*\"|[^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\".*\"|[^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\".*\"|[^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([0-9-]+ [0-9:.]+) ([^ ]*) ([^ ]*) (\".*\"|[^ ]*) ([0-9-]+ [0-9:.]+)",
               "output.format.string"="%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s %10$s %11$s %12$s %13$s %14$s %15$s %16$s %17$s %18$s %19$s %20$s %21$s %22$s %23$s %24$s %25$s %26$s %27$s %28$s %29$s %30$s %31$s %32$s")
location 's3://*******'; 
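For context, a minimal sketch of what the full table definition might look like (the column list here is hypothetical; only `logdate`, `uri`, and `cs_cookie` appear in the query above, and with RegexSerDe the table must declare one string column per capture group in `input.regex`):

```sql
-- Hypothetical sketch: the real table declares one STRING column per
-- capture group in input.regex (32 groups in the pattern above).
CREATE EXTERNAL TABLE iislog (
    logdate   STRING,
    uri       STRING,
    cs_cookie STRING
    -- ... one column for each remaining capture group ...
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    "input.regex" = "..."   -- the full pattern shown above
)
LOCATION 's3://bucket/path/';   -- hypothetical path
```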
  • The table's location doesn't matter to Hive.
  • It would be better if you could paste the query, so we can tell whether the mappers are also sorting.

    In any case, we need to increase the amount of memory. Check how much memory the map tasks are configured to run with (mapred.child…). It should be at least around 1 GB. If it is already large enough, you can:

    • If the mappers are not sorting: consider bumping the hash-aggregation memory percentage indicated in the logs to a higher value.
    • If the mappers are sorting: simply increase the task memory to a larger value.
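In Hive terms, the two remedies above might look like this (the values are illustrative, not prescriptive):

```sql
-- Mappers not sorting: let map-side aggregation use more of the heap
SET hive.map.aggr.hash.percentmemory = 0.75;

-- Mappers sorting: raise the map task heap instead
-- (pre-YARN property name, as used on this cluster)
SET mapred.child.java.opts = -Xmx2048m;
```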

Have you tried setting hive.map.aggr.hash.percentmemory = 0.25, as the message suggests?

You can read more here.

I only see a query, but you say there is an exception? Please include any details about the exception itself.

@LukasVermeer The error is "Out of memory due to hash maps used in map-side aggregation"… though I have now included the details.

Thanks joydeep… Yes, my mappers are sorting. I tried to debug the issue and found that my mappers were using 5.5 GB of RAM. So I increased the RAM and it worked. Before increasing the RAM, I had also tried setting hive.map.aggr.hash.percentmemory = 0.25. I have one more question… Suppose that, to reduce memory usage, I configure these two parameters to produce compressed output: mapred.compress.map.output=true and mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec. Would that help reduce RAM usage?

Hi sergey, I have already looked at that link. It didn't help.
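For reference, the compression settings mentioned in the last comment would be applied like this. Note that they compress the map output that is spilled to disk and shuffled over the network; they do not directly shrink the in-memory hash table used by map-side aggregation:

```sql
SET mapred.compress.map.output = true;
SET mapred.map.output.compression.codec = org.apache.hadoop.io.compress.GzipCodec;
```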