Hadoop: avoiding a single file per partition with the hive.optimize.sort.dynamic.partition option

Tags: hadoop, hive, hiveql, reducers, hive-configuration

I'm using Hive.

When I write dynamic partitions with an INSERT query and enable the hive.optimize.sort.dynamic.partition option (SET hive.optimize.sort.dynamic.partition=true), there is always exactly one file in each partition.

But if I turn the option off (SET hive.optimize.sort.dynamic.partition=false), I get an out-of-memory exception like this:

TaskAttempt 3 failed, info=[Error: Error while running task ( failure ) : attempt_1534502930145_6994_1_01_000008_3:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:194)
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:168)
        at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:370)
        at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
        at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
        at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
        at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
        at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.OutOfMemoryError: Java heap space
        at org.apache.parquet.column.values.dictionary.IntList.initSlab(IntList.java:90)
        at org.apache.parquet.column.values.dictionary.IntList.<init>(IntList.java:86)
        at org.apache.parquet.column.values.dictionary.DictionaryValuesWriter.<init>(DictionaryValuesWriter.java:93)
        at org.apache.parquet.column.values.dictionary.DictionaryValuesWriter$PlainBinaryDictionaryValuesWriter.<init>(DictionaryValuesWriter.java:229)
        at org.apache.parquet.column.ParquetProperties.dictionaryWriter(ParquetProperties.java:131)
        at org.apache.parquet.column.ParquetProperties.dictWriterWithFallBack(ParquetProperties.java:178)
        at org.apache.parquet.column.ParquetProperties.getValuesWriter(ParquetProperties.java:203)
        at org.apache.parquet.column.impl.ColumnWriterV1.<init>(ColumnWriterV1.java:83)
        at org.apache.parquet.column.impl.ColumnWriteStoreV1.newMemColumn(ColumnWriteStoreV1.java:68)
        at org.apache.parquet.column.impl.ColumnWriteStoreV1.getColumnWriter(ColumnWriteStoreV1.java:56)
        at org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.<init>(MessageColumnIO.java:184)
        at org.apache.parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:376)
        at org.apache.parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:109)
        at org.apache.parquet.hadoop.InternalParquetRecordWriter.<init>(InternalParquetRecordWriter.java:99)
        at org.apache.parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:100)
        at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:327)
        at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:288)
        at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.<init>(ParquetRecordWriterWrapper.java:67)
        at org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getParquerRecordWriterWrapper(MapredParquetOutputFormat.java:128)
        at org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getHiveRecordWriter(MapredParquetOutputFormat.java:117)
        at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getRecordWriter(HiveFileFormatUtils.java:286)
        at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:271)
        at org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketForFileIdx(FileSinkOperator.java:619)
        at org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketFiles(FileSinkOperator.java:563)
        at org.apache.hadoop.hive.ql.exec.FileSinkOperator.createNewPaths(FileSinkOperator.java:867)
        at org.apache.hadoop.hive.ql.exec.FileSinkOperator.getDynOutPaths(FileSinkOperator.java:975)
        at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:715)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
        at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95)
        at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:356)
        at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:287)
        at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.run(ReduceRecordProcessor.java:317)
]], Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:299, Vertex vertex_1534502930145_6994_1_01 [Reducer 2] killed/failed due to:OWN_TASK_FAILURE]Vertex killed, vertexName=Map 1, vertexId=vertex_1534502930145_6994_1_00, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to OTHER_VERTEX_FAILURE, failedTasks:0 killedTasks:27, Vertex vertex_1534502930145_6994_1_00 [Map 1] killed/failed due to:OTHER_VERTEX_FAILURE]DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:1

distribute by partition key helps to solve the OOM problem, but this configuration can cause each reducer to write an entire partition, depending on the hive.exec.reducers.bytes.per.reducer setting, which may default to a very high value such as 1 GB. distribute by partition key can also introduce an additional reduce stage, the same as hive.optimize.sort.dynamic.partition does.
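For reference, the current value can be inspected and adjusted directly in the Hive session; the 128 MB figure below is purely illustrative, not a value suggested by the answer.

    -- Print the current value (SET without a value echoes the setting)
    SET hive.exec.reducers.bytes.per.reducer;

    -- Lower it so that more reducers are launched (illustrative value: 128 MB)
    SET hive.exec.reducers.bytes.per.reducer=134217728;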

So, to avoid OOM and achieve maximum performance:

  • Add distribute by partition key at the end of your INSERT query. This makes rows with the same partition key be processed by the same reducer. Alternatively, or in addition to this, you can use hive.optimize.sort.dynamic.partition=true.
  • Set hive.exec.reducers.bytes.per.reducer to trigger more reducers if there is too much data in one partition. Just check the current value of hive.exec.reducers.bytes.per.reducer and decrease or increase it accordingly to get suitable reducer parallelism. This setting determines how much data a single reducer will process and how many files will be created per partition.
  • For example:

    set hive.exec.reducers.bytes.per.reducer=33554432;  -- 32 MB per reducer

    -- "target_table" is a placeholder; the original snippet omitted the table name
    insert overwrite table target_table partition (load_date)
    select * from src_table
    distribute by load_date;
    

See also this answer about controlling the number of mappers and reducers:

Finally I found out what the problem was.

First of all, the execution engine was Tez, and the mapreduce.reduce.memory.mb option did not help. You should use the hive.tez.container.size option instead. When writing dynamic partitions, a reducer opens multiple record writers, so it needs enough memory to write to several partitions at the same time.

If you use the hive.optimize.sort.dynamic.partition option, a global partition sort is performed, and sorting implies that there are reducers. In that case, if there are no other reducer tasks, each partition is processed by exactly one reducer, which is why there is only one file per partition. distribute by spawns more reduce tasks, so it can produce more files in each partition, but the memory problem stays the same.
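As a hedged illustration of that last point: distributing by the partition key plus a loosely correlated second column spreads each partition over several reducers, so each partition ends up with several files. Here event_timestamp_date is the partition column mentioned in the comments below; target_table, src_table, and user_id are placeholders, not names from the question.

    -- Placeholder names: target_table, src_table, user_id are hypothetical
    insert overwrite table target_table partition (event_timestamp_date)
    select * from src_table
    distribute by event_timestamp_date, user_id;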


So the container memory size really matters! Do not forget to use the hive.tez.container.size option to change the Tez container memory size.
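A minimal sketch of what that could look like; the sizes below are illustrative assumptions, not values taken from the answer.

    -- Tez container size in MB: each reducer JVM must hold one open record
    -- writer (plus its buffers, e.g. Parquet dictionaries) per partition it writes
    SET hive.tez.container.size=4096;

    -- Heap for the container JVM, conventionally somewhat below the container size
    SET hive.tez.java.opts=-Xmx3276m;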

I set hive.optimize.sort.dynamic.partition=true and hive.exec.reducers.bytes.per.reducer=1024, but there is still a single file (over 10 GB) in the partition. Higher values such as 1048576 and 10485760, and smaller values such as 512, did not change anything.

@JuhongJung Also try distribute by together with bytes per reducer, without hive.optimize.sort.dynamic.partition. And what is the point of setting bytes per reducer to such a small value (10485760 = 10 MB, which is smaller than even a single block)?

I tried SET hive.optimize.sort.dynamic.partition=false and SET hive.exec.reducers.bytes.per.reducer=1048576 together with distribute by event_timestamp_date, but it caused the memory exception. I added a sample query to my question. Please check :) Thank you very much!

This: set mapred.reduce.tasks=300 - it may override bytes.per.reducer and force 300 reducers. How many reducers were actually started? And distribute by event_timestamp_date + one more column (not too correlated with the partition) will definitely create more than one file per partition.

@leftjoin Sorry for the late reply. I tried distribute by event_timestamp_date + one more column (not too correlated with the partition) and varied hive.exec.reducers.bytes.per.reducer from 1024 up to 104857600, but there was always only one file in a partition, for example a file of more than 10 GB. I cannot understand this, because there were more than 1000 reduce tasks (vertices), yet the result is a single file. I also tried mr as the execution engine and the result was the same.
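Following up on the mapred.reduce.tasks remark in the comments above, a hedged sanity check worth running first (the -1 default is standard Hive behaviour; everything else is illustrative):

    -- If this prints a positive number, it overrides
    -- hive.exec.reducers.bytes.per.reducer and pins the reducer count
    SET mapred.reduce.tasks;

    -- -1 is the default: let Hive derive the reducer count from bytes-per-reducer
    SET mapred.reduce.tasks=-1;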