Flume reads Kafka data directly into a Hive transactional table; why do the files remain open afterwards?


After reading data from Kafka, Flume writes it directly into a Hive transactional table. Why does HDFS show that the replicated (backup) size of the Hive files landing in the table is far larger than the file size?

CDH component versions:

Hadoop version: 3.0.0+CDH6.3.2; Kafka version: 2.2.1+CDH6.3.2; Hive version: 2.1.1+CDH6.3.2

Problem description:

The HDFS replication factor is configured to 2, so normally the replicated size of a file is twice its actual size; this can be seen with the hdfs dfs -du -h command.
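
If the per-file replication needs to be confirmed (in case it differs from the configured factor of 2), hdfs dfs -stat offers a quick cross-check; this is only a sketch, and the path is taken from the listing below:

# %b = file length in bytes, %r = replication factor, %n = file name.
# For a properly closed file, the second column of -du should equal %b multiplied by %r.
hdfs dfs -stat "%b %r %n" /user/hive/warehouse/test.db/spark_error/dt=2021-05-28/hour=12/delta_0377502_0377601/bucket_00000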

Structure and size of the files written to Hive:

[test@datanode06 ~]$hdfs dfs -du -h /user/hive/warehouse/test.db/spark_error/dt=2021-05-28/hour=12
4        8      /user/hive/warehouse/test.db/spark_error/dt=2021-05-28/hour=12/_orc_acid_version
647.3 K  1.3 M  /user/hive/warehouse/test.db/spark_error/dt=2021-05-28/hour=12/delta_0377502_0377601
660.0 K  1.3 M  /user/hive/warehouse/test.db/spark_error/dt=2021-05-28/hour=12/delta_0377602_0377701
1.5 K    768 M  /user/hive/warehouse/test.db/spark_error/dt=2021-05-28/hour=12/delta_0377702_0377801
[test@datanode06 ~]$hdfs dfs -du -h /user/hive/warehouse/test.db/spark_error/dt=2021-05-28/hour=12/*
4        8      /user/hive/warehouse/test.db/spark_error/dt=2021-05-28/hour=12/_orc_acid_version
647.3 K  1.3 M  /user/hive/warehouse/test.db/spark_error/dt=2021-05-28/hour=12/delta_0377502_0377601/bucket_00000
660.0 K  1.3 M  /user/hive/warehouse/test.db/spark_error/dt=2021-05-28/hour=12/delta_0377602_0377701/bucket_00000
1.5 K    512 M  /user/hive/warehouse/test.db/spark_error/dt=2021-05-28/hour=12/delta_0377702_0377801/bucket_00000
8        256 M  /user/hive/warehouse/test.db/spark_error/dt=2021-05-28/hour=12/delta_0377702_0377801/bucket_00000_flush_length
This is the state of the lagging hour=12 partition (earlier partitions show the same situation). The sizes of the last two entries are clearly abnormal. My inference is that the file is still open, which is why the replicated size looks wrong; normally the file should be closed once writing finishes, but it seems to still be held open and never closed. I don't know whether this is a configuration problem or something else, and searching online I found nothing describing this issue.
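
One way to confirm that the last bucket file really is still open for write is HDFS fsck with the -openforwrite flag; a diagnostic sketch, assuming the partition path from the listing above:

# Print files under the lagging partition that are still open for write.
# Seeing delta_0377702_0377801/bucket_00000 listed here would confirm that
# the writer has never closed it.
hdfs fsck /user/hive/warehouse/test.db/spark_error/dt=2021-05-28/hour=12 -files -blocks -openforwrite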

Flume configuration:

aspark_error.sources= source_from_kafka
aspark_error.channels= mem_channel
aspark_error.sinks= hive_sink

## kafka source 
aspark_error.sources.source_from_kafka.type = org.apache.flume.source.kafka.KafkaSource
aspark_error.sources.source_from_kafka.channels = mem_channel
aspark_error.sources.source_from_kafka.batchSize = 5000
aspark_error.sources.source_from_kafka.batchDurationMillis = 2000
aspark_error.sources.source_from_kafka.kafka.bootstrap.servers = kafka01:9092,kafka02:9092,kafka03:9092,kafka04:9092,kafka05:9092,kafka06:9092
aspark_error.sources.source_from_kafka.kafka.topics = spark_error
aspark_error.sources.source_from_kafka.kafka.consumer.group.id=spark_error_gid01
aspark_error.sources.source_from_kafka.kafka.consumer.auto.offset.reset=latest

## memory channel 
aspark_error.channels.mem_channel.type = memory
aspark_error.channels.mem_channel.capacity = 128000000
aspark_error.channels.mem_channel.transactionCapacity = 100000
aspark_error.channels.mem_channel.byteCapacityBufferPercentage = 20

## hive sink 
aspark_error.sinks.hive_sink.type=hive
aspark_error.sinks.hive_sink.hive.metastore=thrift://namenode01:9083
aspark_error.sinks.hive_sink.hive.database=test
aspark_error.sinks.hive_sink.hive.table=spark_error
aspark_error.sinks.hive_sink.hive.txnsPerBatchAsk=100
aspark_error.sinks.hive_sink.hive.partition=%Y-%m-%d,%H
aspark_error.sinks.hive_sink.batchSize=1000
aspark_error.sinks.hive_sink.round=true
aspark_error.sinks.hive_sink.roundUnit=minute
aspark_error.sinks.hive_sink.roundValue=10
aspark_error.sinks.hive_sink.serializer=JSON
aspark_error.sinks.hive_sink.serializer.fieldnames=exceptionMessage,exceptionType,runFlag,exceptionTime

## bind source and sink to the channel
aspark_error.sources.source_from_kafka.channels=mem_channel
aspark_error.sinks.hive_sink.channel=mem_channel
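
Not a confirmed explanation, but the Flume Hive sink holds one open streaming connection (and hence an open delta file) per partition until that connection is closed. The properties below are an illustrative sketch layered on top of the configuration above, with assumed values, showing the knobs that control how many connections stay open and how their transactions are kept alive:

## Illustrative additions, not part of the original configuration
## cap concurrently open Hive streaming connections; the least recently used one is closed first
aspark_error.sinks.hive_sink.maxOpenConnections = 10
## heartbeat interval in seconds that keeps open transactions from expiring
aspark_error.sinks.hive_sink.heartBeatInterval = 240
## timeout in milliseconds for Hive metastore and streaming calls
aspark_error.sinks.hive_sink.callTimeout = 30000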
HQL for the Hive transactional table:

CREATE TABLE IF NOT EXISTS test.spark_error(
  exceptionMessage string, 
  exceptionType string, 
  runFlag string, 
  exceptionTime string
  )
partitioned by (dt string,hour string)
clustered by(exceptionType) into 1 buckets
row format delimited fields terminated by '\n'
stored as orc TBLPROPERTIES('transactional'='true');
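
To check whether Hive itself still regards the streaming transactions as open, or still holds locks on the table, the ACID introspection statements below can be run from beeline or the Hive CLI; a diagnostic sketch, not part of the original post:

USE test;
-- transactions Hive currently considers open or aborted
SHOW TRANSACTIONS;
-- locks currently held on the target table
SHOW LOCKS spark_error;
-- pending or running compactions on the delta directories
SHOW COMPACTIONS;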