Flume reads Kafka data directly into a Hive transactional table; why do the files remain open afterwards?


After reading data from Kafka, Flume writes it directly into a Hive transactional table. Why does HDFS show that the replicated (backup) size of the Hive files landing in the table is far larger than the file size?

CDH component versions:

Hadoop version: 3.0.0+CDH6.3.2; Kafka version: 2.2.1+CDH6.3.2; Hive version: 2.1.1+CDH6.3.2

Problem description:

The HDFS replication factor is configured to 2, so normally the replicated size of a file is twice its actual size; this can be seen with the hdfs dfs -du -h command.
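
If the per-file replication needs to be confirmed (in case it differs from the configured factor of 2), hdfs dfs -stat offers a quick cross-check; this is only a sketch, and the path is taken from the listing below:

# %b = file length in bytes, %r = replication factor, %n = file name.
# For a properly closed file, the second column of -du should equal %b multiplied by %r.
hdfs dfs -stat "%b %r %n" /user/hive/warehouse/test.db/spark_error/dt=2021-05-28/hour=12/delta_0377502_0377601/bucket_00000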

Structure and size of the files written to Hive:

[test@datanode06 ~]$hdfs dfs -du -h /user/hive/warehouse/test.db/spark_error/dt=2021-05-28/hour=12
4        8      /user/hive/warehouse/test.db/spark_error/dt=2021-05-28/hour=12/_orc_acid_version
647.3 K  1.3 M  /user/hive/warehouse/test.db/spark_error/dt=2021-05-28/hour=12/delta_0377502_0377601
660.0 K  1.3 M  /user/hive/warehouse/test.db/spark_error/dt=2021-05-28/hour=12/delta_0377602_0377701
1.5 K    768 M  /user/hive/warehouse/test.db/spark_error/dt=2021-05-28/hour=12/delta_0377702_0377801
[test@datanode06 ~]$hdfs dfs -du -h /user/hive/warehouse/test.db/spark_error/dt=2021-05-28/hour=12/*
4        8      /user/hive/warehouse/test.db/spark_error/dt=2021-05-28/hour=12/_orc_acid_version
647.3 K  1.3 M  /user/hive/warehouse/test.db/spark_error/dt=2021-05-28/hour=12/delta_0377502_0377601/bucket_00000
660.0 K  1.3 M  /user/hive/warehouse/test.db/spark_error/dt=2021-05-28/hour=12/delta_0377602_0377701/bucket_00000
1.5 K    512 M  /user/hive/warehouse/test.db/spark_error/dt=2021-05-28/hour=12/delta_0377702_0377801/bucket_00000
8        256 M  /user/hive/warehouse/test.db/spark_error/dt=2021-05-28/hour=12/delta_0377702_0377801/bucket_00000_flush_length
This is the state of the lagging hour=12 partition (earlier partitions show the same situation). The sizes of the last two entries are clearly abnormal. My inference is that the file is still open, which is why the replicated size looks wrong; normally the file should be closed once writing finishes, but it seems to still be held open and never closed. I don't know whether this is a configuration problem or something else, and searching online I found nothing describing this issue.
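
One way to confirm that the last bucket file really is still open for write is HDFS fsck with the -openforwrite flag; a diagnostic sketch, assuming the partition path from the listing above:

# Print files under the lagging partition that are still open for write.
# Seeing delta_0377702_0377801/bucket_00000 listed here would confirm that
# the writer has never closed it.
hdfs fsck /user/hive/warehouse/test.db/spark_error/dt=2021-05-28/hour=12 -files -blocks -openforwrite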

Flume configuration:

aspark_error.sources= source_from_kafka
aspark_error.channels= mem_channel
aspark_error.sinks= hive_sink

## kafka source 
aspark_error.sources.source_from_kafka.type = org.apache.flume.source.kafka.KafkaSource
aspark_error.sources.source_from_kafka.channels = mem_channel
aspark_error.sources.source_from_kafka.batchSize = 5000
aspark_error.sources.source_from_kafka.batchDurationMillis = 2000
aspark_error.sources.source_from_kafka.kafka.bootstrap.servers = kafka01:9092,kafka02:9092,kafka03:9092,kafka04:9092,kafka05:9092,kafka06:9092
aspark_error.sources.source_from_kafka.kafka.topics = spark_error
aspark_error.sources.source_from_kafka.kafka.consumer.group.id=spark_error_gid01
aspark_error.sources.source_from_kafka.kafka.consumer.auto.offset.reset=latest

## memory channel 
aspark_error.channels.mem_channel.type = memory
aspark_error.channels.mem_channel.capacity = 128000000
aspark_error.channels.mem_channel.transactionCapacity = 100000
aspark_error.channels.mem_channel.byteCapacityBufferPercentage = 20

## hive sink 
aspark_error.sinks.hive_sink.type=hive
aspark_error.sinks.hive_sink.hive.metastore=thrift://namenode01:9083
aspark_error.sinks.hive_sink.hive.database=test
aspark_error.sinks.hive_sink.hive.table=spark_error
aspark_error.sinks.hive_sink.hive.txnsPerBatchAsk=100
aspark_error.sinks.hive_sink.hive.partition=%Y-%m-%d,%H
aspark_error.sinks.hive_sink.batchSize=1000
aspark_error.sinks.hive_sink.round=true
aspark_error.sinks.hive_sink.roundUnit=minute
aspark_error.sinks.hive_sink.roundValue=10
aspark_error.sinks.hive_sink.serializer=JSON
aspark_error.sinks.hive_sink.serializer.fieldnames=exceptionMessage,exceptionType,runFlag,exceptionTime

## bind source and sink to the channel
aspark_error.sources.source_from_kafka.channels=mem_channel
aspark_error.sinks.hive_sink.channel=mem_channel
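
Not a confirmed explanation, but the Flume Hive sink holds one open streaming connection (and hence an open delta file) per partition until that connection is closed. The properties below are an illustrative sketch layered on top of the configuration above, with assumed values, showing the knobs that control how many connections stay open and how their transactions are kept alive:

## Illustrative additions, not part of the original configuration
## cap concurrently open Hive streaming connections; the least recently used one is closed first
aspark_error.sinks.hive_sink.maxOpenConnections = 10
## heartbeat interval in seconds that keeps open transactions from expiring
aspark_error.sinks.hive_sink.heartBeatInterval = 240
## timeout in milliseconds for Hive metastore and streaming calls
aspark_error.sinks.hive_sink.callTimeout = 30000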
HQL for the Hive transactional table:

CREATE TABLE IF NOT EXISTS test.spark_error(
  exceptionMessage string, 
  exceptionType string, 
  runFlag string, 
  exceptionTime string
  )
partitioned by (dt string,hour string)
clustered by(exceptionType) into 1 buckets
row format delimited fields terminated by '\n'
stored as orc TBLPROPERTIES('transactional'='true');
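
To check whether Hive itself still regards the streaming transactions as open, or still holds locks on the table, the ACID introspection statements below can be run from beeline or the Hive CLI; a diagnostic sketch, not part of the original post:

USE test;
-- transactions Hive currently considers open or aborted
SHOW TRANSACTIONS;
-- locks currently held on the target table
SHOW LOCKS spark_error;
-- pending or running compactions on the delta directories
SHOW COMPACTIONS;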