Cannot connect to Hadoop in Docker from Apache NiFi


I am trying to run the following Apache NiFi flow to put data from Kafka into HDFS.

My Hadoop instance is the Cloudera quickstart container, started with:

docker run --hostname=quickstart.cloudera --privileged=true -t -i -p 8888:8888 -p 7180:7180 -p 80:80 -p 50070:50070 -p 8020:8020 -p 50010:50010 -p 50020:50020 -p 50075:50075 -p 50475:50475 -p 50090:50090 -p 50495:50495 -v $(pwd):/home/cloudera -w /home/cloudera cloudera/quickstart /usr/bin/docker-quickstart

My Kafka instance is Confluent Kafka.

When NiFi tries to put data into HDFS, I get the error below. NiFi is able to connect to HDFS successfully (my configuration files are included below for reference).

Based on my initial research, it looks like the NameNode cannot communicate with the DataNode, but the addresses in my hdfs-site.xml appear to be correct. I have also exposed the ports on my machine so that NiFi can talk to Hadoop without joining the Docker network.
org.apache.nifi.processor.exception.ProcessException: IOException thrown from PutHDFS[id=07704347-0165-1000-b8f7-b53809532c9a]: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /topics/users/.10180050823815 could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and 1 node(s) are excluded in this operation.
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1595)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3287)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:677)
    at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.addBlock(AuthorizationProviderProxyClientProtocol.java:213)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:485)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)

    at org.apache.nifi.controller.repository.StandardProcessSession.read(StandardProcessSession.java:2234)
    at org.apache.nifi.controller.repository.StandardProcessSession.read(StandardProcessSession.java:2179)
    at org.apache.nifi.processors.hadoop.PutHDFS$1.run(PutHDFS.java:299)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:360)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1942)
    at org.apache.nifi.processors.hadoop.PutHDFS.onTrigger(PutHDFS.java:229)
    at org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
    at org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1165)
    at org.apache.nifi.controller.tasks.ConnectableTask.invoke(ConnectableTask.java:203)
    at org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:117)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.ipc.RemoteException: File /topics/users/.10180050823815 could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and 1 node(s) are excluded in this operation.
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1595)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3287)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:677)
    at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.addBlock(AuthorizationProviderProxyClientProtocol.java:213)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:485)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)

    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1491)
    at org.apache.hadoop.ipc.Client.call(Client.java:1437)
    at org.apache.hadoop.ipc.Client.call(Client.java:1347)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
    at com.sun.proxy.$Proxy151.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:496)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
    at com.sun.proxy.$Proxy152.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.DFSOutputStream.addBlock(DFSOutputStream.java:1031)
    at org.apache.hadoop.hdfs.DataStreamer.locateFollowingBlock(DataStreamer.java:1865)
    at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1668)
    at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:716)
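Given the "could only be replicated to 0 nodes" message above, a quick sanity check is to see whether HDFS can write at all from inside the cluster, independent of NiFi. A sketch, assuming a standard quickstart install where the `hdfs` CLI is on the path inside the container:

```shell
# Inside the Cloudera quickstart container (e.g. via `docker exec -it <container> bash`):

# Report the DataNodes as the NameNode sees them; "Datanodes available: 0"
# or a node listed as excluded/dead points at DataNode registration problems.
hdfs dfsadmin -report

# Try a small write from inside the cluster. If this also fails, the problem
# is HDFS itself rather than NiFi's network path to it.
echo test | hdfs dfs -put - /tmp/write-check
hdfs dfs -cat /tmp/write-check
```

If the in-container write succeeds but NiFi's write fails, the issue is most likely that the client (NiFi) cannot reach the DataNode's data-transfer port on the address the NameNode hands back.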
I have set up the HDFS instance with the following configuration files:

core-site.xml

<?xml version="1.0"?>
<!--
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://10.0.1.28:8020</value>
  </property>

  <!-- OOZIE proxy user setting -->
  <property>
    <name>hadoop.proxyuser.oozie.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.oozie.groups</name>
    <value>*</value>
  </property>

  <!-- HTTPFS proxy user setting -->
  <property>
    <name>hadoop.proxyuser.httpfs.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.httpfs.groups</name>
    <value>*</value>
  </property>

  <!-- Llama proxy user setting -->
  <property>
    <name>hadoop.proxyuser.llama.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.llama.groups</name>
    <value>*</value>
  </property>

  <!-- Hue proxy user setting -->
  <property>
    <name>hadoop.proxyuser.hue.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hue.groups</name>
    <value>*</value>
  </property>

</configuration>
hdfs-site.xml

<?xml version="1.0"?>
<!--
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <!-- Immediately exit safemode as soon as one DataNode checks in. 
       On a multi-node cluster, these configurations must be removed.  -->
  <property>
    <name>dfs.safemode.extension</name>
    <value>0</value>
  </property>
  <property>
     <name>dfs.safemode.min.datanodes</name>
     <value>1</value>
  </property>
  <property>
     <name>dfs.permissions.enabled</name>
     <value>false</value>
  </property>
  <property>
     <name>dfs.permissions</name>
     <value>false</value>
  </property>
  <property>
     <name>dfs.webhdfs.enabled</name>
     <value>true</value>
  </property>
  <property>
     <name>hadoop.tmp.dir</name>
     <value>/var/lib/hadoop-hdfs/cache/${user.name}</value>
  </property>
  <property>
     <name>dfs.namenode.name.dir</name>
     <value>/var/lib/hadoop-hdfs/cache/${user.name}/dfs/name</value>
  </property>
  <property>
     <name>dfs.namenode.checkpoint.dir</name>
     <value>/var/lib/hadoop-hdfs/cache/${user.name}/dfs/namesecondary</value>
  </property>
  <property>
     <name>dfs.datanode.data.dir</name>
     <value>/var/lib/hadoop-hdfs/cache/${user.name}/dfs/data</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-bind-host</name>
    <value>10.0.1.28</value>
  </property>

  <property>
    <name>dfs.namenode.servicerpc-address</name>
    <value>10.0.1.28:8022</value>
  </property>
  <property>
    <name>dfs.https.address</name>
    <value>10.0.1.28:50470</value>
  </property>
  <property>
    <name>dfs.namenode.http-address</name>
    <value>10.0.1.28:50070</value>
  </property>
  <property>
    <name>dfs.datanode.address</name>
    <value>10.0.1.28:50010</value>
  </property>
  <property>
    <name>dfs.datanode.ipc.address</name>
    <value>10.0.1.28:50020</value>
  </property>
  <property>
    <name>dfs.datanode.http.address</name>
    <value>10.0.1.28:50075</value>
  </property>
  <property>
    <name>dfs.datanode.https.address</name>
    <value>10.0.1.28:50475</value>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>10.0.1.28:50090</value>
  </property>
  <property>
    <name>dfs.namenode.secondary.https-address</name>
    <value>10.0.1.28:50495</value>
  </property>

  <!-- Impala configuration -->
  <property>
    <name>dfs.datanode.hdfs-blocks-metadata.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.client.file-block-storage-locations.timeout.millis</name>
    <value>10000</value>
  </property>
  <property>
    <name>dfs.client.read.shortcircuit</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.domain.socket.path</name>
    <value>/var/run/hadoop-hdfs/dn._PORT</value>
  </property>
</configuration>

If the PutHDFS processor's Directory property is configured with a plain path such as /path/to/your/files, you may need to change it to a full URI that includes the NameNode hostname/IP and port, i.e. something like hdfs://namenode-host:port/path/to/your/files

If your NiFi container is on the same Docker network, you should not be using hard-coded IP addresses.

My suggestion is to edit the Compose file (or make a separate Compose file) and reshape your docker run command into Compose form.

For example:

cloudera-cdh:
  image: cloudera/quickstart 
  command: /usr/bin/docker-quickstart
  ports:
    - ...
  volumes: 
    - $PWD:/home/cloudera
Do the same for the NiFi container.
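A minimal sketch of what the combined Compose file could look like. The `nifi` service name, the `apache/nifi` image, and the port selection are assumptions for illustration; adjust them to whatever you actually run:

```yaml
version: "2"
services:
  cloudera-cdh:
    image: cloudera/quickstart
    hostname: quickstart.cloudera
    privileged: true
    command: /usr/bin/docker-quickstart
    ports:
      - "8020:8020"     # HDFS NameNode RPC
      - "50010:50010"   # DataNode data transfer
      - "50070:50070"   # NameNode web UI
      - "50075:50075"   # DataNode web UI
    volumes:
      - .:/home/cloudera

  nifi:
    image: apache/nifi   # assumed image; substitute your own NiFi image
    ports:
      - "8080:8080"
    depends_on:
      - cloudera-cdh

# Both services join the default network Compose creates, so NiFi can
# resolve the HDFS side by its service name, e.g. hdfs://cloudera-cdh:8020.
```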

Then your hdfs-site.xml file should be able to reference hdfs://cloudera-cdh:50070 by service name over the Docker network.


Note: you can accomplish something similar with docker network create [name] and by passing --network [name] to docker run.
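For the non-Compose route, that could look like the following sketch (the network and container names here are made up for illustration, and the NiFi image is assumed):

```shell
# Create a user-defined bridge network; containers on it can resolve
# each other by container name.
docker network create hadoop-net

# Attach the quickstart container to it (other flags as in the original run command).
docker run --network hadoop-net --name cloudera-cdh \
    --hostname=quickstart.cloudera --privileged=true -t -i \
    cloudera/quickstart /usr/bin/docker-quickstart

# Attach NiFi to the same network; it can then reach HDFS at
# hdfs://cloudera-cdh:8020 by container name.
docker run --network hadoop-net --name nifi -p 8080:8080 apache/nifi
```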



FWIW, if you only need HDFS, there are better Hadoop containers that do not include the full CDH stack (the bde2020 or uhopper images).

Why not use Confluent's HDFS Kafka connector?

@cricket_007 I want to standardize all of my data routing and flows through NiFi rather than having to manage connectors. I plan to add other data sources later that will also store data in HDFS.

Your initial research is correct: the NameNode cannot communicate with the DataNode, so this is likely a network or configuration problem. Check for firewall issues, or post the four configuration files: core-site.xml, yarn-site.xml, mapred-site.xml, and hdfs-site.xml. Before that, make sure the NameNode can reach the DataNode from the NameNode machine itself.

That's fair. That said, you could use NiFi to put those external sources into Kafka, rather than only consuming from Kafka with NiFi. I have personally found that Connect scales better than NiFi.

By fixing my docker-compose file as you mentioned, I managed to get NiFi writing to HDFS. Unfortunately, my Cloudera container keeps exiting with code 0. I suspect it may be due to resource limits on my machine (MBP i7 4-core, 16 GB RAM). There are no errors on the NiFi side, and the last two lines of the Cloudera logs look like:

cloudera | Started Impala Server (impalad): [OK]
cloudera exited with code 0

Also, regarding NiFi vs. Kafka: I will try the connector and see which works better. Since Kafka is the main stream-processing engine in my architecture, your point about using NiFi to load data into Kafka stands. I originally had Kafka -> HDFS working that way, but had not tried it with dockerized Kafka and HDFS.

Fixed the issue I had with Cloudera shutting down myself: tty: true.