Pentaho Hadoop File Input

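I am trying to retrieve data from a standalone Hadoop HDFS (version 2.7.2, configured with the default properties) using Pentaho Kettle (version 6.0.1.0-386). Pentaho and Hadoop are not on the same machine, but each machine can reach the other.

I created a new "Hadoop File Input" step with the following properties: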

Columns: Environment, File/Folder, Wildcard, Required, Include subfolders
Row: url-to-file, N

The url-to-file is built as follows: ${PROTOCOL}://${USER}:${PASSWORD}@${IP}:${PORT}${PATH_TO_FILE}

For example: hdfs://hadoop:@ip:50010/user/hadoop/red_libelium/Ikusi/libelium_waspmote_AC_libelium_waspmote/libelium_waspmote_AC_libelium_waspmote.txt

The password is empty.
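The template above can be sketched as a small helper. This is only an illustration of how the pieces fit together when the password is empty; the function name `build_hdfs_url` and the short sample path are assumptions made for the example, not something Pentaho provides.

```python
def build_hdfs_url(protocol, user, password, host, port, path):
    """Fill in ${PROTOCOL}://${USER}:${PASSWORD}@${IP}:${PORT}${PATH_TO_FILE}."""
    # An empty password still leaves the ':' separator in place,
    # which gives the "hadoop:@" form used in the example URL.
    return f"{protocol}://{user}:{password}@{host}:{port}{path}"


# Hypothetical sample path, used only to keep the example short.
url = build_hdfs_url("hdfs", "hadoop", "", "172.21.0.35", 50010,
                     "/user/hadoop/sample.txt")
print(url)  # hdfs://hadoop:@172.21.0.35:50010/user/hadoop/sample.txt
```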

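I checked that the file exists in HDFS and that it downloads correctly both through the web manager and from the hadoop command line.

Scenario A) When I use ${PROTOCOL}=hdfs and ${PORT}=50010, I get errors in both the Pentaho and the Hadoop consoles:

Pentaho: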

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2016/04/05 15:23:46 - FileInputList - ERROR (version 6.0.1.0-386, build 1 from 2015-12-03 11.37.25 by buildguy) : org.apache.commons.vfs2.FileSystemException: Could not list the contents of folder "hdfs://hadoop@172.21.0.35:50010/user/hadoop/red_libelium/Ikusi/libelium_waspmote_AC_2_libelium_waspmote/libelium_waspmote_AC_2_libelium_waspmote.txt".
2016/04/05 15:23:46 - FileInputList -   at org.apache.commons.vfs2.provider.AbstractFileObject.getChildren(AbstractFileObject.java:1193)
2016/04/05 15:23:46 - FileInputList -   at org.pentaho.di.core.fileinput.FileInputList.createFileList(FileInputList.java:243)
2016/04/05 15:23:46 - FileInputList -   at org.pentaho.di.core.fileinput.FileInputList.createFileList(FileInputList.java:142)
2016/04/05 15:23:46 - FileInputList -   at org.pentaho.di.trans.steps.textfileinput.TextFileInputMeta.getTextFileList(TextFileInputMeta.java:1580)
2016/04/05 15:23:46 - FileInputList -   at org.pentaho.di.trans.steps.textfileinput.TextFileInput.init(TextFileInput.java:1513)
2016/04/05 15:23:46 - FileInputList -   at org.pentaho.di.trans.step.StepInitThread.run(StepInitThread.java:69)
2016/04/05 15:23:46 - FileInputList -   at java.lang.Thread.run(Thread.java:745)
2016/04/05 15:23:46 - FileInputList - Caused by: java.io.EOFException: End of File Exception between local host is: "EI001115/192.168.231.248"; destination host is: "172.21.0.35":50010; : java.io.EOFException; For more details see:  http://wiki.apache.org/hadoop/EOFException
2016/04/05 15:23:46 - FileInputList -   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
2016/04/05 15:23:46 - FileInputList -   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
2016/04/05 15:23:46 - FileInputList -   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
2016/04/05 15:23:46 - FileInputList -   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.ipc.Client.call(Client.java:1472)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.ipc.Client.call(Client.java:1399)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
2016/04/05 15:23:46 - FileInputList -   at com.sun.proxy.$Proxy70.getListing(Unknown Source)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:554)
2016/04/05 15:23:46 - FileInputList -   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
2016/04/05 15:23:46 - FileInputList -   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
2016/04/05 15:23:46 - FileInputList -   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
2016/04/05 15:23:46 - FileInputList -   at java.lang.reflect.Method.invoke(Method.java:606)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
2016/04/05 15:23:46 - FileInputList -   at com.sun.proxy.$Proxy71.getListing(Unknown Source)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1969)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1952)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:693)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:105)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:755)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:751)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:751)
2016/04/05 15:23:46 - FileInputList -   at com.pentaho.big.data.bundles.impl.shim.hdfs.HadoopFileSystemImpl$9.call(HadoopFileSystemImpl.java:126)
2016/04/05 15:23:46 - FileInputList -   at com.pentaho.big.data.bundles.impl.shim.hdfs.HadoopFileSystemImpl$9.call(HadoopFileSystemImpl.java:124)
2016/04/05 15:23:46 - FileInputList -   at com.pentaho.big.data.bundles.impl.shim.hdfs.HadoopFileSystemImpl.callAndWrapExceptions(HadoopFileSystemImpl.java:200)
2016/04/05 15:23:46 - FileInputList -   at com.pentaho.big.data.bundles.impl.shim.hdfs.HadoopFileSystemImpl.listStatus(HadoopFileSystemImpl.java:124)
2016/04/05 15:23:46 - FileInputList -   at org.pentaho.big.data.impl.vfs.hdfs.HDFSFileObject.doListChildren(HDFSFileObject.java:115)
2016/04/05 15:23:46 - FileInputList -   at org.apache.commons.vfs2.provider.AbstractFileObject.getChildren(AbstractFileObject.java:1184)
2016/04/05 15:23:46 - FileInputList -   ... 6 more
2016/04/05 15:23:46 - FileInputList - Caused by: java.io.EOFException
2016/04/05 15:23:46 - FileInputList -   at java.io.DataInputStream.readInt(DataInputStream.java:392)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1071)
2016/04/05 15:23:46 - FileInputList -   at org.apache.hadoop.ipc.Client$Connection.run(Client.java:966)
2016/04/05 15:23:48 - cfgbuilder - Warning: The configuration parameter [org] is not supported by the default configuration builder for scheme: sftp

Hadoop:

2016-04-05 14:22:56,045 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: fiware-hadoop:50010:DataXceiver error processing unknown operation  src: /192.168.231.248:62961 dst: /172.21.0.35:50010
java.io.IOException: Version Mismatch (Expected: 28, Received: 26738 )
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.readOp(Receiver.java:60)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:229)
        at java.lang.Thread.run(Thread.java:745)
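
One possible reading of the DataNode message above, offered as an interpretation rather than something confirmed in the post: the DataNode reads the first two bytes of an incoming connection as a 16-bit protocol version and expected 28. The received value 26738 is 0x6872, which is ASCII "hr", the start of the "hrpc" header that a Hadoop RPC client sends. That would be consistent with the client speaking the NameNode RPC protocol to port 50010, which in a default Hadoop 2.x setup is the DataNode data-transfer port. The helper name below is made up for the illustration.

```python
import struct

RPC_HEADER = b"hrpc"  # magic bytes a Hadoop RPC client sends first


def version_as_bytes(value):
    """Return the two bytes that, read as a big-endian short, give `value`."""
    return struct.pack(">H", value)


received = version_as_bytes(26738)  # the "Received" value from the log
print(received)                         # b'hr'
print(RPC_HEADER.startswith(received))  # True
```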
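Other scenarios) In the other cases, using different port numbers (50070, 9000, …), I only get an error from Pentaho; the standalone Hadoop does not seem to receive any request.

From reading some of the Pentaho documentation, it seems the Big Data plugin is built against Hadoop v2.2.x, while I am trying to connect to 2.7.2. Could that be the source of the problem? Is there a plugin that works with later versions? Or is my HDFS file URL simply wrong?

Thank you all for your time; any hint will be more than welcome.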

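I am answering my own question because I solved the problem, and the explanation is too long for a simple comment.

The problem was solved by making a few changes to the Hadoop configuration.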

  • I changed the configuration in core-site.xml

    From:

        <property>
            <name>fs.default.name</name>
            <value>hdfs://hadoop:9000</value>
        </property>

    To:

        <property>
            <name>fs.default.name</name>
            <value>hdfs://server_ip_address:8020</value>
        </property>

    Because port 9000 was giving me problems, I ended up changing to port 8020.

  • Open port 8020 (in case there are firewall rules)
  • The Pentaho Kettle transformation URL looks like this: ${PROTOCOL}://${USER}:${PASSWORD}@${HOST}:${PORT}${FILE_PATH}, where ${PORT} is now 8020
  • This way I was able to preview the data in HDFS from the Pentaho transformation

Thank you all for your time.
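
To check that the port used in the Pentaho URL matches the one the NameNode is configured with, the host and port can be read back out of core-site.xml. The following is a minimal sketch, assuming the file has the usual `<configuration>` root element; the function name `namenode_endpoint` is hypothetical, and the lookup also accepts `fs.defaultFS`, the newer name for `fs.default.name` in Hadoop 2.x.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# Inline copy of the changed property, wrapped in an assumed <configuration> root.
CORE_SITE = """<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://server_ip_address:8020</value>
    </property>
</configuration>"""


def namenode_endpoint(core_site_xml):
    """Return (host, port) of the default filesystem declared in core-site.xml."""
    root = ET.fromstring(core_site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name") in ("fs.defaultFS", "fs.default.name"):
            uri = urlsplit(prop.findtext("value").strip())
            return uri.hostname, uri.port
    raise KeyError("no default filesystem property found")


host, port = namenode_endpoint(CORE_SITE)
print(host, port)  # server_ip_address 8020
```

The returned host and port are the values that ${HOST} and ${PORT} in the transformation URL need to agree with.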