Can't load data from HDFS using the Python pyarrow library in a Docker container

python, apache-spark, hadoop, hdfs, pyarrow

I have also configured the necessary environment variables so that Python can read from HDFS:

export ARROW_LIBHDFS_DIR='/opt/hadoop/lib/native'
export HADOOP_COMMON_LIB_NATIVE_DIR='/opt/hadoop/lib/native'
export HADOOP_OPTS="-Djava.library.path=/opt/hadoop/lib/"
For
ls $ARROW_LIBHDFS_DIR
I get:

libhadoop.a   libhadooppipes.a    libhdfs.so        libnativetask.so
libhadoop.so  libhadooputils.a    libhdfs.so.0.0.0  libnativetask.so.1.0.0
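For reference, the same variables can also be set from inside Python before pyarrow loads libhdfs. This is only a sketch; the CLASSPATH line is an assumption on my part, based on libhdfs needing the Hadoop jars on the JVM classpath:

import os
import subprocess

# Same values as the shell exports above.
os.environ['ARROW_LIBHDFS_DIR'] = '/opt/hadoop/lib/native'
os.environ['HADOOP_COMMON_LIB_NATIVE_DIR'] = '/opt/hadoop/lib/native'
os.environ['HADOOP_OPTS'] = '-Djava.library.path=/opt/hadoop/lib/'

# Assumption: libhdfs also needs the Hadoop jars on the classpath;
# 'hadoop classpath --glob' expands them (requires hadoop on PATH).
os.environ['CLASSPATH'] = subprocess.check_output(
    ['hadoop', 'classpath', '--glob']).decode().strip()
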
My Python code:

import pandas as pd
pd.read_parquet('hdfs:///tmp/data/test.parquet', engine='pyarrow')
The error I get:

import pandas as pd
pd.read_parquet('hdfs:///tmp/data/test.parquet', engine='pyarrow')
WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

 hdfsGetPathInfo(hdfs:///tmp/data/test.parquet): getFileInfo error:
    ClassCastException: org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto cannot be cast to com.google.protobuf.Message
    java.lang.ClassCastException: org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto cannot be cast to com.google.protobuf.Message
            at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:225)
            at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
            at com.sun.proxy.$Proxy11.getFileInfo(Unknown Source)
            at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:900)
            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            at java.lang.reflect.Method.invoke(Method.java:498)
            at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
            at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
            at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
            at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
            at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
            at com.sun.proxy.$Proxy12.getFileInfo(Unknown Source)
            at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1654)
            at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1579)
            at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1576)
            at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
            at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1591)
            at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1734)
    hdfsGetPathInfo(hdfs:///tmp/data/test.parquet): getFileInfo error:
    IllegalStateException: java.lang.IllegalStateException
            at com.google.common.base.Preconditions.checkState(Preconditions.java:129)
            at org.apache.hadoop.ipc.Client.setCallIdAndRetryCount(Client.java:117)
            at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:162)
            at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
            at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
            at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
            at com.sun.proxy.$Proxy12.getFileInfo(Unknown Source)
            at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1654)
            at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1579)
            at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1576)
            at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
            at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1591)
            at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1734)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/lib/python3.5/dist-packages/pandas/io/parquet.py", line 296, in read_parquet
        return impl.read(path, columns=columns, **kwargs)
      File "/usr/local/lib/python3.5/dist-packages/pandas/io/parquet.py", line 125, in read
        path, columns=columns, **kwargs
      File "/usr/local/lib/python3.5/dist-packages/pyarrow/parquet.py", line 1544, in read_table
        partitioning=partitioning)
      File "/usr/local/lib/python3.5/dist-packages/pyarrow/parquet.py", line 1173, in __init__
        open_file_func=partial(_open_dataset_file, self._metadata)
      File "/usr/local/lib/python3.5/dist-packages/pyarrow/parquet.py", line 1368, in _make_manifest
        .format(path))
    OSError: Passed non-file path: hdfs:///tmp/data/test.parquet
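
The trailing OSError seems to come from the failed getFileInfo calls above: pyarrow could not confirm the path exists, so it rejects the hdfs:/// URI as a non-file path. For reference, here is a sketch of reading through an explicit HDFS connection instead (the namenode host and port are placeholders; pyarrow.hdfs.connect is the legacy API and is deprecated in newer pyarrow):

import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder namenode address; replace with the real host/port.
fs = pa.hdfs.connect(host='namenode', port=8020)

# Open the file through the HDFS handle and read it as a Parquet table.
with fs.open('/tmp/data/test.parquet', 'rb') as f:
    df = pq.read_table(f).to_pandas()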