I can't load data from HDFS. I am using the Python pyarrow library, and in the Docker container I also configured the parameters needed for Python to read from HDFS:
export ARROW_LIBHDFS_DIR='/opt/hadoop/lib/native'
export HADOOP_COMMON_LIB_NATIVE_DIR='/opt/hadoop/lib/native'
export HADOOP_OPTS="-Djava.library.path=/opt/hadoop/lib/"
For `ls $ARROW_LIBHDFS_DIR` I get:
libhadoop.a libhadooppipes.a libhdfs.so libnativetask.so
libhadoop.so libhadooputils.a libhdfs.so.0.0.0 libnativetask.so.1.0.0
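One quick sanity check worth doing before anything else: confirm that these variables are actually visible inside the Python process, since exports made in one shell do not always reach a process started elsewhere (e.g. a notebook kernel or supervisor). A minimal sketch, assuming the container paths above:

```python
import os

# Values below mirror the exports from the container setup above.
expected = {
    "ARROW_LIBHDFS_DIR": "/opt/hadoop/lib/native",
    "HADOOP_COMMON_LIB_NATIVE_DIR": "/opt/hadoop/lib/native",
}

for name, value in expected.items():
    # If a variable is missing here, the shell exports never reached
    # this Python process; setdefault fills it in for this run only.
    os.environ.setdefault(name, value)
    print(name, "=", os.environ[name])
```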
My Python code:
import pandas as pd
pd.read_parquet('hdfs:///tmp/data/test.parquet', engine='pyarrow')
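One detail about the path: `hdfs:///tmp/data/test.parquet` uses a triple slash, i.e. an empty host, so the client has to resolve the namenode from the Hadoop configuration (`fs.defaultFS`). A quick standard-library check makes this visible:

```python
from urllib.parse import urlparse

uri = urlparse("hdfs:///tmp/data/test.parquet")
print(uri.scheme)  # "hdfs"
print(uri.netloc)  # "" -- no namenode host embedded in the URI
print(uri.path)    # "/tmp/data/test.parquet"
```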
The error I get:
WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
hdfsGetPathInfo(hdfs:///tmp/data/test.parquet): getFileInfo error:
ClassCastException: org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto cannot be cast to com.google.protobuf.Message
java.lang.ClassCastException: org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto cannot be cast to com.google.protobuf.Message
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:225)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy11.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:900)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
at com.sun.proxy.$Proxy12.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1654)
at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1579)
at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1576)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1591)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1734)
hdfsGetPathInfo(hdfs:///tmp/data/test.parquet): getFileInfo error:
IllegalStateException: java.lang.IllegalStateException
at com.google.common.base.Preconditions.checkState(Preconditions.java:129)
at org.apache.hadoop.ipc.Client.setCallIdAndRetryCount(Client.java:117)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:162)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
at com.sun.proxy.$Proxy12.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1654)
at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1579)
at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1576)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1591)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1734)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.5/dist-packages/pandas/io/parquet.py", line 296, in read_parquet
return impl.read(path, columns=columns, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/pandas/io/parquet.py", line 125, in read
path, columns=columns, **kwargs
File "/usr/local/lib/python3.5/dist-packages/pyarrow/parquet.py", line 1544, in read_table
partitioning=partitioning)
File "/usr/local/lib/python3.5/dist-packages/pyarrow/parquet.py", line 1173, in __init__
open_file_func=partial(_open_dataset_file, self._metadata)
File "/usr/local/lib/python3.5/dist-packages/pyarrow/parquet.py", line 1368, in _make_manifest
.format(path))
OSError: Passed non-file path: hdfs:///tmp/data/test.parquet
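For context, two things seem to be going on here, stated tentatively: the `ClassCastException ... cannot be cast to com.google.protobuf.Message` usually points at a jar conflict on the JVM classpath rather than at the Python code (libhdfs generally needs `CLASSPATH` built with `hadoop classpath --glob`), and the final `OSError: Passed non-file path` suggests this pyarrow version cannot parse an `hdfs://` URI directly in `read_parquet`. A hedged sketch of working around the latter by connecting explicitly and passing a plain path plus a filesystem object (`split_hdfs_uri` and `read_hdfs_parquet` are hypothetical helper names, not part of any library):

```python
from urllib.parse import urlparse


def split_hdfs_uri(uri):
    """Split an hdfs:// URI into (host, port, path).

    An empty host means "use fs.defaultFS from the Hadoop config",
    which pyarrow's connect API spells as host="default".
    """
    parsed = urlparse(uri)
    host = parsed.hostname or "default"
    port = parsed.port or 0
    return host, port, parsed.path


def read_hdfs_parquet(uri):
    # Sketch only: connect explicitly, then hand pyarrow a plain path
    # plus a filesystem object instead of a full hdfs:// URI.
    import pyarrow.hdfs  # legacy API; newer releases use pyarrow.fs.HadoopFileSystem
    import pyarrow.parquet as pq

    host, port, path = split_hdfs_uri(uri)
    fs = pyarrow.hdfs.connect(host=host, port=port)
    return pq.read_table(path, filesystem=fs).to_pandas()
```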