How can I access org.apache.hadoop.fs.FileUtil from pyspark?
I am trying to call org.apache.hadoop.fs.FileUtil.unTar directly from the pyspark shell. I know I can reach the underlying JVM (via py4j) through sc._jvm, but I am having trouble actually connecting to HDFS with it (even though my pyspark session is otherwise fully functional and can run jobs across the cluster).
For example:
hdpUntar = sc._jvm.org.apache.hadoop.fs.FileUtil.unTar
hdpFile = sc._jvm.java.io.File
root = hdpFile("hdfs://<url>/user/<file>")
target = hdpFile("hdfs://<url>/user/myuser/untar")
hdpUntar(root, target)
I later tried the same thing in Scala — it looks like the code just tries to extract the archive locally:
Py4JJavaError: An error occurred while calling z:org.apache.hadoop.fs.FileUtil.unTar.
: ExitCodeException exitCode=128: tar: Cannot connect to hdfs: resolve failed
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.fs.FileUtil.unTarUsingTar(FileUtil.java:675)
at org.apache.hadoop.fs.FileUtil.unTar(FileUtil.java:651)
at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
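The unTarUsingTar frame in the trace explains the failure: for a plain .tar archive, FileUtil.unTar shells out to the local tar binary, so the hdfs:// URL is passed to tar verbatim and tar cannot resolve it — it only reads local paths. One possible workaround (a sketch, not from the original post) is to copy the archive out of HDFS first, e.g. via the Hadoop FileSystem API through sc._jvm, and then extract it locally. The extraction step itself needs nothing beyond Python's tarfile module:

```python
import os
import tarfile


def untar_local(archive_path, target_dir):
    """Extract a local tar archive into target_dir; return member names.

    On a cluster you would first pull the archive out of HDFS, e.g.
    (sketch using py4j; the hdfs:// paths are placeholders from the
    question):

        fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(
            sc._jsc.hadoopConfiguration())
        fs.copyToLocalFile(
            sc._jvm.org.apache.hadoop.fs.Path("hdfs://<url>/user/<file>"),
            sc._jvm.org.apache.hadoop.fs.Path(archive_path))
    """
    os.makedirs(target_dir, exist_ok=True)
    with tarfile.open(archive_path) as tf:
        # Extract every member into target_dir.
        tf.extractall(target_dir)
        return tf.getnames()
```

This sidesteps FileUtil.unTar entirely once the file is local; alternatively you could keep using FileUtil.unTar, but only after copyToLocalFile, passing it local java.io.File paths rather than hdfs:// URLs.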