PySpark pandas UDF EOFError on macOS

Running a pandas UDF on macOS (Big Sur) fails with the error below, while exactly the same code runs fine on Google Colab. Plain Spark UDFs also work without issue.
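
For reference, a minimal pandas UDF along these lines is enough to hit the crash (a hypothetical sketch, not the exact job from the question; the failure occurs as soon as Spark ships the first Arrow batch to the Python worker):

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Any Arrow-backed pandas UDF will do; this one just increments a column.
@pandas_udf("long")
def plus_one(s: pd.Series) -> pd.Series:
    return s + 1

spark.range(10).select(plus_one("id")).show()  # crashes locally, works on Colab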

20/12/09 14:02:22 ERROR ArrowPythonRunner: Python worker exited unexpectedly (crashed)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/opt/apache-spark/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 585, in main
    eval_type = read_int(infile)
  File "/usr/local/opt/apache-spark/libexec/python/lib/pyspark.zip/pyspark/serializers.py", line 595, in read_int
    raise EOFError
EOFError

    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:503)
    at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:99)
    at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:49)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:456)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doAggregateWithoutKey_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
    at org.apache.spark.scheduler.Task.run(Task.scala:127)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.UnsupportedOperationException: sun.misc.Unsafe or java.nio.DirectByteBuffer.<init>(long, int) not available
    at io.netty.util.internal.PlatformDependent.directBuffer(PlatformDependent.java:490)
    at io.netty.buffer.NettyArrowBuf.getDirectBuffer(NettyArrowBuf.java:243)
    at io.netty.buffer.NettyArrowBuf.nioBuffer(NettyArrowBuf.java:233)
    at io.netty.buffer.ArrowBuf.nioBuffer(ArrowBuf.java:245)
    at org.apache.arrow.vector.ipc.message.ArrowRecordBatch.computeBodyLength(ArrowRecordBatch.java:222)
    at org.apache.arrow.vector.ipc.message.MessageSerializer.serialize(MessageSerializer.java:240)
    at org.apache.arrow.vector.ipc.ArrowWriter.writeRecordBatch(ArrowWriter.java:132)
    at org.apache.arrow.vector.ipc.ArrowWriter.writeBatch(ArrowWriter.java:120)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.$anonfun$writeIteratorToStream$1(ArrowPythonRunner.scala:94)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.writeIteratorToStream(ArrowPythonRunner.scala:101)
    at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:383)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1932)
    at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:218)
Additional code to set up PySpark on Colab:

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget https://apache.osuosl.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz
!tar xf spark-3.0.1-bin-hadoop3.2.tgz
!pip install -q findspark

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.1-bin-hadoop3.2"
Local environment:

findspark==1.4.2
numpy==1.18.5
pandas==1.1.4
pyarrow==2.0.0
pyspark==3.0.1
  • Python: 3.8.5
  • Platform: macOS-10.16-x86_64-i386-64bit
  • OpenJDK 64-Bit Server VM, Java 11.0.9

    • The problem is Java 11: the Caused by line in the trace (sun.misc.Unsafe or java.nio.DirectByteBuffer.<init>(long, int) not available) comes from Arrow/Netty needing direct-buffer access that Java 9+ restricts, which kills the Arrow writer and leaves the Python worker reading EOF. For future reference: downgrading to Java 8 solved the problem.

      This is what worked for me:
      findspark==1.4.2
      numpy==1.18.5
      pandas==1.1.4
      pyarrow==2.0.0
      pyspark==3.0.1
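
On macOS, one way to apply the fix without uninstalling Java 11 is to point JAVA_HOME at a Java 8 install before the SparkSession is created, mirroring the Colab setup above. A sketch, assuming a 1.8 JDK is already installed and resolvable via Apple's /usr/libexec/java_home helper:

import os
import subprocess

# Resolve the home of an installed Java 8 JDK (assumes one exists on this machine).
java8_home = subprocess.check_output(
    ["/usr/libexec/java_home", "-v", "1.8"], text=True
).strip()
os.environ["JAVA_HOME"] = java8_home

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()  # JVM now starts on Java 8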