Apache Spark: Why does PySpark randomly fail with a "Socket is closed" error?


I just took a PySpark training course, and I'm compiling a script of example lines of code (which explains why the block of code does nothing useful on its own). Every time I run this code, the error below appears once or twice, and the line that throws it changes from run to run. I have tried setting spark.executor.memory and spark.executor.heartbeatInterval, but the error persists. I have also tried putting .cache() at the end of various lines, with no change.
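
For reference, a minimal sketch of how such settings can be applied through SparkConf before the context is created; the values below are only illustrative, not the ones actually tried:

from pyspark import SparkConf, SparkContext

# Illustrative values only -- the exact values tried are not shown in the question.
conf = (SparkConf()
        .setMaster("local[*]")
        .setAppName("MyProcessName")
        .set("spark.executor.memory", "2g")
        .set("spark.executor.heartbeatInterval", "60s"))
sc = SparkContext(conf = conf)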

The error:

16/09/21 10:29:32 ERROR Utils: Uncaught exception in thread stdout writer for python
java.net.SocketException: Socket is closed
        at java.net.Socket.shutdownOutput(Socket.java:1551)
        at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3$$anonfun$apply$4.apply$mcV$sp(PythonRDD.scala:344)
        at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3$$anonfun$apply$4.apply(PythonRDD.scala:344)
        at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3$$anonfun$apply$4.apply(PythonRDD.scala:344)
        at org.apache.spark.util.Utils$.tryLog(Utils.scala:1870)
        at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:344)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1857)
        at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
The code:

from pyspark import SparkConf, SparkContext

def parseLine(line):
    fields = line.split(',')
    return (int(fields[0]), float(fields[2]))

def parseGraphs(line):
    fields = line.split()
    return (fields[0]), [int(n) for n in fields[1:]]

# putting the [*] after local makes it run one executor on each core of your local PC
conf = SparkConf().setMaster("local[*]").setAppName("MyProcessName")

sc = SparkContext(conf = conf)

# parse the raw data and map it to an rdd.
# each item in this rdd is a tuple
# two methods to get the exact same data:
########## All of these methods can use lambda or full methods in the same way ##########
# read in a text file
customerOrdersLines = sc.textFile("file:///SparkCourse/customer-orders.csv")
customerOrdersRdd = customerOrdersLines.map(parseLine)
customerOrdersRdd = customerOrdersLines.map(lambda l: (int(l.split(',')[0]), float(l.split(',')[2])))
print customerOrdersRdd.take(1)

# countByValue groups identical values and counts them
salesByCustomer = customerOrdersRdd.map(lambda sale: sale[0]).countByValue()
print salesByCustomer.items()[0]

# use flatMap to cut everything up by whitespace
bookText = sc.textFile("file:///SparkCourse/Book.txt")
bookRdd = bookText.flatMap(lambda l: l.split())
print bookRdd.take(1)

# create key/value pairs that will allow for more complex uses
names = sc.textFile("file:///SparkCourse/marvel-names.txt")
namesRdd = names.map(lambda line: (int(line.split('\"')[0]), line.split('\"')[1].encode("utf8")))
print namesRdd.take(1)

graphs = sc.textFile("file:///SparkCourse/marvel-graph.txt")
graphsRdd = graphs.map(parseGraphs)
print graphsRdd.take(1)

# this will append "extra text" to each name.
# this is faster than a normal map because it doesn't give you access to the keys
extendedNamesRdd = namesRdd.mapValues(lambda heroName: heroName + "extra text")
print extendedNamesRdd.take(1)

# not the best example because the costars is already a list of integers
# but this should return a list, which will update the values
flattenedCostarsRdd = graphsRdd.flatMapValues(lambda costars: costars)
print flattenedCostarsRdd.take(1)

# put the heroes in ascending index order
sortedHeroes = namesRdd.sortByKey()
print sortedHeroes.take(1)

# to sort heroes by alphabetical order, we switch key/value to value/key, then sort
alphabeticalHeroes = namesRdd.map(lambda (key, value): (value, key)).sortByKey()
print alphabeticalHeroes.take(1)

# make sure that "spider" is in the name of the hero
spiderNames = namesRdd.filter(lambda (id, name): "spider" in name.lower())
print spiderNames.take(1)

# reduce by key keeps the key and performs aggregation methods on the values.  in this example, taking the sum
combinedGraphsRdd = flattenedCostarsRdd.reduceByKey(lambda value1, value2: value1 + value2)
print combinedGraphsRdd.take(1)

# broadcast: this is accessible from any executor
sentData = sc.broadcast(["this can be accessed by all executors", "access it using sentData"])

# accumulator:  this is synced across all executors
hitCounter = sc.accumulator(0)

Disclaimer: I haven't spent enough time on this part of Spark's codebase, but let me give you some hints that may lead to a solution. What follows only explains where to search for more information, not a solution to the issue.


The exception you are facing is caused by some other issue in the code (as you can see from the line java.net.Socket.shutdownOutput(Socket.java:1551), which is hit when worker.shutdownOutput() is executed).

That leads me to believe that this error is a follow-up to some other, earlier error.

stdout writer for python is the name of the thread that (uses a physical operator and) is responsible for the communication between Spark and PySpark (which is how you can execute Python code without many changes).

In fact, that part of the code reveals quite a lot about the underlying communication infrastructure that PySpark uses internally, which relies on sockets to an external Python process.

Python evaluation works by sending the necessary (projected) input data via a socket to an external Python process, and combining the result from the Python process with the original row.
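
As a toy illustration of that pattern only (this is not Spark's actual protocol or any of its classes, just the general shape): a driver-side writer streams rows over a local socket to an external Python worker, which applies a Python function and streams results back to be paired with the original rows.

# Toy illustration of "ship rows to an external Python process over a socket".
# NOT Spark's wire protocol -- a thread stands in for the external process here.
import socket
import threading

def worker(port, func):
    # Pretend external Python process: read one line per row, reply with func(row).
    sock = socket.create_connection(("127.0.0.1", port))
    rfile = sock.makefile("r")
    wfile = sock.makefile("w")
    while True:
        line = rfile.readline().rstrip("\n")
        if line == "":                      # blank line marks end of input (toy convention)
            break
        wfile.write(func(line) + "\n")
        wfile.flush()
    sock.close()

server = socket.socket()
server.bind(("127.0.0.1", 0))               # let the OS pick a free port
server.listen(1)
port = server.getsockname()[1]

t = threading.Thread(target=worker, args=(port, lambda s: s.upper()))
t.start()

conn, _ = server.accept()
rfile = conn.makefile("r")
wfile = conn.makefile("w")

rows = ["spark", "pyspark"]
results = []
for row in rows:
    wfile.write(row + "\n")                 # ship the (projected) input row
    wfile.flush()
    results.append((row, rfile.readline().rstrip("\n")))  # combine reply with the original row

wfile.write("\n")                           # tell the worker we are done
wfile.flush()
t.join()
conn.close()
server.close()

print(results)                              # [('spark', 'SPARK'), ('pyspark', 'PYSPARK')]

If the external process dies or the stream is torn down early, the writer side is left holding a closed socket, which fits the suggestion above that the socket error is a symptom of an earlier failure rather than the root cause.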

Moreover, python is used by default unless overridden with PYSPARK_DRIVER_PYTHON or PYSPARK_PYTHON (as you can see in the pyspark shell script). That is the name that appears in the name of the failing thread:

16/09/21 10:29:32 ERROR Utils: Uncaught exception in thread stdout writer for python
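
If the interpreter is the suspect, here is a minimal sketch (an assumption, not something taken from the question) of pinning the worker interpreter explicitly before the SparkContext is created; the path is a placeholder for whatever Python is installed on your machine:

import os

# Placeholder path -- point this at the interpreter the Python workers should use.
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python2.7"

# PYSPARK_DRIVER_PYTHON is only consulted by the pyspark / spark-submit launcher scripts,
# so it is normally exported in the shell before launching rather than set here.

from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local[*]").setAppName("MyProcessName")
sc = SparkContext(conf = conf)

The point of the sketch is simply to make explicit which interpreter the stdout writer for python thread is talking to.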

I'd recommend checking the version of Python on your system with

python -c 'import sys; print(sys.version_info)'

It may simply be that you are using a very recent Python that has not been well tested with Spark. Just a guess...



You should include the entire log of the PySpark application's run; I would hope to find the answer in there.

Could you tell me at which step it returns the error? Do any of your prints work?

You may have confused the source and destination ports. The default connection mode is Any (available) >> destination port; perhaps the default port is 80 and the connection on port 80 fails. I strongly recommend inspecting the client and server connections with Wireshark.

What Spark version is this? Can you start pyspark and type a few commands without errors?

This is Windows, isn't it? How do you execute the code above? Do you have Python installed on the machine? python -c 'import sys; print(sys.version_info)'