Python Spark fails at `bytesInJava = self._jrdd.collect().iterator()`

I'm running into a problem using Spark for certain jobs.

If I simply run
`spark-submit test.py`
locally, the Python job works perfectly. However, it fails when I run
`spark-submit --master yarn-client test.py`

test.py looks like:

from pyspark import SparkContext

sc = SparkContext(appName='behaviour')

def train_rdd_op_log(type=0, dir='/user/xxxxx'):
    # Read the raw log files and split each line on the '\x01' delimiter
    rdd_data = sc.textFile(dir)
    rdd_data = rdd_data.map(training_format_text_log)
    return rdd_data

def training_format_text_log(text_log):
    return text_log.split('\x01')

if __name__ == '__main__':
    result = train_rdd_op_log()
    result = result.collect()
    print(result[0:10])
Each line of the file read from dir looks like this:

u'aaa\x01bbb\x01ccc'

I just want to split the string on '\x01'.
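The split itself has nothing Spark-specific about it; '\x01' (the SOH control character) is an ordinary one-character delimiter to `str.split`, which can be checked locally without a cluster:

```python
# '\x01' is just a one-character delimiter as far as str.split is concerned
line = u'aaa\x01bbb\x01ccc'
fields = line.split('\x01')
print(fields)  # ['aaa', 'bbb', 'ccc']
```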

The job fails at the
`collect()`
part. The error message is:

Traceback (most recent call last):

    File "/mnt/work/preprocessing/test.py", line 146, in <module>
    result = result.collect()
    File "/mnt/cloudera/parcels/CDH-5.3.8-1.cdh5.3.8.p0.5/lib/spark/python/pyspark/rdd.py", line 676, in collect
    bytesInJava = self._jrdd.collect().iterator()
    File "/mnt/cloudera/parcels/CDH-5.3.8-1.cdh5.3.8.p0.5/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
    File "/mnt/cloudera/parcels/CDH-5.3.8-1.cdh5.3.8.p0.5/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
    py4j.protocol.Py4JJavaError: An error occurred while calling o29.collect.
    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 5, stats-hadoop23): java.net.SocketException: Connection reset
        at java.net.SocketInputStream.read(SocketInputStream.java:196)
        at java.net.SocketInputStream.read(SocketInputStream.java:122)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
        at java.io.DataInputStream.readInt(DataInputStream.java:387)
        at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:110)
        at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:174)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
        at org.apache.spark.scheduler.Task.run(Task.scala:56)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

OK, I figured it out. The problem was with the spark-submit invocation; it should be called as: `spark-submit xx.py --master yarn-client`


The xx.py file should come immediately after spark-submit.
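For reference, the two invocations side by side, using the script name from the question. Note that on standard Spark builds options are usually accepted before the application file as well, so this ordering requirement should be treated as the asker's finding for their specific environment rather than a general rule:

```shell
# Failed in the asker's environment: options before the script
spark-submit --master yarn-client test.py

# Reported to work: application file first, then the options
spark-submit test.py --master yarn-client
```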

How are we supposed to help you if we can't see your code? – Thanks, zero323. I thought this was a general problem in Python Spark. I have now attached the most concise code that reproduces the same error. Hope this helps. Please let me know where I posted the question so I can make some further updates.