
Python: PySpark throws TypeError on a simple .map() call


I'm using PySpark to do some simple transformations and keep running into a "'bool' object is not callable" error. The Spark version is 1.3.0.

I've seen this issue come up in a few other places, but the only suggestion there was to verify that the major Python version is consistent between the driver and the workers, which I've already done (both run the Anaconda distribution of Python 2.7.10).
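For reference, one way to check this from inside PySpark itself is to ask the executors which interpreter they are running and compare it to the driver. This is only a minimal sketch (not part of the original post), assuming sc is the already-created SparkContext:

import sys

# Sanity check: compare the driver's interpreter with whatever the executors launch.
driver_version = sys.version

def worker_version(_):
    import sys  # imported on the worker side
    return sys.version

worker_versions = (
    sc.parallelize(range(sc.defaultParallelism), sc.defaultParallelism)
      .map(worker_version)
      .distinct()
      .collect()
)

print("driver : " + driver_version)
print("workers: " + str(worker_versions))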

To debug it, I've been working with the iris dataset stored in HDFS:

data = sc.textFile("/path/to/iris.csv")
data.count()  # works fine, returns 150
data.map(lambda x: x[:2])  # just subsets the string, works fine
data.map(lambda x: x.split(','))  # throws error below
These (apparently) only fail when the map is actually evaluated, i.e. when I call .collect(), .take(), or .count() on the result. So I'm basically looking for any further ideas / things to try in order to get this configured correctly.
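To make the failure mode concrete, here is a minimal illustration (using the same data RDD as above): the transformation itself returns immediately, and only the action forces evaluation.

parsed = data.map(lambda x: x.split(','))  # lazy: returns a new RDD, no error yet
parsed.count()                             # action: evaluates the map and raises the TypeError below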

15/09/28 17:55:08 INFO YarnScheduler: Removed TaskSet 14.0, whose tasks have all completed, from pool: 
An error occurred while calling o135.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1     in stage 14.0 failed 4 times, most recent failure: Lost task 1.3 in stage 14.0:     org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/jars/spark-assembly-1.3.0-cdh5.4.5-hadoop2.6.0-cdh5.4.5.jar/pyspark/worker.py", line 101, in main
process()
File "/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/jars/spark-assembly-1.3.0-cdh5.4.5-hadoop2.6.0-cdh5.4.5.jar/pyspark/worker.py", line 96, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/lib/spark/python/pyspark/rdd.py", line 2253, in pipeline_func
return func(split, prev_func(split, iterator))
File "/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/lib/spark/python/pyspark/rdd.py", line 2253, in pipeline_func
return func(split, prev_func(split, iterator))
File "/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/lib/spark/python/pyspark/rdd.py", line 2253, in pipeline_func
return func(split, prev_func(split, iterator))
File "/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/lib/spark/python/pyspark/rdd.py", line 270, in func
return f(iterator)
File "/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/lib/spark/python/pyspark/rdd.py", line 933, in <lambda>
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/lib/spark/python/pyspark/rdd.py", line 933, in <genexpr>
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "<stdin>", line 1, in <lambda>
**TypeError: 'bool' object is not callable**

    at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:135)
    at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:176)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:94)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

How did you check the Python version? Typically PySpark will use the system version of Python (/usr/bin/python) unless you explicitly specify the PYSPARK_PYTHON environment variable.

Hmm, basically by checking the Python on the path (which python). I also tried setting PYSPARK_PYTHON in the spark-env.sh script.

But where are you executing which python? If you run it in the master process (where you also set up your SparkConf and SparkContext), it will return the path you want. However, unless PYSPARK_PYTHON is set, the workers may fall back to the default system Python. For my own sanity, I always make sure the PYSPARK_PYTHON environment variable is set in the same script that configures everything else. See if that helps.
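A minimal sketch of that suggestion (not from the original thread): set PYSPARK_PYTHON before the SparkContext is created, so the executors launch the same interpreter as the driver. The Anaconda path below is hypothetical and would need to exist on every node; the same variable can alternatively be exported in spark-env.sh.

import os

# Hypothetical interpreter path; point it at the Python 2.7.10 Anaconda install
# present on all cluster nodes. Must be set before the SparkContext exists.
os.environ["PYSPARK_PYTHON"] = "/opt/anaconda/bin/python"

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("iris-debug")
sc = SparkContext(conf=conf)

data = sc.textFile("/path/to/iris.csv")
print(data.map(lambda x: x.split(',')).take(2))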