Apache Spark: unable to view output [apache-spark, pyspark, databricks]


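I have just started learning Apache Spark. I am trying to print the contents of links, but for some reason nothing is displayed. I have also tried links.collect() and display(links), and neither works. Any help would be appreciated.

Full stack trace for the second image: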

 Py4JJavaError                             Traceback (most recent call last)
  <ipython-input-34-01e857cfa45e> in <module>()
  ----> 1 for link in links.collect():
        2         print("%s" %(link))

  /databricks/spark/python/pyspark/rdd.py in collect(self)
      769         """
      770         with SCCallSiteSync(self.context) as css:
  --> 771             port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
      772         return list(_load_from_socket(port, self._jrdd_deserializer))
      773 

  /databricks/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
      811         answer = self.gateway_client.send_command(command)
      812         return_value = get_return_value(
  --> 813             answer, self.gateway_client, self.target_id, self.name)
      814 
      815         for temp_arg in temp_args:

  /databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
       43     def deco(*a, **kw):
       44         try:
  ---> 45             return f(*a, **kw)
       46         except py4j.protocol.Py4JJavaError as e:
       47             s = e.java_exception.toString()

  /databricks/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
      306                 raise Py4JJavaError(
      307                     "An error occurred while calling {0}{1}{2}.\n".
  --> 308                     format(target_id, ".", name), value)
      309             else:
      310                 raise Py4JError(

  Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
  : org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 24.0 failed 1 times, most recent failure: Lost task 4.0 in stage 24.0 (TID 76, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
    File "/databricks/spark/python/pyspark/worker.py", line 111, in main
      process()
    File "/databricks/spark/python/pyspark/worker.py", line 106, in process
      serializer.dump_stream(func(split_index, iterator), outfile)
    File "/databricks/spark/python/pyspark/rdd.py", line 2346, in pipeline_func
      return func(split, prev_func(split, iterator))
    File "/databricks/spark/python/pyspark/rdd.py", line 2346, in pipeline_func
      return func(split, prev_func(split, iterator))
    File "/databricks/spark/python/pyspark/rdd.py", line 317, in func
      return f(iterator)
    File "/databricks/spark/python/pyspark/rdd.py", line 1776, in combineLocally
      merger.mergeValues(iterator)
    File "/databricks/spark/python/pyspark/shuffle.py", line 236, in mergeValues
      for k, v in iterator:
    File "<ipython-input-31-4b09041aa30b>", line 1, in <lambda>
    File "<ipython-input-28-f43debc22073>", line 3, in parseNeighbors
    File "/databricks/python/lib/python2.7/re.py", line 171, in split
      return _compile(pattern, flags).split(string, maxsplit)
  TypeError: expected string or buffer

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:342)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:46)
    at org.apache.spark.scheduler.Task.run(Task.scala:96)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:222)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

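Your map function is wrong. The error is the same one you get from

re.split(r'\s+',[('a b')])
Try using

parts=re.split(r'\s+',urls[0])
which will pass

re.split(r'\s+',('a b'))

The whole row is passed to the map function, so you need to access the cells by indexing into it, e.g. map(lambda row: (row[0], row[1]))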
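The point of the answer can be reproduced without Spark. The following is a minimal sketch in plain Python: `row` is a hypothetical one-column row, and `parse_neighbors` is a stand-in written for illustration (the body of the asker's `parseNeighbors` is not shown in the question, so this is an assumption about its shape). It shows that `re.split` rejects a container and accepts the string stored in the first cell.

```python
import re

# Hypothetical one-column row, standing in for what map() receives per record.
row = ('a b',)


def parse_neighbors(urls):
    """Split the first cell of a row on whitespace and return the first two parts."""
    parts = re.split(r'\s+', urls[0])
    return parts[0], parts[1]


# Passing a container instead of a string raises TypeError, as in the traceback.
try:
    re.split(r'\s+', [row[0]])
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Indexing into the row first hands re.split a plain string.
pair = parse_neighbors(row)

print(error_message)
print(pair)
```

The exact wording of the TypeError differs between Python versions (the traceback above comes from Python 2.7), but the cause is the same: the pattern is applied to something that is not a string.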

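@shekhar I tried that, as you can see in the second image. It still doesn't work.

Post the full traceback. In fact, nothing is computed until you call collect, so the error most likely comes from an earlier step and only surfaces at the end.

@Marmouse I have edited the question.

I changed that, and I still get an error when I try to print it: TypeError: not all arguments converted during string formatting

Try again, it works with sqlContext.createDataFrame([('a b',),('a c',)],['website links']).map(lambda urls: [re.split(r'\s+',urls[0])[0],re.split(r'\s+',urls[0])[1]]).collect()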
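The second error mentioned in the comments, "not all arguments converted during string formatting", is consistent with the print statement in the question: once each element of links is a multi-element tuple or list, "%s" % (link) treats the tuple as the argument list for a single placeholder. A minimal sketch, assuming a hypothetical two-element `link` value:

```python
# Hypothetical element of links after the map function has been fixed.
link = ('a', 'b')

# (link) is just link, so the two items are matched against one %s.
try:
    text = "%s" % (link)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Wrapping the value in a one-element tuple formats it as a whole.
as_tuple = "%s" % (link,)

# Alternatively, give each field its own placeholder.
as_fields = "%s %s" % link

print(error_message)
print(as_tuple)
print(as_fields)
```

Calling print(link) directly, without any format string, avoids the problem as well.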