Different output of Scala toDebugString and Python toDebugString


I am using the word count example code. There is a difference between the output of applying toDebugString to the RDD produced after the flatMap and map operations in Python (PySpark) and in Scala Apache Spark.

Python (PySpark) code =>
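A minimal PySpark sketch of this word count (mirroring the Scala code further down; the exact original snippet is an assumption, but the SparkContext sc and Readme.md path match the output below) looks roughly like this:

text_file = sc.textFile("/Users/Aarti/Documents/Fall2017/Code/spark-2.2.0/Readme.md")
# split each line into words, map each word to (word, 1), then sum the counts per word
countsRDD = text_file.flatMap(lambda line: line.split(" "))
mapRDD = countsRDD.map(lambda word: (word, 1))
reducedRDD = mapRDD.reduceByKey(lambda a, b: a + b)
print(mapRDD.toDebugString())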

Output:

(1) PythonRDD[3] at RDD at PythonRDD.scala:48 []
 |  /Users/Aarti/Documents/Fall2017/Code/spark-2.2.0/Readme.md MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0 []
 |  /Users/Aarti/Documents/Fall2017/Code/spark-2.2.0/Readme.md HadoopRDD[0] at textFile at NativeMethodAccessorImpl.java:0 []
Scala code =>

val text_file = sc.textFile("/Users/Aarti/Documents/Fall2017/Code/spark-2.2.0/Readme.md")
val countsRDD = text_file.flatMap(line => line.split(" "))
val mapRDD = countsRDD.map(word => (word, 1))
val reducedRDD = mapRDD.reduceByKey(_ + _)
print(mapRDD.toDebugString)
Output:

(1) MapPartitionsRDD[3] at map at Test.scala:67 []
 |  MapPartitionsRDD[2] at flatMap at Test.scala:66 []
 |  /Users/Aarti/Documents/Fall2017/Code/spark-2.2.0/Readme.md MapPartitionsRDD[1] at textFile at Test.scala:65 []
 |  /Users/Aarti/Documents/Fall2017/Code/spark-2.2.0/Readme.md HadoopRDD[0] at textFile at Test.scala:65 []
The Scala output clearly shows the two distinct RDDs produced by the flatMap and map operations. The Python output, by contrast, does not show these operations; more importantly, it shows only a single PythonRDD[3]. I assume PythonRDD[3] is generated by the map operation, yet it does not point to a preceding parent, PythonRDD[2], that would have been generated by the flatMap operation, the way the Scala output does.

Is there a way to find these missing links? Or does PySpark internally behave differently from Scala Spark?
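For what it's worth, a quick probe of the same pipeline (a hypothetical check, not part of the code above) shows which concrete classes PySpark builds for the intermediate RDDs and what each one reports on its own:

flatRDD = text_file.flatMap(lambda line: line.split(" "))
mapRDD = flatRDD.map(lambda word: (word, 1))
print(type(flatRDD))            # pyspark.rdd.PipelinedRDD
print(type(mapRDD))             # pyspark.rdd.PipelinedRDD
print(flatRDD.toDebugString())  # lineage of the flatMap step by itself
print(mapRDD.toDebugString())   # lineage after flatMap followed by map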