Different output of Scala toDebugString and Python toDebugString


I am using the word count example code. There is a difference between the output of applying toDebugString to the RDD produced after the flatMap and map operations in Python (PySpark) and in Scala Apache Spark.

Python (PySpark) code =>
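A minimal PySpark sketch of this word count (mirroring the Scala code further down; the exact original snippet is an assumption, but the SparkContext sc and Readme.md path match the output below) looks roughly like this:

text_file = sc.textFile("/Users/Aarti/Documents/Fall2017/Code/spark-2.2.0/Readme.md")
# split each line into words, map each word to (word, 1), then sum the counts per word
countsRDD = text_file.flatMap(lambda line: line.split(" "))
mapRDD = countsRDD.map(lambda word: (word, 1))
reducedRDD = mapRDD.reduceByKey(lambda a, b: a + b)
print(mapRDD.toDebugString())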

Output:

(1) PythonRDD[3] at RDD at PythonRDD.scala:48 []
 |  /Users/Aarti/Documents/Fall2017/Code/spark-2.2.0/Readme.md MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0 []
 |  /Users/Aarti/Documents/Fall2017/Code/spark-2.2.0/Readme.md HadoopRDD[0] at textFile at NativeMethodAccessorImpl.java:0 []
Scala code =>

val text_file = sc.textFile("/Users/Aarti/Documents/Fall2017/Code/spark-2.2.0/Readme.md")
val countsRDD = text_file.flatMap(line => line.split(" "))
val mapRDD = countsRDD.map(word => (word, 1))
val reducedRDD = mapRDD.reduceByKey(_ + _)
print(mapRDD.toDebugString)
Output:

(1) MapPartitionsRDD[3] at map at Test.scala:67 []
 |  MapPartitionsRDD[2] at flatMap at Test.scala:66 []
 |  /Users/Aarti/Documents/Fall2017/Code/spark-2.2.0/Readme.md MapPartitionsRDD[1] at textFile at Test.scala:65 []
 |  /Users/Aarti/Documents/Fall2017/Code/spark-2.2.0/Readme.md HadoopRDD[0] at textFile at Test.scala:65 []
The Scala output clearly shows the two distinct RDDs produced by the flatMap and map operations. The Python output, by contrast, does not show these operations; more importantly, it shows only a single PythonRDD[3]. I assume PythonRDD[3] is generated by the map operation, yet it does not point to a preceding parent, PythonRDD[2], that would have been generated by the flatMap operation, the way the Scala output does.

Is there a way to find these missing links? Or does PySpark internally behave differently from Scala Spark?
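For what it's worth, a quick probe of the same pipeline (a hypothetical check, not part of the code above) shows which concrete classes PySpark builds for the intermediate RDDs and what each one reports on its own:

flatRDD = text_file.flatMap(lambda line: line.split(" "))
mapRDD = flatRDD.map(lambda word: (word, 1))
print(type(flatRDD))            # pyspark.rdd.PipelinedRDD
print(type(mapRDD))             # pyspark.rdd.PipelinedRDD
print(flatRDD.toDebugString())  # lineage of the flatMap step by itself
print(mapRDD.toDebugString())   # lineage after flatMap followed by map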