Apache spark 如何加入3rdd';s使用Spark Scala
我想使用Apache spark 如何加入3rdd';s使用Spark Scala,apache-spark,hadoop,apache-spark-sql,bigdata,rdd,Apache Spark,Hadoop,Apache Spark Sql,Bigdata,Rdd,我想使用sparkrdd加入3个表。我使用spark sql实现了我的目标,但是当我尝试使用Rdd加入它时,我没有得到期望的结果。下面是我使用sparksql和输出进行的查询: scala> actorDF.as("df1").join(movieCastDF.as("df2"),$"df1.act_id"===$"df2.act_id").join(movieDF.as("df3"),$"df2.mov_id"===$"df3.mov_id"). filter(col("df3.mov_
sparkrdd
加入3个表。我使用spark sql实现了我的目标,但是当我尝试使用Rdd加入它时,我没有得到期望的结果。下面是我使用sparksql
和输出进行的查询:
scala> actorDF.as("df1").join(movieCastDF.as("df2"),$"df1.act_id"===$"df2.act_id").join(movieDF.as("df3"),$"df2.mov_id"===$"df3.mov_id").
filter(col("df3.mov_title")==="Annie Hall").select($"df1.act_fname",$"df1.act_lname",$"df2.role").show(false)
+---------+---------+-----------+
|act_fname|act_lname|role |
+---------+---------+-----------+
|Woody |Allen |Alvy Singer|
+---------+---------+-----------+
现在,我为三个数据集创建了pairedRDDs
,如下所示:
scala> val actPairedRdd=actRdd.map(_.split("\t",-1)).map(p=>(p(0),(p(1),p(2),p(3))))
scala> actPairedRdd.take(5).foreach(println)
(101,(James,Stewart,M))
(102,(Deborah,Kerr,F))
(103,(Peter,OToole,M))
(104,(Robert,De Niro,M))
(105,(F. Murray,Abraham,M))
scala> val movieCastPairedRdd=movieCastRdd.map(_.split("\t",-1)).map(p=>(p(0),(p(1),p(2))))
movieCastPairedRdd: org.apache.spark.rdd.RDD[(String, (String, String))] = MapPartitionsRDD[318] at map at <console>:29
scala> movieCastPairedRdd.foreach(println)
(101,(901,John Scottie Ferguson))
(102,(902,Miss Giddens))
(103,(903,T.E. Lawrence))
(104,(904,Michael))
(105,(905,Antonio Salieri))
(106,(906,Rick Deckard))
scala> val moviePairedRdd=movieRdd.map(_.split("\t",-1)).map(p=>(p(0),(p(1),p(2),p(3),p(4),p(5),p(6))))
moviePairedRdd: org.apache.spark.rdd.RDD[(String, (String, String, String, String, String, String))] = MapPartitionsRDD[322] at map at <console>:29
scala> moviePairedRdd.take(2).foreach(println)
(901,(Vertigo,1958,128,English,1958-08-24,UK))
(902,(The Innocents,1961,100,English,1962-02-19,SW))
我得到的是空白记录。那么,我错在哪里呢??提前感谢像这样加入RDD是痛苦的,这是DFs更好的另一个原因
由于RDD=K对没有数据,V没有上一个RDD的K部分的公共数据。与101102的K将加入,但与901902没有共同之处。你需要改变一些事情,比如我的例子:
val rdd1 = sc.parallelize(Seq(
(101,("James","Stewart","M")),
(102,("Deborah","Kerr","F")),
(103,("Peter","OToole","M")),
(104,("Robert","De Niro","M"))
))
val rdd2 = sc.parallelize(Seq(
(101,(901,"John Scottie Ferguson")),
(102,(902,"Miss Giddens")),
(103,(903,"T.E. Lawrence")),
(104,(904,"Michael"))
))
val rdd3 = sc.parallelize(Seq(
(901,("Vertigo",1958 )),
(902,("The Innocents",1961))
))
val rdd4 = rdd1.join(rdd2)
val new_rdd4 = rdd4.keyBy(x => x._2._2._1) // Redefine Key for join with rdd3
val rdd5 = rdd3.join(new_rdd4)
rdd5.collect
返回:
res14: Array[(Int, ((String, Int), (Int, ((String, String, String), (Int, String)))))] = Array((901,((Vertigo,1958),(101,((James,Stewart,M),(901,John Scottie Ferguson))))), (902,((The Innocents,1961),(102,((Deborah,Kerr,F),(902,Miss Giddens))))))
你需要通过地图删除数据,我留给你。默认情况下的内部连接。像这样与RDD连接是痛苦的,这是DFs更好的另一个原因
由于RDD=K对没有数据,V没有上一个RDD的K部分的公共数据。与101102的K将加入,但与901902没有共同之处。你需要改变一些事情,比如我的例子:
val rdd1 = sc.parallelize(Seq(
(101,("James","Stewart","M")),
(102,("Deborah","Kerr","F")),
(103,("Peter","OToole","M")),
(104,("Robert","De Niro","M"))
))
val rdd2 = sc.parallelize(Seq(
(101,(901,"John Scottie Ferguson")),
(102,(902,"Miss Giddens")),
(103,(903,"T.E. Lawrence")),
(104,(904,"Michael"))
))
val rdd3 = sc.parallelize(Seq(
(901,("Vertigo",1958 )),
(902,("The Innocents",1961))
))
val rdd4 = rdd1.join(rdd2)
val new_rdd4 = rdd4.keyBy(x => x._2._2._1) // Redefine Key for join with rdd3
val rdd5 = rdd3.join(new_rdd4)
rdd5.collect
返回:
res14: Array[(Int, ((String, Int), (Int, ((String, String, String), (Int, String)))))] = Array((901,((Vertigo,1958),(101,((James,Stewart,M),(901,John Scottie Ferguson))))), (902,((The Innocents,1961),(102,((Deborah,Kerr,F),(902,Miss Giddens))))))
你需要通过地图删除数据,我留给你。默认内部连接。是的,谢谢@thebluephantom。你让我开心。我知道使用DFs要简单得多,但我想知道如何使用RDD也能做到这一点。你们能帮我一下吗?是的,谢谢@thebluephantom。你让我开心。我知道使用DFs这要简单得多,但我想知道如何使用RDD也能做到这一点。你们能帮我解决这个问题吗