Scala SPARK: how to merge two DataFrames on a condition given by a function result?

Tags: scala, apache-spark, dataframe, apache-spark-sql, transpose

The condition: a record should only be merged if dataFrameToAdd.label != dataFrameMain.label where the distance res is less than 0.0002.

case class Schema(name: String, label: String, lat: Double, lon: Double)

// toDF() on an RDD of case classes needs the SQLContext implicits in scope
import sqlContext.implicits._

val dataFrameMain = sc.parallelize(Array(
  Schema("recordA", "house", 54.78049, -1.57679),
  Schema("recordB", "hotel", 52.02724, -2.16572),
  Schema("recordC", "hotel", 52.51423, -1.97814),
  Schema("recordD", "house", 51.46966, -0.45227),
  Schema("recordE", "house", 50.91608, -1.45803),
  Schema("recordF", "house", 52.59754, -1.07599)
)).toDF()

val dataFrameToAdd = sc.parallelize(Array(
  Schema("recordAduplicate", "house", 54.780705, -1.576777),
  Schema("recordBnotDuplicate", "hotel", 54.783477, -1.57986)
)).toDF()

// plain Euclidean distance between two lat/lon pairs
def distance(latDF: Double, lonDF: Double, latNEW: Double, lonNEW: Double): Double = {
  val dx = latNEW - latDF
  val dy = lonNEW - lonDF
  math.sqrt(dx * dx + dy * dy)
}

import org.apache.spark.sql.functions.udf

// register the function so it can be called from Spark SQL as distance(...)
sqlContext.udf.register("distance", distance(_: Double, _: Double, _: Double, _: Double): Double)
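
Side note: the udf helper imported above can also wrap the same function as a Column expression for use with the DataFrame API; a minimal sketch (the name distanceUdf is mine, not part of the original post):

// Hypothetical wrapper around the same distance function, usable directly
// in DataFrame expressions (e.g. join conditions) instead of going through SQL.
val distanceUdf = udf(distance _)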
I don't know how to approach this problem. Should I apply a transpose function, or maybe use an MLlib matrix data structure?
As output for this example, recordBnotDuplicate from dataFrameToAdd should be merged into dataFrameMain because its distance is greater than 0.0002, but not recordAduplicate, because it has the same label as recordA from dataFrameMain and its distance is less than 0.0002.

After registering the UDF, register each DataFrame as a temp table and use a left join to select all the records of b that do not match any record of a; then union the result with a:

dataFrameMain.registerTempTable("a")
dataFrameToAdd.registerTempTable("b")

val withoutDuplicates: DataFrame = sqlContext.sql(
  """
    |SELECT b.*
    |FROM b
    |LEFT JOIN a ON a.label = b.label AND distance(a.lat, a.lon, b.lat, b.lon) <= 0.002
    |WHERE a.name IS NULL
  """.stripMargin)


// append the surviving (non-duplicate) records from b to a
val result = withoutDuplicates.unionAll(dataFrameMain)
result.show()
+-------------------+-----+---------+--------+
|               name|label|      lat|     lon|
+-------------------+-----+---------+--------+
|recordBnotDuplicate|hotel|54.783477|-1.57986|
|            recordA|house| 54.78049|-1.57679|
|            recordB|hotel| 52.02724|-2.16572|
|            recordC|hotel| 52.51423|-1.97814|
|            recordD|house| 51.46966|-0.45227|
|            recordE|house| 50.91608|-1.45803|
|            recordF|house| 52.59754|-1.07599|
+-------------------+-----+---------+--------+
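
For completeness, the same anti-join plus union can be expressed with the DataFrame API instead of SQL. This is only a sketch under stated assumptions: it reuses the hypothetical distanceUdf wrapper from earlier and the 0.002 threshold from the query above, and is not from the original answer.

import org.apache.spark.sql.functions.col

// Keep only the rows of b (dataFrameToAdd) that have no same-label neighbour
// within the threshold, then append them to a (dataFrameMain).
val withoutDuplicatesApi = dataFrameToAdd.as("b")
  .join(dataFrameMain.as("a"),
    col("a.label") === col("b.label") &&
      distanceUdf(col("a.lat"), col("a.lon"), col("b.lat"), col("b.lon")) <= 0.002,
    "left_outer")
  .where(col("a.name").isNull)
  .select(col("b.name"), col("b.label"), col("b.lat"), col("b.lon"))

val resultApi = withoutDuplicatesApi.unionAll(dataFrameMain)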