Scala: find aggregate sum of duplicates in Spark

Input:

Name1   Name2
arjun   deshwal
nikhil  choubey
anshul  pandyal
arjun   deshwal
arjun   deshwal
deshwal arjun

Code used in Scala:

val df = sqlContext.read.format("com.databricks.spark.csv")
                   .option("header", "true")
                   .load(FILE_PATH)
val result = df.groupBy("Name1", "Name2")
               .agg(count(lit(1)).alias("cnt"))
Output obtained:

nikhil choubey 1
anshul pandyal 1
deshwal arjun 1
arjun deshwal 3

Desired output:

nikhil choubey 1
anshul pandyal 1
deshwal arjun 4

or

nikhil choubey 1
anshul pandyal 1
arjun deshwal 4

I would handle it with a Set, which carries no ordering, so only the contents of the sets are compared:

scala> val data = Array(
 |     ("arjun",   "deshwal"),
 |     ("nikhil",  "choubey"),
 |     ("anshul",  "pandyal"),
 |     ("arjun",   "deshwal"),
 |     ("arjun",   "deshwal"),
 |     ("deshwal", "arjun")
 | )
data: Array[(String, String)] = Array((arjun,deshwal), (nikhil,choubey), (anshul,pandyal), (arjun,deshwal), (arjun,deshwal), (deshwal,arjun))

scala> val distData = sc.parallelize(data)
distData: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[0] at parallelize at <console>:29

scala> val distDataSets = distData.map(tup => (Set(tup._1, tup._2), 1)).countByKey()
distDataSets: scala.collection.Map[scala.collection.immutable.Set[String],Long] = Map(Set(nikhil, choubey) -> 1, Set(arjun, deshwal) -> 4, Set(anshul, pandyal) -> 1)
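
A note on scale: countByKey returns a plain Map on the driver, so for large inputs it can be preferable to keep the counts distributed with reduceByKey. A minimal sketch reusing the distData above (the collect at the end is only for printing):

val counts = distData
  .map { case (a, b) => (Set(a, b), 1L) }  // Set(a, b) ignores the order of the two names
  .reduceByKey(_ + _)                      // counts stay in an RDD instead of a driver-side Map

counts.collect().foreach(println)
// (Set(nikhil, choubey),1)
// (Set(arjun, deshwal),4)
// (Set(anshul, pandyal),1)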

Hope this helps.
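
For completeness, the same normalization is possible without leaving the DataFrame API used in the question. A sketch assuming Spark's built-in least and greatest functions (available since Spark 1.5), which put the two names into a canonical order before grouping:

import org.apache.spark.sql.functions.{col, count, greatest, least, lit}

// Put the alphabetically smaller name first, so ("arjun", "deshwal") and
// ("deshwal", "arjun") end up in the same group.
val result = df
  .withColumn("first",  least(col("Name1"), col("Name2")))
  .withColumn("second", greatest(col("Name1"), col("Name2")))
  .groupBy("first", "second")
  .agg(count(lit(1)).alias("cnt"))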


Cool! This shows the importance of thinking in terms of the right data model when looking for a solution.