Scala: find the aggregate sum of duplicates in Spark
Input:
Name1 Name2
arjun deshwal
nikhil choubey
anshul pandyal
arjun deshwal
arjun deshwal
deshwal arjun
Code used in Scala:
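import org.apache.spark.sql.functions.{count, lit}

// Read the CSV file, treating the first row as the header
val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .load(FILE_PATH)
// Count rows per (Name1, Name2) pair; column order matters to groupBy
val result = df.groupBy("Name1", "Name2")
  .agg(count(lit(1)).alias("cnt"))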
Output obtained:
nikhil choubey 1
anshul pandyal 1
deshwal arjun 1
arjun deshwal 3
Desired output:
nikhil choubey 1
anshul pandyal 1
deshwal arjun 4
or:
nikhil choubey 1
anshul pandyal 1
arjun deshwal 4
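I would handle this with a Set. A Set has no ordering, so two keys compare equal whenever they contain the same names, whichever column each name came from: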
scala> val data = Array(
| ("arjun", "deshwal"),
| ("nikhil", "choubey"),
| ("anshul", "pandyal"),
| ("arjun", "deshwal"),
| ("arjun", "deshwal"),
| ("deshwal", "arjun")
| )
data: Array[(String, String)] = Array((arjun,deshwal), (nikhil,choubey), (anshul,pandyal), (arjun,deshwal), (arjun,deshwal), (deshwal,arjun))
scala> val distData = sc.parallelize(data)
distData: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[0] at parallelize at <console>:29
scala> val distDataSets = distData.map(tup => (Set(tup._1, tup._2), 1)).countByKey()
distDataSets: scala.collection.Map[scala.collection.immutable.Set[String],Long] = Map(Set(nikhil, choubey) -> 1, Set(arjun, deshwal) -> 4, Set(anshul, pandyal) -> 1)
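The same idea can be tried without a Spark context. The following is a minimal sketch using plain Scala collections, not the code from the answer above: the object name `UnorderedPairCount` and the helpers `normalize` and `countPairs` are hypothetical names introduced for illustration. Instead of a Set, it puts each pair into a canonical (alphabetical) order, which keeps two separate name fields even when both names happen to be identical.

```scala
object UnorderedPairCount {
  // Put the two names into a canonical order so that (a, b) and (b, a)
  // end up as the same key. (Hypothetical helper, not part of Spark.)
  def normalize(pair: (String, String)): (String, String) = {
    val (a, b) = pair
    if (a.compareTo(b) <= 0) (a, b) else (b, a)
  }

  // Group the pairs by their normalized form and count each group.
  def countPairs(pairs: Seq[(String, String)]): Map[(String, String), Int] =
    pairs.groupBy(normalize).map { case (key, group) => key -> group.size }

  def main(args: Array[String]): Unit = {
    val data = Seq(
      ("arjun", "deshwal"),
      ("nikhil", "choubey"),
      ("anshul", "pandyal"),
      ("arjun", "deshwal"),
      ("arjun", "deshwal"),
      ("deshwal", "arjun")
    )
    // Print one "name name count" line per distinct unordered pair.
    countPairs(data).foreach { case ((a, b), n) => println(s"$a $b $n") }
  }
}
```

Note that the normalized key is always in alphabetical order, so the pair entered as nikhil choubey is reported as choubey nikhil. With the DataFrame from the question, a similar normalization could presumably be expressed by grouping on `least` and `greatest` of the two name columns (both are in `org.apache.spark.sql.functions`) before counting, though that variant is not shown or tested here.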
Hope this helps.
Cool! This shows how important it is to think in terms of the right data model when trying to find a solution.