Counting list values in a Spark DataFrame (Scala)
In Cassandra I have a column of list type. I'm new to Spark and Scala and don't know where to start. In Spark, I want to get the count of each value appearing in these lists — is that possible? Below is the DataFrame:
+--------------------+------------+
| id| data|
+--------------------+------------+
|53e5c3b0-8c83-11e...| [b, c]|
|508c1160-8c83-11e...| [a, b]|
|4d16c0c0-8c83-11e...| [a, b, c]|
|5774dde0-8c83-11e...|[a, b, c, d]|
+--------------------+------------+
I want the output to be:
+--------------------+------------+
| value | count |
+--------------------+------------+
|a | 3 |
|b | 4 |
|c | 3 |
|d | 1 |
+--------------------+------------+
Spark version: 1.4

You need something like this: assuming you already have (value, 1) pairs, `.reduceByKey(_ + _)` will return what you need. You can also try the following in the spark-shell:
sc.parallelize(Array[Integer](1,1,1,2,2),3).map(x=>(x,1)).reduceByKey(_+_).foreach(println)
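As a side note, here is a minimal plain-Scala sketch (no Spark involved) of what `map(x => (x, 1)).reduceByKey(_ + _)` computes on that data; `groupBy(identity)` is just a local stand-in for the shuffle:

```scala
// Same counting logic as the spark-shell one-liner, on plain Scala collections.
val xs = Seq(1, 1, 1, 2, 2)

// Group equal values together, then count how many landed in each group.
val counts: Map[Int, Int] =
  xs.groupBy(identity).map { case (value, occurrences) => value -> occurrences.size }

// counts: Map(1 -> 3, 2 -> 2)
```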
Here you go:
scala> val rdd = sc.parallelize(
         Seq(
           ("53e5c3b0-8c83-11e", Array("b", "c")),
           ("53e5c3b0-8c83-11e1", Array("a", "b")),
           ("53e5c3b0-8c83-11e2", Array("a", "b", "c")),
           ("53e5c3b0-8c83-11e3", Array("a", "b", "c", "d"))))
// rdd: org.apache.spark.rdd.RDD[(String, Array[String])] = ParallelCollectionRDD[22] at parallelize at <console>:27
scala> rdd.flatMap(_._2).map((_, 1)).reduceByKey(_ + _)
// res11: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[21] at reduceByKey at <console>:30
scala> rdd.flatMap(_._2).map((_,1)).reduceByKey(_ + _).collect
// res16: Array[(String, Int)] = Array((a,3), (b,4), (c,3), (d,1))
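For reference, the same flatMap-and-count pipeline can be sketched with plain Scala collections instead of an RDD, which is handy for checking the expected result locally without a SparkContext:

```scala
// Local equivalent of rdd.flatMap(_._2).map((_, 1)).reduceByKey(_ + _),
// using plain Scala collections in place of an RDD.
val rows = Seq(
  ("53e5c3b0-8c83-11e", Array("b", "c")),
  ("53e5c3b0-8c83-11e1", Array("a", "b")),
  ("53e5c3b0-8c83-11e2", Array("a", "b", "c")),
  ("53e5c3b0-8c83-11e3", Array("a", "b", "c", "d")))

// Flatten all the data arrays, then count occurrences of each value.
val counts: Map[String, Int] =
  rows.flatMap(_._2).groupBy(identity).map { case (v, occ) => v -> occ.size }

// counts: Map(a -> 3, b -> 4, c -> 3, d -> 1)
```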
With the DataFrame API this is actually quite easy too:
scala> val df = rdd.toDF("id", "data")
// res12: org.apache.spark.sql.DataFrame = ["id": string, "data": array<string>]
scala> df.select(explode($"data").as("value")).groupBy("value").count.show
// +-----+-----+
// |value|count|
// +-----+-----+
// | d| 1|
// | c| 3|
// | b| 4|
// | a| 3|
// +-----+-----+
Could you provide a PySpark implementation of this solution?