
Counting list values in a Scala Spark DataFrame


In Cassandra I have a column of list type. I am new to Spark and Scala and don't know where to start. In Spark I want to get the count of each value across all the lists; is this possible? The DataFrame is shown below:

+--------------------+------------+
|                  id|        data|
+--------------------+------------+
|53e5c3b0-8c83-11e...|      [b, c]|
|508c1160-8c83-11e...|      [a, b]|
|4d16c0c0-8c83-11e...|   [a, b, c]|
|5774dde0-8c83-11e...|[a, b, c, d]|
+--------------------+------------+
I want the output to be:

+--------------------+------------+
|   value            |      count |
+--------------------+------------+
|a                   |      3     |
|b                   |      4     |
|c                   |      3     |
|d                   |      1     |
+--------------------+------------+

Spark version: 1.4
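For context, loading such a Cassandra table into a DataFrame might look like the sketch below. This assumes the spark-cassandra-connector is on the classpath and that you are in spark-shell (so sc and sqlContext already exist); the keyspace and table names are placeholders, not from the original question:

// A minimal sketch, assuming the spark-cassandra-connector build for Spark 1.4.
// "my_keyspace" and "my_table" are hypothetical names; substitute your own.
val df = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))
  .load()

df.show()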

You need something like this:

Assuming you already have the (value, 1) pairs, .reduceByKey(_ + _) will return what you need.

You can also try the following in the spark-shell:

sc.parallelize(Array[Integer](1, 1, 1, 2, 2), 3).map(x => (x, 1)).reduceByKey(_ + _).foreach(println)
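For this toy input the counts come out as (1,3) and (2,2). Note that foreach(println) runs on the executors, so on a real cluster the output appears in the executor logs rather than the driver console; collect the result first (or use collect.foreach(println)) if you want to see it locally.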
Here you go:

scala> val rdd = sc.parallelize(
  Seq(
    ("53e5c3b0-8c83-11e", Array("b", "c")),
    ("53e5c3b0-8c83-11e1", Array("a", "b")),
    ("53e5c3b0-8c83-11e2", Array("a", "b", "c")),
    ("53e5c3b0-8c83-11e3", Array("a", "b", "c", "d"))))
// rdd: org.apache.spark.rdd.RDD[(String, Array[String])] = ParallelCollectionRDD[22] at parallelize at <console>:27

scala> rdd.flatMap(_._2).map((_, 1)).reduceByKey(_ + _)
// res11: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[21] at reduceByKey at <console>:30

scala> rdd.flatMap(_._2).map((_,1)).reduceByKey(_ + _).collect
// res16: Array[(String, Int)] = Array((a,3), (b,4), (c,3), (d,1))
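If the number of distinct values is small, countByValue is a convenient shortcut: it does the same flatMap but returns the counts to the driver as a local Map, so avoid it for high-cardinality data. A sketch against the same rdd as above:

scala> rdd.flatMap(_._2).countByValue()
// returns a local Map with the same counts,
// e.g. Map(a -> 3, b -> 4, c -> 3, d -> 1) (entry order may vary)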
With the DataFrame API this is just as easy:

scala> import org.apache.spark.sql.functions.explode
// import org.apache.spark.sql.functions.explode

scala> val df = rdd.toDF("id", "data")
// df: org.apache.spark.sql.DataFrame = [id: string, data: array<string>]

scala> df.select(explode($"data").as("value")).groupBy("value").count.show
// +-----+-----+
// |value|count|
// +-----+-----+
// |    d|    1|
// |    c|    3|
// |    b|    4|
// |    a|    3|
// +-----+-----+
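The same aggregation can also be written in SQL with LATERAL VIEW explode. A sketch: registerTempTable is the Spark 1.4 API, the table name "items" is hypothetical, and LATERAL VIEW requires a HiveContext (which the spark-shell's sqlContext is by default in Hive-enabled builds):

scala> df.registerTempTable("items")

scala> sqlContext.sql("""
  SELECT value, COUNT(*) AS count
  FROM items LATERAL VIEW explode(data) exploded AS value
  GROUP BY value""").show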

Could you provide a PySpark implementation of this solution?