
Scala: splitting an RDD into multiple RDDs and caching them


I have an RDD like this:

(aid, session, sessionnew, date)
(55-BHA, 58, 15, 2017-05-09)
(07-YET, 18, 5, 2017-05-09)
(32-KXD, 27, 20, 2017-05-09)
(19-OJD, 10, 1, 2017-05-09)
(55-BHA, 1, 0, 2017-05-09)
(55-BHA, 19, 3, 2017-05-09)
(32-KXD, 787, 345, 2017-05-09)
(07-YET, 4578, 1947, 2017-05-09)
(07-YET, 23, 5, 2017-05-09)
(32-KXD, 85, 11, 2017-05-09)
I want to split everything with the same aid into its own new RDD and then cache each of those RDDs for later use, so that every unique aid has its own RDD. I have seen some other answers, but they save the RDDs to files. Is there any problem with holding this many RDDs in memory? It would be around 30,000+.


I am using spark-jobserver to keep the cached RDDs around between jobs.
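For context, the per-key split described above might look like the sketch below (hypothetical: it assumes an existing `sparkContext` and an `rddData` RDD like the sample data, and the `rddByAid` name is mine, not from the question):

```scala
// Sketch of the "one cached RDD per aid" idea from the question.
// With ~30,000 distinct aids this builds ~30,000 cached RDDs, each of
// which filters the parent RDD -- the scaling concern raised above.
val keys = rddData.map(_._1).distinct.collect()
val rddByAid = keys.map { k =>
  k -> rddData.filter(_._1 == k).cache
}.toMap
```

Each entry of `rddByAid` is a separately cached RDD, which is why the memory question matters at this scale.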

I suggest you cache the grouped RDD instead, as shown below. Assuming your RDD data is:

val rddData = sparkContext.parallelize(Seq(
      ("55-BHA", 58, 15, "2017-05-09"),
      ("07-YET", 18, 5, "2017-05-09"),
      ("32-KXD", 27, 20, "2017-05-09"),
      ("19-OJD", 10, 1, "2017-05-09"),
      ("55-BHA", 1, 0, "2017-05-09"),
      ("55-BHA", 19, 3, "2017-05-09"),
      ("32-KXD", 787, 345, "2017-05-09"),
      ("07-YET", 4578, 1947, "2017-05-09"),
      ("07-YET", 23, 5, "2017-05-09"),
      ("32-KXD", 85, 11, "2017-05-09")))
By grouping on "aid" you can cache the data once, then use filter to select whichever group you need:

val grouped = rddData.groupBy(_._1).cache
val filtered = grouped.filter(_._1 == "32-KXD")
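Once the grouping is cached, later jobs can reuse it without rescanning the source. As a hypothetical follow-up (the names `kxdRows` and `sessionTotals` are mine), you can flatten a selected group back into rows, or aggregate within every group directly:

```scala
// Flatten the selected group's Iterable of tuples back into rows.
val kxdRows = filtered.flatMap(_._2).collect()

// Or aggregate per aid straight from the cached grouping:
// sum the `session` field (second tuple element) for every key.
val sessionTotals = grouped.mapValues(_.map(_._2).sum).collect()
```

Both operations read from the cached `grouped` RDD, so only the first action pays the cost of building the groups.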
However, I would recommend using a DataFrame as shown below; it is more efficient and better optimized than RDDs:

import sqlContext.implicits._

val dataFrame = Seq(
  ("55-BHA", 58, 15, "2017-05-09"),
  ("07-YET", 18, 5, "2017-05-09"),
  ("32-KXD", 27, 20, "2017-05-09"),
  ("19-OJD", 10, 1, "2017-05-09"),
  ("55-BHA", 1, 0, "2017-05-09"),
  ("55-BHA", 19, 3, "2017-05-09"),
  ("32-KXD", 787, 345, "2017-05-09"),
  ("07-YET", 4578, 1947, "2017-05-09"),
  ("07-YET", 23, 5, "2017-05-09"),
  ("32-KXD", 85, 11, "2017-05-09")).toDF("aid", "session", "sessionnew", "date").cache

val newDF = dataFrame.select("*").where(dataFrame("aid") === "32-KXD")
newDF.show
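With the cached DataFrame you can also aggregate per aid instead of filtering out one key at a time. A small sketch (it assumes the same `dataFrame` as above; the `totals` name is mine):

```scala
import org.apache.spark.sql.functions.sum

// Per-aid totals computed from the single cached DataFrame --
// no need to materialize one dataset per key.
val totals = dataFrame
  .groupBy("aid")
  .agg(sum("session").as("session"), sum("sessionnew").as("sessionnew"))
totals.show()
```

Catalyst can optimize this whole plan, which is part of why the DataFrame route tends to outperform hand-split RDDs.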
I hope this helps.