Apache Spark: how to create a special RDD in a Spark application


I have an RDD[(K, Iterable[V])].

How can I convert it into an RDD[(K, RDD[V])] without calling collect in the driver?

------------------- Re-edit ----------------------------------------------

// Collects every group into the driver, then re-parallelizes each one --
// this is the memory problem, and it also yields an RDD of RDDs.
val dataList = DataLoader.loadTrainTestData(hiveContext.sql(sampleDataHql))
  .collect()
  .map(ds => (ds._1, sc.parallelize(ds._2.toSeq)))

// Train model and test
val resultData = sc.parallelize(dataList).map { case (remark, points) =>
  val data = points.randomSplit(Array(0.6, 0.4), seed = 11L)

  val model = new LogisticRegressionWithLBFGS()
    .setNumClasses(2)
    .setIntercept(true)
    .run(data(0))

  val trainAUC = ModelTester.getAUC(data(1), model)
  val modelWeight = model.weights.toArray.mkString("_")

  modelWeight + "|" + model.intercept.toFloat + "|" + model.numClasses +
    "|" + model.numFeatures + "|" + trainAUC.toFloat + "|" + remark
}
This does build the nested structure I asked about, but it collects every group into the driver and re-parallelizes it, which uses far too much memory.
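For context, this fails for a structural reason as well, not just a memory one: the SparkContext lives only in the driver, so an RDD can neither be created nor acted on inside another RDD's closure (Spark rejects this explicitly, see SPARK-5063). A minimal illustrative sketch of the failure mode, not taken from the original post:

val outer = sc.parallelize(Seq(1, 2, 3))

// Typically fails at runtime along the lines of "RDD transformations and
// actions can only be invoked by the driver, not inside of other
// transformations": the closure runs on executors, where neither sc nor
// other RDDs are usable.
val broken = outer.map(i => sc.parallelize(Seq(i, i * 2)).count())
broken.collect()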

------------------- Re-edit ----------------------------------------------

Alternatively: how can I convert an RDD[(K, Iterable[V])] into an Array[(K, RDD[V])] without calling collect on the driver?

Or is there a better way to run training over multiple datasets?
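One common workaround, sketched below under the assumption that the number of distinct keys is small: collect only the keys to the driver, then filter the full dataset once per key and train each model on an ordinary distributed RDD. Only the keys cross into the driver; the values never do. DataLoader, ModelTester, and sampleDataHql are the question's own names, and the flat RDD[(String, LabeledPoint)] shape after flattening is an assumption.

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

// Flatten the grouped data so each record can be filtered independently;
// cache it because it is scanned once per key.
val keyed = DataLoader.loadTrainTestData(hiveContext.sql(sampleDataHql))
  .flatMap { case (remark, points) => points.map(p => (remark, p)) }
  .cache()

// Only the distinct keys reach the driver -- a small array, not the data.
val keys = keyed.keys.distinct().collect()

// One distributed training job per key, run sequentially from the driver.
val results = keys.map { remark =>
  val points = keyed.filter(_._1 == remark).values
  val Array(train, test) = points.randomSplit(Array(0.6, 0.4), seed = 11L)

  val model = new LogisticRegressionWithLBFGS()
    .setNumClasses(2)
    .setIntercept(true)
    .run(train)

  val auc = ModelTester.getAUC(test, model)
  val weights = model.weights.toArray.mkString("_")

  Seq(weights, model.intercept.toFloat, model.numClasses,
      model.numFeatures, auc.toFloat, remark).mkString("|")
}

The trade-off is one full scan of the cached data per key, so this only pays off when the key count is small; each individual model, however, trains on a fully distributed RDD rather than in driver memory.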

------------------- Re-edit ----------------------------------------------

Nested RDDs really are not allowed! Thank you for the answers.


Is there a way to create an Array[(K, RDD[V])] without using collect in the driver?
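Strictly speaking, an Array[(K, RDD[V])] is itself a driver-side object, so the keys have to reach the driver one way or another; the realistic goal is to collect only the keys while the values stay distributed. A generic sketch of that idea (splitByKey is a hypothetical helper, not a Spark API):

import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Turns RDD[(K, Iterable[V])] into Array[(K, RDD[V])] while collecting
// only the distinct keys; each per-key RDD is a lazy, distributed filter.
def splitByKey[K: ClassTag, V: ClassTag](grouped: RDD[(K, Iterable[V])]): Array[(K, RDD[V])] = {
  val flat = grouped.flatMap { case (k, vs) => vs.map(v => (k, v)) }.cache()
  flat.keys.distinct().collect().map(k => (k, flat.filter(_._1 == k).values))
}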

That is a nested RDD, which is not possible.

I have read the answers, but I am still confused, so I have re-edited my question.

If nested RDDs are impossible, the code has to be rewritten so that it does not use them. Usually you achieve this by using join.

Please add the actual code rather than a screenshot of it. That said, all of the comments above still apply.
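To make the join remark concrete, here is a minimal, hypothetical sketch: instead of materializing one RDD per key, keep each dataset keyed by K and join them, so every step remains a single distributed RDD.

// Toy keyed datasets standing in for the real feature and label sources.
val features = sc.parallelize(Seq(("a", 1.0), ("b", 2.0)))
val labels   = sc.parallelize(Seq(("a", 0), ("b", 1)))

// join pairs up values by key without anything leaving the cluster:
// the result is RDD[(String, (Double, Int))].
val combined = features.join(labels)
combined.collect().foreach(println)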