How to avoid using collect() on a Spark RDD in Scala?
Tags: scala, apache-spark, rdd, persist, collect

I have a list from which I must build a map for further use. I am using an RDD, but when I call collect() the job fails on the cluster. Any help is appreciated.

Below is sample code going from a List to rdd.collect. I have to use this map data afterwards, but how can I do so without collecting? This code creates a map from the RDD (list) data. The list format is -> (asdfg/1234/wert, asdf).
Q: How can I use the data without collect?

Answer:

collect() will hurt you: it moves all the data to the driver node. Never do that when the data is huge.

I don't know what your use case for preparing the map is, but it can be done with a built-in Spark API, a collection accumulator; specifically, collectionAccumulator[scala.collection.mutable.Map[String, String]].

Let's assume this is your sample dataframe, and you want to make a map from it:
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
|Item_Id|Parent_Id|object_class_instance|Received_Time|CablesName|CablesStatus|CablesHInfoID|CablesIndex|object_class|ServiceTag|Scan_Time|relation_tree |
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
|-0909 |1234 |Cables-1 |23-12-2020 |LC |Installed |ABCD1234 |0 |Cables |ASDF123 |12345 |Start~>HInfo->Cables->Cables-1 |
|-09091 |1234111 |Cables-11 |23-12-2022 |LC1 |Installed1 |ABCD12341 |0 |Cables1 |ASDF1231 |123451 |Start~>HInfo->Cables->Cables-11|
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
From here, you want to create a map (a nested map; in your example I used the column names as the map keys).

Below is the complete example; please take a look and adapt it accordingly:
package examples

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

object GrabMapbetweenClosure extends App {
  Logger.getLogger("org").setLevel(Level.WARN)

  val spark = SparkSession
    .builder()
    .master("local[*]")
    .appName(this.getClass.getName)
    .getOrCreate()

  import spark.implicits._

  // Collection accumulator: executors add one map per row; the driver reads
  // the merged result after the action completes — no collect() needed.
  val mutableMapAcc = spark.sparkContext
    .collectionAccumulator[scala.collection.mutable.Map[String, String]]("mutableMap")

  val df = Seq(
    ("-0909", "1234", "Cables-1", "23-12-2020", "LC", "Installed", "ABCD1234",
      "0", "Cables", "ASDF123", "12345", "Start~>HInfo->Cables->Cables-1"),
    ("-09091", "1234111", "Cables-11", "23-12-2022", "LC1", "Installed1", "ABCD12341",
      "0", "Cables1", "ASDF1231", "123451", "Start~>HInfo->Cables->Cables-11")
  ).toDF("Item_Id", "Parent_Id", "object_class_instance", "Received_Time", "CablesName",
    "CablesStatus", "CablesHInfoID", "CablesIndex", "object_class", "ServiceTag",
    "Scan_Time", "relation_tree")

  df.show(false)

  df.foreachPartition { partition => // foreachPartition for performance's sake
    partition.foreach { record =>
      // Build one map per row and add it to the accumulator
      mutableMapAcc.add(scala.collection.mutable.Map(
        "Item_Id" -> record.getAs[String]("Item_Id"),
        "CablesStatus" -> record.getAs[String]("CablesStatus"),
        "CablesHInfoID" -> record.getAs[String]("CablesHInfoID"),
        "Parent_Id" -> record.getAs[String]("Parent_Id"),
        "CablesIndex" -> record.getAs[String]("CablesIndex"),
        "object_class_instance" -> record.getAs[String]("object_class_instance"),
        "Received_Time" -> record.getAs[String]("Received_Time"),
        "object_class" -> record.getAs[String]("object_class"),
        "CablesName" -> record.getAs[String]("CablesName"),
        "ServiceTag" -> record.getAs[String]("ServiceTag"),
        "Scan_Time" -> record.getAs[String]("Scan_Time"),
        "relation_tree" -> record.getAs[String]("relation_tree")
      ))
    }
  }

  println("FinalMap : " + mutableMapAcc.value.toString)
}
Result:
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
|Item_Id|Parent_Id|object_class_instance|Received_Time|CablesName|CablesStatus|CablesHInfoID|CablesIndex|object_class|ServiceTag|Scan_Time|relation_tree |
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
|-0909 |1234 |Cables-1 |23-12-2020 |LC |Installed |ABCD1234 |0 |Cables |ASDF123 |12345 |Start~>HInfo->Cables->Cables-1 |
|-09091 |1234111 |Cables-11 |23-12-2022 |LC1 |Installed1 |ABCD12341 |0 |Cables1 |ASDF1231 |123451 |Start~>HInfo->Cables->Cables-11|
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
FinalMap : [Map(Scan_Time -> 123451, ServiceTag -> ASDF1231, Received_Time -> 23-12-2022, object_class_instance -> Cables-11, CablesHInfoID -> ABCD12341, Parent_Id -> 1234111, Item_Id -> -09091, CablesIndex -> 0, object_class -> Cables1, relation_tree -> Start~>HInfo->Cables->Cables-11, CablesName -> LC1, CablesStatus -> Installed1), Map(Scan_Time -> 12345, ServiceTag -> ASDF123, Received_Time -> 23-12-2020, object_class_instance -> Cables-1, CablesHInfoID -> ABCD1234, Parent_Id -> 1234, Item_Id -> -0909, CablesIndex -> 0, object_class -> Cables, relation_tree -> Start~>HInfo->Cables->Cables-1, CablesName -> LC, CablesStatus -> Installed)]
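If you then need the accumulated rows as a single driver-side lookup structure, note that collectionAccumulator.value returns a java.util.List, which you can re-key with plain Scala collections. A minimal sketch of that conversion (keying by Item_Id is my assumption; no Spark is required to illustrate it — the java.util.ArrayList below stands in for the accumulator's value):

```scala
import java.util.{ArrayList => JArrayList}
import scala.collection.mutable
import scala.jdk.CollectionConverters._ // Scala 2.13+; on 2.12 use scala.collection.JavaConverters

object AccToDriverMap extends App {
  // Stand-in for mutableMapAcc.value, which is a java.util.List of row-maps
  val accValue = new JArrayList[mutable.Map[String, String]]()
  accValue.add(mutable.Map("Item_Id" -> "-0909", "CablesName" -> "LC"))
  accValue.add(mutable.Map("Item_Id" -> "-09091", "CablesName" -> "LC1"))

  // Re-key the accumulated row-maps by Item_Id for constant-time lookup
  val byItemId: Map[String, Map[String, String]] =
    accValue.asScala.map(m => m("Item_Id") -> m.toMap).toMap

  println(byItemId("-0909")("CablesName")) // prints LC
}
```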
Comments:

"Can you add some sample input and output? How exactly does it fail — is it an out-of-memory error? And how do you use the collected data?"

"Input list: List((Start~>HInfo~>Monitor~>VSData, XYZVN), (Start~>HInfo~>Cables~>Cables-1~>Name, LC), (Start~>HInfo~>Disk~>Disk-1~>Partition~>Partition-1~>Name, Unused)). Sample output map: Map(Item_Id -> 0909, Parent_Id -> 1234, object_class_instance -> Cables-3, Received_Time -> 23-12-2020, Cables -> Map(Index -> 2, Status -> Installed, HInfoID -> ABCD1234, Name -> WLAN), object_class -> Cables, ServiceTag -> ASDF123, Scan_Time -> 12345, relation_tree -> Start~>HInfo~>Cables~>Cables-3), Map(Item_Id -> 0909, Parent_Id -> 1234, object_class_instance -> Cables-1, Received_Time -> 23-12-2020, Cables -> Map(Name -> LC, Status -> Installed, HInfoID -> ABCD1234, Index -> 0), object_class -> Cables, ServiceTag -> ASDF123, Scan_Time -> 12345, relation_tree -> Start~>HInfo~>Cables~>Cables-1)"

"Hi @Ram Ghadiyaram, this works fine, but in my actual code the RDD path runs correctly, while converting the List to a DataFrame throws a NullPointerException: val listData = relationData.toList; val rawDF = listData.toDF("rln_tr", "attr_val"); rawDF.show(false) // rawDF throws a NullPointerException. val rdd = sparkContext.makeRDD(listData) // this works fine."

"You have to pin down which operation hits the null pointer; I am not sure about your data. My point stands: don't collect."
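As a side note on the input format mentioned in the comments — pairs like (Start~>HInfo~>Cables~>Cables-1~>Name, LC) — the nested-map shape can be derived with plain Scala collections before any Spark is involved. A rough sketch, where treating everything before the last "~>" segment as the grouping key is my assumption about the intended nesting:

```scala
object PathListToMap extends App {
  // (path, value) pairs in the format quoted in the comments
  val input = List(
    ("Start~>HInfo~>Cables~>Cables-1~>Name", "LC"),
    ("Start~>HInfo~>Cables~>Cables-1~>Status", "Installed"),
    ("Start~>HInfo~>Monitor~>VSData", "XYZVN")
  )

  // Group by the parent path (all segments but the last) and collect
  // the leaf segment -> value pairs under it as the inner map.
  val grouped: Map[String, Map[String, String]] = input
    .map { case (path, value) =>
      val segs = path.split("~>")
      (segs.init.mkString("~>"), segs.last -> value)
    }
    .groupBy(_._1)
    .map { case (parent, kvs) => parent -> kvs.map(_._2).toMap }

  println(grouped("Start~>HInfo~>Cables~>Cables-1")) // Name and Status under Cables-1
}
```

With this shape in hand you can build the per-entity nested maps first and only then parallelize, which sidesteps the need to collect() a flat list back to the driver.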