Apache Spark - accessing internal data on RDDs?

I'm getting started with this. I tried the following two scenarios:

Scenario #1
val pagecounts = sc.textFile("data/pagecounts")
pagecounts.checkpoint
pagecounts.count
Scenario #2

val pagecounts = sc.textFile("data/pagecounts")
pagecounts.count
The total time shown in the Spark shell Application UI differs between the two scenarios: Scenario #1 took 0.5 seconds, while Scenario #2 took only 0.2 seconds.

My understanding is that in Scenario #1 the checkpoint command should not cost anything extra: it is neither a transformation nor an action. It only marks the RDD so that, once it materializes after an action completes, it also gets saved to disk. Am I missing something here?
My questions: the Spark shell Application UI shows the following - scheduler delay, task deserialization time, GC time, result serialization time, getting result time. However, it shows no breakdown for checkpointing:

- Size of the RDD when persisted to disk on checkpointing
- What percentage of the RDD is currently in memory
- Total time taken to compute the RDD
For the RDD's size and what is cached in memory:

GET /api/v1/applications/[app-id]/storage/rdd/0

will respond with:
{
"id" : 0,
"name" : "ParallelCollectionRDD",
"numPartitions" : 2,
"numCachedPartitions" : 2,
"storageLevel" : "Memory Deserialized 1x Replicated",
"memoryUsed" : 28000032,
"diskUsed" : 0,
"dataDistribution" : [ {
"address" : "localhost:54984",
"memoryUsed" : 28000032,
"memoryRemaining" : 527755733,
"diskUsed" : 0
} ],
"partitions" : [ {
"blockName" : "rdd_0_0",
"storageLevel" : "Memory Deserialized 1x Replicated",
"memoryUsed" : 14000016,
"diskUsed" : 0,
"executors" : [ "localhost:54984" ]
}, {
"blockName" : "rdd_0_1",
"storageLevel" : "Memory Deserialized 1x Replicated",
"memoryUsed" : 14000016,
"diskUsed" : 0,
"executors" : [ "localhost:54984" ]
} ]
}
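The first two questions (size when persisted, and how much is currently in memory) can be read straight off this response. A minimal post-processing sketch, with the relevant fields copied from the response above into a JSON literal for illustration:

```python
import json

# Storage response for GET /api/v1/applications/[app-id]/storage/rdd/0,
# abbreviated to the fields used below (values copied from the response above).
storage_response = """
{
  "id": 0,
  "numPartitions": 2,
  "numCachedPartitions": 2,
  "memoryUsed": 28000032,
  "diskUsed": 0
}
"""

rdd = json.loads(storage_response)

# Size of the RDD: bytes held in memory vs. bytes persisted to disk.
memory_bytes = rdd["memoryUsed"]
disk_bytes = rdd["diskUsed"]

# Percentage of the RDD currently in memory, measured by cached partitions.
pct_in_memory = 100.0 * rdd["numCachedPartitions"] / rdd["numPartitions"]

print(memory_bytes, disk_bytes, pct_in_memory)  # 28000032 0 100.0
```

In this response all partitions are cached, so the RDD is 100% in memory and nothing has been written to disk.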
Total time taken to compute the RDD:

Computing the RDD corresponds to a job, which breaks down into stages and attempts.

GET /api/v1/applications/[app-id]/stages/[stage-id]/[stage-attempt-id]/taskSummary

will respond with:
{
"quantiles" : [ 0.05, 0.25, 0.5, 0.75, 0.95 ],
"executorDeserializeTime" : [ 2.0, 2.0, 2.0, 2.0, 2.0 ],
"executorRunTime" : [ 3.0, 3.0, 4.0, 4.0, 4.0 ],
"resultSize" : [ 1457.0, 1457.0, 1457.0, 1457.0, 1457.0 ],
"jvmGcTime" : [ 0.0, 0.0, 0.0, 0.0, 0.0 ],
"resultSerializationTime" : [ 0.0, 0.0, 0.0, 0.0, 0.0 ],
"memoryBytesSpilled" : [ 0.0, 0.0, 0.0, 0.0, 0.0 ],
"diskBytesSpilled" : [ 0.0, 0.0, 0.0, 0.0, 0.0 ],
"shuffleReadMetrics" : {
"readBytes" : [ 340.0, 340.0, 342.0, 342.0, 342.0 ],
"readRecords" : [ 10.0, 10.0, 10.0, 10.0, 10.0 ],
"remoteBlocksFetched" : [ 0.0, 0.0, 0.0, 0.0, 0.0 ],
"localBlocksFetched" : [ 2.0, 2.0, 2.0, 2.0, 2.0 ],
"fetchWaitTime" : [ 0.0, 0.0, 0.0, 0.0, 0.0 ],
"remoteBytesRead" : [ 0.0, 0.0, 0.0, 0.0, 0.0 ],
"totalBlocksFetched" : [ 2.0, 2.0, 2.0, 2.0, 2.0 ]
}
}
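For the third question, the taskSummary quantiles can be combined into a rough per-task time. This is not an official Spark metric, just a sketch that sums the median of the time components reported above (all values are milliseconds):

```python
import json

# taskSummary response, abbreviated to the time metrics used below
# (values copied from the response above; quantiles are the 5/25/50/75/95th).
task_summary = json.loads("""
{
  "quantiles": [0.05, 0.25, 0.5, 0.75, 0.95],
  "executorDeserializeTime": [2.0, 2.0, 2.0, 2.0, 2.0],
  "executorRunTime": [3.0, 3.0, 4.0, 4.0, 4.0],
  "jvmGcTime": [0.0, 0.0, 0.0, 0.0, 0.0],
  "resultSerializationTime": [0.0, 0.0, 0.0, 0.0, 0.0]
}
""")

median = task_summary["quantiles"].index(0.5)  # position of the 50th percentile

# Rough median wall time per task: deserialization + run + result serialization
# (jvmGcTime happens inside executorRunTime, so it is not added separately).
components = ["executorDeserializeTime", "executorRunTime",
              "resultSerializationTime"]
median_task_ms = sum(task_summary[c][median] for c in components)

print(median_task_ms)  # 6.0
```

Summing this across tasks (or across stages of the job) gives an estimate of the total compute time for the RDD.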
Comments:

"Your question is too broad, so I won't answer each part one by one. I believe that everything Spark has to report is reflected in the REST API."

"I know my question is broad - thank you for the answer. Can I use the REST API while the application is running?"

"Yes. In fact, for a self-contained application you can only access the REST API while the application is running. Once the application terminates, the API server terminates with it."

"I see that now - I was trying to access the REST API after the application had stopped. In that case, how do I get the application ID from inside my Spark application? I want to log the data above after each action in my program."

"This is getting out of hand; sc.set("spark.app.id", "myId") will do it. You could have found that out easily."
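To build those REST calls from inside a running application you need the application ID (in the Scala API, sc.applicationId returns it). Below is a small, hypothetical helper for assembling the two endpoint URLs used above; the base URL assumes the Spark UI's default address (localhost:4040), which will differ per deployment:

```python
# Hypothetical helpers: build the REST endpoints discussed in the answer above.
# The default base URL is an assumption (the standalone Spark UI default).
DEFAULT_BASE = "http://localhost:4040/api/v1"

def storage_url(app_id, rdd_id, base=DEFAULT_BASE):
    """Endpoint for RDD storage info (memory/disk size, cached partitions)."""
    return f"{base}/applications/{app_id}/storage/rdd/{rdd_id}"

def task_summary_url(app_id, stage_id, attempt_id=0, base=DEFAULT_BASE):
    """Endpoint for per-stage task time quantiles."""
    return (f"{base}/applications/{app_id}/stages/{stage_id}/"
            f"{attempt_id}/taskSummary")

print(storage_url("app-20160101000000-0000", 0))
# http://localhost:4040/api/v1/applications/app-20160101000000-0000/storage/rdd/0
```

Fetching these URLs after each action (with any HTTP client) is one way to log the metrics while the application, and therefore its API server, is still alive.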