Apache Spark: cached DataFrame is refreshed after truncating the table

apache-spark


Here are the steps:

scala> val df = sql("select * from table")
df: org.apache.spark.sql.DataFrame = [num: int]

scala> df.cache
res13: df.type = [num: int]

scala> df.collect
res14: Array[org.apache.spark.sql.Row] = Array([10], [10])

scala> df
res15: org.apache.spark.sql.DataFrame = [num: int]

scala> df.show
+---+
|num|
+---+
| 10|
| 10|
+---+


scala> sql("truncate table table")
res17: org.apache.spark.sql.DataFrame = []

scala> df.show
+---+
|num|
+---+
+---+

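My question is: why does df get refreshed? I expected it to stay cached in memory, so truncate should not remove its data.

Any ideas would be greatly appreciated.

Thanks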

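You should never rely on cache for correctness. Spark cache is a performance optimization, and even the most defensive StorageLevel (MEMORY_AND_DISK_SER_2) does not guarantee that the data is preserved in case of worker failure, executor decommissioning, or insufficient resources.

Code similar to the one used in the question may work in some cases, but do not assume that this is guaranteed or deterministic behavior.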
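
If the rows have to survive the truncate, one option is to materialize them somewhere that does not depend on the table at all. The following is a minimal sketch, not a definitive recipe: the table name demo_numbers, the temporary directories, and the object name are all made up for illustration, and it assumes a local Spark session with the default Parquet data source.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object SnapshotBeforeTruncate {
  // Returns (rows collected to the driver, rows in the Parquet snapshot),
  // both measured after the table has been truncated.
  def run(spark: SparkSession): (Int, Long) = {
    import spark.implicits._

    // Hypothetical table standing in for `table` from the question.
    Seq(10, 10).toDF("num").write.mode("overwrite").saveAsTable("demo_numbers")

    // Option 1: for small results, copy the rows to the driver.
    val collected = spark.table("demo_numbers").collect()

    // Option 2: write a snapshot to a separate location and read it back.
    // The snapshot files are independent of the table's own storage.
    val snapshotPath = Files.createTempDirectory("snapshot").resolve("data").toString
    spark.table("demo_numbers").write.parquet(snapshotPath)
    val snapshot = spark.read.parquet(snapshotPath)

    spark.sql("TRUNCATE TABLE demo_numbers")

    (collected.length, snapshot.count())
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("snapshot-before-truncate")
      // Temporary warehouse directory so the sketch does not touch existing tables.
      .config("spark.sql.warehouse.dir", Files.createTempDirectory("warehouse").toString)
      .getOrCreate()
    try {
      val (collectedRows, snapshotRows) = run(spark)
      println(s"collected rows: $collectedRows, snapshot rows: $snapshotRows")
    } finally spark.stop()
  }
}
```

Collecting only makes sense for results that fit on the driver; the Parquet snapshot costs an extra write but does not rely on cached blocks staying around.
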
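The truncate table command drops the cached data: it uncaches the table and then empties it. This is visible in the source code of truncate. If you look at the source of TruncateTableCommand, at the bottom of the case class you will see the following, which shows how the cache and the table are handled when the table is truncated: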
// After deleting the data, invalidate the table to make sure we don't keep around a stale
// file relation in the metastore cache.
spark.sessionState.refreshTable(tableName.unquotedString)
// Also try to drop the contents of the table from the columnar cache
try {
  spark.sharedState.cacheManager.uncacheQuery(spark.table(table.identifier))
} catch {
  case NonFatal(e) =>
    log.warn(s"Exception when attempting to uncache table $tableIdentWithDB", e)
}

if (table.stats.nonEmpty) {
  // empty table after truncation
  val newStats = CatalogStatistics(sizeInBytes = 0, rowCount = Some(0))
  catalog.alterTableStats(tableName, Some(newStats))
}
Seq.empty[Row]
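
To observe this on your own installation, you can compare the DataFrame's storage level before and after the truncate. This is only a sketch under assumptions: the table name demo_cached and the object name are hypothetical, and what the second value shows may differ between Spark versions, so no expected output is given.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CacheStatusAroundTruncate {
  // Returns the DataFrame's storage level before and after TRUNCATE TABLE.
  def storageLevels(spark: SparkSession): (StorageLevel, StorageLevel) = {
    import spark.implicits._

    // Hypothetical table standing in for `table` from the question.
    Seq(10, 10).toDF("num").write.mode("overwrite").saveAsTable("demo_cached")

    val df = spark.sql("select * from demo_cached")
    df.cache()
    df.count() // run an action so the cache is actually populated

    val before = df.storageLevel
    spark.sql("TRUNCATE TABLE demo_cached")
    // StorageLevel.NONE here would mean the cache entry for df was dropped.
    val after = df.storageLevel

    (before, after)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("cache-status-around-truncate")
      // Temporary warehouse directory so the sketch does not touch existing tables.
      .config("spark.sql.warehouse.dir", Files.createTempDirectory("warehouse").toString)
      .getOrCreate()
    try {
      val (before, after) = storageLevels(spark)
      println(s"storage level before truncate: $before")
      println(s"storage level after truncate:  $after")
    } finally spark.stop()
  }
}
```
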