Hive: Reading a Hive table in PySpark and updating the same table using checkpoint


I am using Spark 2.3 and trying to read a Hive table in Spark as follows:

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
df = spark.table("emp.emptable")

Here I add a new column containing the current system date to the existing DataFrame:

import pyspark.sql.functions as F
newdf = df.withColumn('LOAD_DATE', F.current_date())


Now I run into a problem when I try to write this DataFrame back as a Hive table:

newdf.write.mode("overwrite").saveAsTable("emp.emptable")

pyspark.sql.utils.AnalysisException: u'Cannot overwrite table emp.emptable that is also being read from;'

So I checkpoint the DataFrame to break the lineage, since I am reading from and writing to the same table:

import pyspark.sql.functions as F

# Checkpoint files are written under this directory
checkpointDir = "/hdfs location/temp/tables/"
spark.sparkContext.setCheckpointDir(checkpointDir)

# checkpoint() materializes the data and truncates the lineage back to emp.emptable
df = spark.table("emp.emptable").coalesce(1).checkpoint()
newdf = df.withColumn('LOAD_DATE', F.current_date())
newdf.write.mode("overwrite").saveAsTable("emp.emptable")

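This way it works fine and the new column is added to the Hive table. But I have to delete the checkpoint files every time they are created. Is there a better way to break the lineage and write the same DataFrame with the updated column details, saving it either to an HDFS location or as a Hive table?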

Or is there a way to specify a temporary location for the checkpoint directory that gets deleted once the Spark session completes?

As we discussed in the post, setting the property below is one way to go:

spark.conf.set("spark.cleaner.referenceTracking.cleanCheckpoints", "true")

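That question had a different context. We wanted to retain the checkpointed dataset, so we did not care about adding a cleanup solution.

Setting the above property works sometimes (tested in Scala, Java, and Python), but it is hard to rely on. The official documentation says that this property

Controls whether to clean checkpoint files if the reference is out of scope.

I don't know exactly what that means, because my understanding is that once the Spark session/context is stopped, it should clean them up. It would be great if someone could shed light on this.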
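One possible reading, offered as an assumption rather than something confirmed in this thread, is that "out of scope" refers to the checkpointed object no longer being referenced on the driver (so cleanup would follow garbage collection), not to the session being stopped. A minimal sketch of supplying the property while the session is being built instead of setting it afterwards; `with_checkpoint_cleanup` is a hypothetical helper name, and whether this makes the cleanup reliable is untested here:

```python
def with_checkpoint_cleanup(builder):
    """Return the given SparkSession builder with checkpoint cleanup enabled.

    `builder` is expected to be `SparkSession.builder` (or anything exposing
    the same `config(key, value)` method). The property name is the one
    quoted above; this helper itself is hypothetical.
    """
    return builder.config(
        "spark.cleaner.referenceTracking.cleanCheckpoints", "true"
    )


# Usage (requires pyspark; the app name is a placeholder):
# from pyspark.sql import SparkSession
# spark = with_checkpoint_cleanup(
#     SparkSession.builder.appName("emp-load")
# ).getOrCreate()
```

Passing the property through the builder only guarantees that it is present when a new context is created; it does not change when Spark decides a checkpoint is eligible for cleanup.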

Regarding "Is there a best way to break the lineage?":

Check this question, where @BiS found a way to cut the lineage using the createDataFrame(RDD, Schema) method. I haven't tested it myself, though.
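For illustration, the createDataFrame(RDD, Schema) idea could look roughly like the sketch below. `break_lineage` is a hypothetical helper name, not part of PySpark. Note that this only rebuilds the query plan from the RDD and schema; it does not materialize the rows the way checkpoint() does, so whether it is safe when overwriting the very table being read is an open assumption here.

```python
def break_lineage(spark, df):
    """Rebuild `df` from its underlying RDD and schema.

    The returned DataFrame has a fresh logical plan that no longer
    references the source table. The data is still computed lazily.
    """
    return spark.createDataFrame(df.rdd, df.schema)


# Usage (requires a running SparkSession named `spark`):
# import pyspark.sql.functions as F
# df = spark.table("emp.emptable")
# newdf = break_lineage(spark, df).withColumn("LOAD_DATE", F.current_date())
```

Unlike the checkpoint approach, nothing is written to the checkpoint directory, so there is nothing to clean up afterwards.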

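Just FYI: to be safe, I usually don't rely on the above property and instead delete the checkpointed directory in the code itself.

We can get the checkpointed directory as shown below: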

Scala:

//Set directory
scala> spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoint/")

scala> spark.sparkContext.getCheckpointDir.get
res3: String = hdfs://<name-node:port>/tmp/checkpoint/625034b3-c6f1-4ab2-9524-e48dfde589c3

//It gives String so we can use org.apache.hadoop.fs to delete path


PySpark:

# Set directory
>>> spark.sparkContext.setCheckpointDir('hdfs:///tmp/checkpoint')
>>> t = sc._jsc.sc().getCheckpointDir().get()
>>> t
u'hdfs://<name-node:port>/tmp/checkpoint/dc99b595-f8fa-4a08-a109-23643e2325ca'

# Note the 'u' prefix: a unicode object is returned, so use str(t)
# Get the Hadoop FileSystem object and delete the directory

>>> fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
>>> fs.exists(sc._jvm.org.apache.hadoop.fs.Path(str(t)))
True

>>> fs.delete(sc._jvm.org.apache.hadoop.fs.Path(str(t)), True)  # True = recursive
True

>>> fs.exists(sc._jvm.org.apache.hadoop.fs.Path(str(t)))
False
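The same steps could be wrapped so that the directory is removed even if the job fails part-way. This is only a sketch: `temporary_checkpoint_dir` is a hypothetical helper, and it relies on the same private `_jsc` and `_jvm` handles used in the session above, which may change between Spark versions.

```python
from contextlib import contextmanager


@contextmanager
def temporary_checkpoint_dir(sc, base_dir):
    """Set `base_dir` as the checkpoint directory for SparkContext `sc`
    and delete the per-application subdirectory Spark created when the
    block exits, whether it exits normally or with an exception."""
    sc.setCheckpointDir(base_dir)
    try:
        yield
    finally:
        # getCheckpointDir() returns a Scala Option[String]
        opt = sc._jsc.sc().getCheckpointDir()
        if opt.isDefined():
            path = sc._jvm.org.apache.hadoop.fs.Path(str(opt.get()))
            fs = path.getFileSystem(sc._jsc.hadoopConfiguration())
            fs.delete(path, True)  # True = delete recursively


# Usage (requires a running SparkSession named `spark`):
# import pyspark.sql.functions as F
# with temporary_checkpoint_dir(spark.sparkContext, "hdfs:///tmp/checkpoint"):
#     df = spark.table("emp.emptable").coalesce(1).checkpoint()
#     newdf = df.withColumn("LOAD_DATE", F.current_date())
#     newdf.write.mode("overwrite").saveAsTable("emp.emptable")
```

The write has to finish inside the `with` block, because the checkpointed data is gone once the block exits.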


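Possible duplicate.

I guess that refers to a different question. I am looking for a solution to break the lineage of DataFrame transformations, since I am reading and updating the same DataFrame and then persisting it as a Hive table :-(

Thanks a lot. Yes, I read that post and liked it very much. It has good information about checkpointing and restarts. I want to try it in PySpark.

@vikrantrana: Lineage and checkpointing have always been very tricky and interesting topics for me.

Yes. I also saw your two great answers on other topics. Thank you for your efforts and for sharing; really interesting.

@vikrantrana Sure. Thanks, and you're welcome :) It is always worth sharing.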