Amazon S3: how to use spark.catalog.refreshTable(tablename)


amazon-s3 pyspark


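I want to write a CSV file after transforming my Spark data with a function. The Spark DataFrame I get after the transformation looks fine, but when I try to write it to a CSV file, I get this error: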

It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.

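But I really don't understand how to use the

spark.catalog.refreshTable(tablename)

function. I tried calling it between the transformation and the file write, but it says

AttributeError: 'DataFrame' object has no attribute '_get_object_id'

so I don't know how to deal with it.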
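A note on the call itself: `spark.catalog.refreshTable` takes the name of a table or view as a string, so passing a DataFrame object is a likely cause of the `_get_object_id` error above. A minimal sketch, assuming the DataFrame is first registered as a temporary view (the helper `refresh_view` and the view name are hypothetical, not part of the original code):

```python
def refresh_view(spark, df, view_name):
    """Register df as a temporary view and refresh it by name.

    `spark` is expected to be a SparkSession and `df` a Spark DataFrame.
    """
    # refreshTable needs a table/view NAME (a string), not a DataFrame
    if not isinstance(view_name, str):
        raise TypeError("refreshTable expects a table or view name as a string")
    df.createOrReplaceTempView(view_name)
    spark.catalog.refreshTable(view_name)
    # Return a DataFrame backed by the refreshed view
    return spark.table(view_name)
```

It might be called as `df = refresh_view(spark, df, "images")` before the transformation. Refreshing only invalidates cached metadata, so it will not help if the input files really have moved or disappeared.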

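import numpy as np
import pandas as pd
import tensorflow as tf
from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import ArrayType, DoubleType
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input
from tensorflow.keras.models import load_model

#NOTE: IMAGE_SIZE, S3dir, today and df are assumed to be defined earlier in the script

#Create the function to resize the images and extract the features with mobilenetV2 model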
def red_dim(width, height, nChannels, data):
    #Transform image data to tensorflow compatible format
    images = []
    for i in range(height.shape[0]):
        x = np.ndarray(
                shape=(height[i], width[i], nChannels[i]),
                dtype=np.uint8,
                buffer=data[i],
                strides=(width[i] * nChannels[i], nChannels[i], 1))
        images.append(preprocess_input(x))
    #Resize images with the chosen size of the model
    images = np.array(tf.image.resize(images, [IMAGE_SIZE, IMAGE_SIZE]))

    #Load the model
    model = load_model('models')
    
    #Predict features for images
    preds = model.predict(images).reshape(len(width), 3 * 3 * 1280)
    
    #Return a pandas series with list of features for all images 
    return pd.Series(list(preds))

#Turn the function into a pandas udf
#This lets Spark apply the function to the data in batches
red_dim_udf = pandas_udf(red_dim, returnType=ArrayType(DoubleType()))

#4 actions : 
#   apply the udf function defined just before
#   cast the array of features to a string so it can be written in a csv
#   select only the data that will be written in the csv
#   write the data -> where the error occurs
results = (df.withColumn("dim_red", red_dim_udf(col("image.width"), col("image.height"),
                                                col("image.nChannels"),
                                                col("image.data")))
             .withColumn("dim_red_string", col("dim_red").cast("string"))
             .select("image.origin", "dim_red_string"))
#write.csv returns None, so keep the DataFrame and the write as separate statements
results.repartition(5).write.csv(S3dir + '/results' + today)

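This is a well-known issue: the underlying source data is being updated while Spark is processing it.

I suggest checkpointing before applying the transformation, i.e. moving/copying the data to another directory.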
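One way to read this suggestion is to write the input DataFrame to a staging location that nothing else modifies, then read it back and transform the copy. A minimal sketch under that assumption (the helper `stage_dataframe` and the staging path are hypothetical; Parquet is chosen here only because it can hold the binary image column):

```python
def stage_dataframe(spark, df, staging_path):
    """Write df to a private staging directory and return a DataFrame read from it.

    `spark` is expected to be a SparkSession and `df` a Spark DataFrame.
    """
    # Materialize the current contents so later jobs no longer depend on the original files
    df.write.mode("overwrite").parquet(staging_path)
    # Read the stable copy back; transformations should be applied to this DataFrame
    return spark.read.parquet(staging_path)
```

For example, `df = stage_dataframe(spark, df, S3dir + "/staging")` could be run before the `withColumn` calls, where `S3dir + "/staging"` is only an illustrative location.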


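I think I can close my question, because I found the answer.

If you get this kind of error, it can also be because the S3 folder used to build the DataFrame has a space in its name. Spark does not recognize the space character in the folder name, so it concludes that the folder no longer exists.

But thanks @Constantine for your help.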
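A quick way to check for this before reading is to scan the S3 URI for whitespace in its folder names. A minimal sketch using only the standard library (the helper name and the bucket/folder names are made up for illustration):

```python
from urllib.parse import urlparse


def path_segments_with_spaces(s3_uri):
    """Return the folder/key segments of an S3 URI that contain whitespace."""
    parsed = urlparse(s3_uri)
    # Split the key part of the URI into its non-empty segments
    segments = [seg for seg in parsed.path.split("/") if seg]
    return [seg for seg in segments if any(ch.isspace() for ch in seg)]


# Hypothetical input location with spaces in two folder names
suspicious = path_segments_with_spaces("s3://my-bucket/raw images/batch 1/")
```

If the returned list is not empty, renaming those folders (for example, replacing spaces with underscores) is one thing to try before looking at caching or concurrent updates.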


Hi, thanks for your help! So, if I understand correctly, after

red_dim_udf = pandas_udf(red_dim, returnType=ArrayType(DoubleType()))

I move red_dim_udf to another folder of the S3 bucket, and then I use the variable from this new folder as the input for the following lines? (Just to add some information: I don't know how to move my DataFrame before the transformation, because 'df' is just a Spark DataFrame containing all the images in Spark format, and 'results' is the DataFrame that only contains the origin of the images (as in 'df') and the features computed by the function (so after the transformation)... and I have nothing else.) How can I change the directory of 'df' (if I understood correctly), since this DataFrame is only stored in the memory of the SparkContext (I think)?

Just use a simple FS copy, or read and write to another directory using Spark.

Cool. Good to know there are other reasons for this Spark error.