Error when converting a base64 string to an image with PySpark

pandas apache-spark pyspark

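I want to extract and process image data (3D arrays) stored in base64 format using PySpark. I am using a pandas_udf with pyarrow as the processing function. Inside the UDF, the first step is to convert the base64 string into an image. At that step, however, I get the error "TypeError: file() argument 1 must be encoded string without null bytes, not str".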

I use base64.b64decode(imgString) to convert the base64 string into an image. I am using Python 2.7.

If you can provide a copy/pasteable MCVE that the community can easily run to see the problem for themselves, you will get an answer. Do you perhaps need a few lines of code?

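from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, StringType

# Load the Avro file and keep only the columns needed for processing.
avrodf = sqlContext.read.format("com.databricks.spark.avro").load("hdfs:///Raw_Images_201803182350.avro")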
interested_cols = ["id","name","image_b64"]
indexed_avrodf = avrodf.select(interested_cols)
ctx_cols = ["id","name"]
# schema and img_proc (shown further down) must be defined before this line runs.
result_sdf = indexed_avrodf.groupby(ctx_cols).apply(img_proc)

schema = StructType([
    StructField("id",StringType()),
    StructField("name",StringType()),
    StructField("image",StringType()),
    StructField("Proc_output",StringType())
])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def img_proc(df):
    df['Proc_output'] = df['image_b64'].apply(is_processed)
    return df

def is_processed(imgString):
    import cv2
    from PIL import Image, ImageDraw, ImageChops
    import base64

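    wisimg = base64.b64decode(imgString)
    # The TypeError is raised here: wisimg holds the decoded image bytes,
    # but Image.open() treats a plain str as a file name and tries to open it.
    image = Image.open(wisimg)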

    .....

    return processed_status
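
The error message suggests that the decoded bytes are being passed to Image.open() as if they were a file name. A minimal sketch of one likely fix is to wrap the bytes in an in-memory buffer first. The helper name decode_b64_image and the synthetic 4x3 test image are assumptions made for illustration; they are not part of the original code.

```python
import base64
import io

from PIL import Image


def decode_b64_image(img_string):
    """Decode a base64 string into a PIL image (hypothetical helper)."""
    raw_bytes = base64.b64decode(img_string)
    # Image.open() expects a file name or a file-like object, so wrap the
    # decoded bytes in an in-memory buffer instead of passing them directly.
    image = Image.open(io.BytesIO(raw_bytes))
    image.load()  # force decoding now, while the buffer is still referenced
    return image


# Round-trip check with a tiny synthetic image (not the asker's data).
buffer = io.BytesIO()
Image.new("RGB", (4, 3), color=(255, 0, 0)).save(buffer, format="PNG")
sample_b64 = base64.b64encode(buffer.getvalue())

decoded = decode_b64_image(sample_b64)
print(decoded.size, decoded.mode)
```

Inside is_processed, the same change would amount to replacing Image.open(wisimg) with Image.open(io.BytesIO(wisimg)), assuming the image_b64 column really contains base64-encoded image files such as PNG or JPEG.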