AWS Glue write_dynamic_frame_from_options throws a schema exception

Tags: csv, pyspark, aws-glue, aws-glue-spark

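I'm new to PySpark and AWS Glue, and I've run into a problem when trying to write out a file with Glue. When I use Glue's write_dynamic_frame_from_options to write some output to S3, it throws an exception saying: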

: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 199.0 failed 4 times, most recent failure:
 Lost task 0.3 in stage 199.0 (TID 7991, 10.135.30.121, executor 9): java.lang.IllegalArgumentException: Number of column in CSV header is not equal to number of fields in the schema:
 Header length: 7, schema size: 6
CSV file: s3://************************************cache.csv
    at org.apache.spark.sql.execution.datasources.csv.CSVDataSource$$anonfun$checkHeaderColumnNames$1.apply(CSVDataSource.scala:180)
    at org.apache.spark.sql.execution.datasources.csv.CSVDataSource$$anonfun$checkHeaderColumnNames$1.apply(CSVDataSource.scala:176)
    at scala.Option.foreach(Option.scala:257)
    at .....

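It seems to be saying that my DataFrame's schema has 6 fields while the CSV has 7. I don't understand which CSV it is talking about, because I'm actually trying to create a new CSV from the DataFrame. Any insight into this specific problem, or into how the write_dynamic_frame_from_options method works, would be very helpful.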

Here is the source code of the function in my job that causes the problem:



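from awsglue.dynamicframe import DynamicFrame  # needed for DynamicFrame.fromDF below

# logger, S3_BUCKET and TEMP_FILE_LOCATION are assumed to be defined elsewhere in the job script

def update_geocache(glueContext, originalDf, newDf):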
    logger.info("Got the two df's to union")
    logger.info("Schema of the original df")
    originalDf.printSchema()
    logger.info("Schema of the new df")
    newDf.printSchema()
    # add the two Dataframes together
    unioned_df = originalDf.unionByName(newDf).distinct()
    logger.info("Schema of the union")
    unioned_df.printSchema()
    # Output of printSchema() for the union:
    # root
    #  |-- location_key: string (nullable = true)
    #  |-- addr1: string (nullable = true)
    #  |-- addr2: string (nullable = true)
    #  |-- zip: string (nullable = true)
    #  |-- lat: string (nullable = true)
    #  |-- lon: string (nullable = true)

    # Create just 1 partition, because there is so little data
    unioned_df = unioned_df.repartition(1)
    logger.info("Unioned the geocache and the new addresses")
    # Convert back to dynamic frame
    dynamic_frame = DynamicFrame.fromDF(
        unioned_df, glueContext, "dynamic_frame")
    logger.info("Converted the unioned tables to a Dynamic Frame")
    # Write data back to S3
    # THIS IS THE LINE THAT THROWS THE EXCEPTION
    glueContext.write_dynamic_frame.from_options(
        frame=dynamic_frame,
        connection_type="s3",
        connection_options={
            "path": "s3://" + S3_BUCKET + "/" + TEMP_FILE_LOCATION
        },
        format="csv"
    )

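Comments:

It looks like your header may have an extra comma or column. Can you post the header and a few records in the question? Also try disabling the header when reading: dyF = glueContext.create_dynamic_frame.from_options('s3', {'paths': ['s3://path']}, 'csv', {'withHeader': False})

Thanks @Prabhakarredy! I'll try withHeader False to see what happens. However, I don't understand your first sentence. You ask me to post the header, but which header do you mean? Shouldn't it just write my dynamic frame to a CSV? The schema of the DataFrame is shown in the source code above.

I just ran it again with "withHeaders": False and it still gets the same exception.

Can you try passing the same flag when writing? Also, do you have the Glue Catalog enabled in this job? If so, try disabling it.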
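A note on which CSV the error refers to: the exception is raised from CSVDataSource.checkHeaderColumnNames, which is part of Spark's CSV read path, so it most likely concerns an input CSV (the file named in the trace) and not the file being written. Spark evaluates lazily, so the read only runs when the write triggers the job. The sketch below is plain Python, not Glue or Spark code; it mirrors the header-versus-schema comparison so the mismatch can be inspected on a downloaded copy of the input file. The sample data, the extra "country" column, and the compare_header helper are all hypothetical; only the six field names come from the printSchema() output in the question.

```python
import csv
import io

# Hypothetical sample standing in for the cached input CSV. The trailing
# "country" column is an assumption, used only to reproduce a 7-vs-6 mismatch.
SAMPLE_CSV = (
    "location_key,addr1,addr2,zip,lat,lon,country\n"
    "k1,1 Main St,,12345,40.0,-75.0,US\n"
)

# Field names taken from the printSchema() output in the question.
EXPECTED_FIELDS = ["location_key", "addr1", "addr2", "zip", "lat", "lon"]


def compare_header(csv_text, expected_fields):
    """Return (header_length, schema_size, extra_columns, missing_columns)."""
    # Only the first row (the header) is parsed.
    header = next(csv.reader(io.StringIO(csv_text)))
    extra = [col for col in header if col not in expected_fields]
    missing = [field for field in expected_fields if field not in header]
    return len(header), len(expected_fields), extra, missing


header_len, schema_size, extra, missing = compare_header(SAMPLE_CSV, EXPECTED_FIELDS)
print(f"Header length: {header_len}, schema size: {schema_size}")
print("Columns only in the header:", extra)
print("Fields only in the schema:", missing)
```

A trailing comma in the header line shows up here as an empty-string column name, which is one way a file can end up with one more header column than the schema expects.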