
Python AWS Glue job fails with the error "Error writing rows"


So, I created an ETL job that moves data from an S3 bucket into Redshift. The job steps (transformations) are:

  • CreateDynamicFrame (read from S3)
  • ApplyMapping
  • SelectFields
  • ResolveChoice
  • DropNullFields
  • WriteDynamicFrame (write to Redshift)

If the input files (gzip) in the S3 bucket are small (around 15 MB), the job runs fine. But as soon as any file exceeds roughly 20 MB, the job fails with the error "Error writing rows", RecordCount: 19xxxxxx, RecordData: (). Surprisingly, the same row can be inserted into Redshift manually without any error. So I now think this must be related to some limit on the input file size. Is there a specific limit? And how can I work around this, given that I cannot manipulate the input files?
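
One way to test the file-size theory is sketched below (my own diagnostic snippet, assuming the datasource0 frame from the script further down): gzip files are not splittable, so a single large .gz object usually ends up in one Spark partition, and the partition count can be checked directly.

    # Diagnostic only, not part of the original job: print how many partitions the
    # source DynamicFrame has. If a large gzip file collapses into a single
    # partition, one task ends up writing the entire temporary part file.
    print(datasource0.toDF().rdd.getNumPartitions())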

Here is the Glue Python script:

    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    
    args = getResolvedOptions(sys.argv, 
    [   'TempDir',
        'JOB_NAME'
    ])
    
    glueContext = GlueContext(SparkContext.getOrCreate())
    spark = glueContext.spark_session
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)
    
    datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "raw_data", table_name = "raw_table", transformation_ctx = "datasource0")
    
    applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [('col0', 'long', 'zip', 'int'), ('col1', 'string', 'state', 'string'), ('col2', 'string', 'mail_address', 'string'), ("other_columns_mapping")], transformation_ctx = "applymapping1")
    
    selectfields2 = SelectFields.apply(frame = applymapping1, paths = ["zip", "state", "mail_address", "other_column_names"], transformation_ctx = "selectfields2")
    
    resolvechoice3 = ResolveChoice.apply(frame = selectfields2, choice = "MATCH_CATALOG", database = "test_db", table_name = "dev_public_staging_table", transformation_ctx = "resolvechoice3")
    
    dropnullfields4 = DropNullFields.apply(frame = resolvechoice3, transformation_ctx = "dropnullfields4")
    
    datasink5 = glueContext.write_dynamic_frame.from_catalog(frame = dropnullfields4, database = "test_db", table_name = "dev_public_staging_table", redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink5")
    
    job.commit()
    
Error log:

    18/08/20 14:56:13 WARN TaskSetManager: Lost task 16.0 in stage 2.0 (TID 36, ip-172-31-55-211.us-west-2.compute.internal, executor 8): org.apache.spark.SparkException: Task failed while writing rows
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:270)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:189)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:188)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
    Caused by: com.univocity.parsers.common.TextWritingException: Error writing row.
    Internal state when error was thrown: recordCount=195778, recordData=[42101, 4941 CHARLES ST, PHILADELPHIA, PA, 19124, 2815, , , 4941 , , CHARLES, ST , , C034, 19960614, 0018 , 0301 , , DE, 23-2-3302-00, , , ROBERT, PENDER, , , , AMELIA SHERRILL, LEWIS, , , , , , PHILADELPHIA, PA, 19124, 2815, , 95 , 90 N 24, , , , , PHILADELPHIA, , , , SURVEY & PLAN MADE JOSEPH C BARNARD ESQUIRE, , , Y, SFR, 19960229, 34900, D, 0.000000, 0.000000, 34900.000000, , , 0.0000, , , 0.000000, , 0.0000, 4941 CHARLES ST, , , 103477, , , , , , , , , , , , , , , , , 1007, 1, 0, 0, 0, 0, 0, , 0, @NULL@, @NULL@]
    at com.univocity.parsers.common.AbstractWriter.throwExceptionAndClose(AbstractWriter.java:916)
    at com.univocity.parsers.common.AbstractWriter.writeRow(AbstractWriter.java:706)
    at org.apache.spark.sql.execution.datasources.csv.UnivocityGenerator.write(UnivocityGenerator.scala:82)
    at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.write(CSVFileFormat.scala:139)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:325)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:254)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1371)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:259)
    ... 8 more
    Suppressed: java.lang.IllegalStateException: Error closing the output.
    at com.univocity.parsers.common.AbstractWriter.close(AbstractWriter.java:861)
    at org.apache.spark.sql.execution.datasources.csv.UnivocityGenerator.close(UnivocityGenerator.scala:86)
    at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.close(CSVFileFormat.scala:141)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.releaseResources(FileFormatWriter.scala:335)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$1.apply$mcV$sp(FileFormatWriter.scala:262)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1380)
    ... 9 more
    Caused by: java.lang.IllegalStateException: Current state = FLUSHED, new state = CODING_END
    at java.nio.charset.CharsetEncoder.throwIllegalStateException(CharsetEncoder.java:992)
    at java.nio.charset.CharsetEncoder.encode(CharsetEncoder.java:572)
    at sun.nio.cs.StreamEncoder.flushLeftoverChar(StreamEncoder.java:242)
    at sun.nio.cs.StreamEncoder.implClose(StreamEncoder.java:301)
    at sun.nio.cs.StreamEncoder.close(StreamEncoder.java:149)
    at java.io.OutputStreamWriter.close(OutputStreamWriter.java:233)
    at com.univocity.parsers.common.AbstractWriter.close(AbstractWriter.java:857)
    ... 14 more
    Caused by: java.lang.IllegalStateException: Error closing the output.
    at com.univocity.parsers.common.AbstractWriter.close(AbstractWriter.java:861)
    at com.univocity.parsers.common.AbstractWriter.throwExceptionAndClose(AbstractWriter.java:903)
    at com.univocity.parsers.common.AbstractWriter.writeRow(AbstractWriter.java:811)
    at com.univocity.parsers.common.AbstractWriter.writeRow(AbstractWriter.java:704)
    ... 15 more
    Caused by: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The Content-MD5 you specified did not match what we received. (Service: Amazon S3; Status Code: 400; Error Code: BadDigest;
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1588)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1258)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1030)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:742)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:716)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4169)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4116)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.putObject(AmazonS3Client.java:1700)
    at com.amazon.ws.emr.hadoop.fs.s3.lite.call.PutObjectCall.performCall(PutObjectCall.java:34)
    at com.amazon.ws.emr.hadoop.fs.s3.lite.call.PutObjectCall.performCall(PutObjectCall.java:9)
    at com.amazon.ws.emr.hadoop.fs.s3.lite.call.AbstractUploadingS3Call.perform(AbstractUploadingS3Call.java:62)
    at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor.execute(GlobalS3Executor.java:80)
    at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:176)
    at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.putObject(AmazonS3LiteClient.java:104)
    at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.storeFile(Jets3tNativeFileSystemStore.java:165)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy40.storeFile(Unknown Source)
    at com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream.uploadSinglePart(MultipartUploadOutputStream.java:193)
    at com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream.close(MultipartUploadOutputStream.java:393)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:74)
    at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:108)
    at sun.nio.cs.StreamEncoder.implClose(StreamEncoder.java:320)
    at sun.nio.cs.StreamEncoder.close(StreamEncoder.java:149)
    at java.io.Output
    
The @NULL@ values in recordData are there because the Redshift table has two more columns than the original file. The MD5 error appears to occur because the temporary file (part-000x) is not created successfully as a result of the error.
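
For reference, below is a minimal sketch of one workaround that might be worth trying (an assumption on my part, not a confirmed fix): repartition the frame before the Redshift write so that each temporary part-000x file staged in TempDir stays small. The partition count of 20 is an arbitrary example value, and the snippet assumes the variables from the script above.

    from awsglue.dynamicframe import DynamicFrame

    # Hypothetical change to the job above: spread the rows over more partitions
    # before writing, so each temporary part-000x file uploaded to S3 stays small.
    repartitioned_df = dropnullfields4.toDF().repartition(20)
    repartitioned5 = DynamicFrame.fromDF(repartitioned_df, glueContext, "repartitioned5")

    datasink5 = glueContext.write_dynamic_frame.from_catalog(frame = repartitioned5, database = "test_db", table_name = "dev_public_staging_table", redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink5")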