
Python AWS Glue job: how to merge multiple output .csv files in S3

python, amazon-web-services, amazon-s3, jobs, aws-glue

I created an AWS Glue crawler and job. The goal is to transfer data from a Postgres RDS database table into a single .csv file in S3. Everything works, but I get a total of 19 files in S3. Every file is empty except three, which each contain one row of the database table together with the header. So every row of the database ends up in its own .csv file. What can I do here to specify that I want only one file, where the first row is the header and every database row follows after it?

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## @type: DataSource
## @args: [database = "gluedatabse", table_name = "postgresgluetest_public_account", transformation_ctx = "datasource0"]
## @return: datasource0
## @inputs: []
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "gluedatabse", table_name = "postgresgluetest_public_account", transformation_ctx = "datasource0")
## @type: ApplyMapping
## @args: [mapping = [("password", "string", "password", "string"), ("user_id", "string", "user_id", "string"), ("username", "string", "username", "string")], transformation_ctx = "applymapping1"]
## @return: applymapping1
## @inputs: [frame = datasource0]
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("user_id", "string", "user_id", "string"), ("username", "string", "username", "string"),("password", "string", "password", "string")], transformation_ctx = "applymapping1")
## @type: DataSink
## @args: [connection_type = "s3", connection_options = {"path": "s3://BUCKETNAMENOTSHOWN"}, format = "csv", transformation_ctx = "datasink2"]
## @return: datasink2
## @inputs: [frame = applymapping1]
datasink2 = glueContext.write_dynamic_frame.from_options(frame = applymapping1, connection_type = "s3", connection_options = {"path": "s3://BUCKETNAMENOTSHOWN"}, format = "csv", transformation_ctx = "datasink2")
job.commit()
The database looks like this:

In S3 it looks like this:

An example .csv in S3 looks like this:

password,user_id,username
346sdfghj45g,user3,dieter
As I said, there is one file per table row.

Edit: The multipart upload to S3 does not seem to work correctly. It uploads the parts, but does not merge them together once it has finished. Here are the last lines of the job log:

19/04/04 13:26:41 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 1 blocks
19/04/04 13:26:41 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
19/04/04 13:26:41 INFO Executor: Finished task 16.0 in stage 2.0 (TID 18). 2346 bytes result sent to driver
19/04/04 13:26:41 INFO MultipartUploadOutputStream: close closed:false s3://bucketname/run-1554384396528-part-r-00018
19/04/04 13:26:41 INFO MultipartUploadOutputStream: close closed:true s3://bucketname/run-1554384396528-part-r-00017
19/04/04 13:26:41 INFO MultipartUploadOutputStream: close closed:false s3://bucketname/run-1554384396528-part-r-00019
19/04/04 13:26:41 INFO Executor: Finished task 17.0 in stage 2.0 (TID 19). 2346 bytes result sent to driver
19/04/04 13:26:41 INFO MultipartUploadOutputStream: close closed:true s3://bucketname/run-1554384396528-part-r-00018
19/04/04 13:26:41 INFO Executor: Finished task 18.0 in stage 2.0 (TID 20). 2346 bytes result sent to driver
19/04/04 13:26:41 INFO MultipartUploadOutputStream: close closed:true s3://bucketname/run-1554384396528-part-r-00019
19/04/04 13:26:41 INFO Executor: Finished task 19.0 in stage 2.0 (TID 21). 2346 bytes result sent to driver
19/04/04 13:26:41 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
19/04/04 13:26:41 INFO CoarseGrainedExecutorBackend: Driver from 172.31.20.76:39779 disconnected during shutdown
19/04/04 13:26:41 INFO CoarseGrainedExecutorBackend: Driver from 172.31.20.76:39779 disconnected during shutdown
19/04/04 13:26:41 INFO MemoryStore: MemoryStore cleared
19/04/04 13:26:41 INFO BlockManager: BlockManager stopped
19/04/04 13:26:41 INFO ShutdownHookManager: Shutdown hook called
End of LogType:stderr
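
For reference, the number of part objects a run actually produced can be double-checked by listing the keys under the run prefix, for example with boto3 (a minimal sketch; the bucket name and run prefix below are taken from the log above and would need to be adjusted):

import boto3

## List every object this Glue run wrote under its output prefix
s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

count = 0
for page in paginator.paginate(Bucket="bucketname", Prefix="run-1554384396528"):
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"])
        count += 1

print("{} objects written by this run".format(count))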

Can you try the following?

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "gluedatabse", table_name = "postgresgluetest_public_account", transformation_ctx = "datasource0")

applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("user_id", "string", "user_id", "string"), ("username", "string", "username", "string"),("password", "string", "password", "string")], transformation_ctx = "applymapping1")

## Force one partition, so it can save only 1 file instead of 19
repartition = applymapping1.repartition(1)

datasink2 = glueContext.write_dynamic_frame.from_options(frame = repartition, connection_type = "s3", connection_options = {"path": "s3://BUCKETNAMENOTSHOWN"}, format = "csv", transformation_ctx = "datasink2")
job.commit()
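A note on this approach: repartition(1) forces a full shuffle of the data into a single partition. If that is a concern, a lighter-weight variant (a sketch, not part of the original answer, assuming the whole table fits comfortably in one partition) is to go through a Spark DataFrame and use coalesce(1) before converting back to a DynamicFrame:

from awsglue.dynamicframe import DynamicFrame

## Collapse to one partition on the DataFrame side; coalesce(1) avoids a full shuffle
single_df = applymapping1.toDF().coalesce(1)

## Convert back to a DynamicFrame so the Glue S3 sink can be reused unchanged
single_dyf = DynamicFrame.fromDF(single_df, glueContext, "single_dyf")

datasink2 = glueContext.write_dynamic_frame.from_options(frame = single_dyf, connection_type = "s3", connection_options = {"path": "s3://BUCKETNAMENOTSHOWN"}, format = "csv", transformation_ctx = "datasink2")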
Also, if you want to check how many partitions you currently have, you can try the following code. My guess is that there are 19, which is why 19 files are saved back to S3:

from awsglue.dynamicframe import DynamicFrame

## Change to Pyspark Dataframe
dataframe = DynamicFrame.toDF(applymapping1)
## Print number of partitions
print(dataframe.rdd.getNumPartitions())
## Change back to DynamicFrame
datasink2 = DynamicFrame.fromDF(dataframe, glueContext, "datasink2")

More information: I found out that these are not separate CSV files at all. They are the parts of the S3 upload, because the job uses multipart upload. But why does it not merge all the parts once the upload has finished?

I might be wrong, but it could be because the number of Spark executors is too high. Check the cluster and see how many executors you have. There should be a function called collect that lets you collect all the results and write them out to a single file. This is not recommended if the output is large.

Can you post the final table definition from the Data Catalog?

@simplycoding Can you tell me more about the collect function? How do I call it?

@Aidarmartinez Here you go:

Thanks! The file is still called "run-1554901534650-part-r-00000", but it now contains all the rows. That is a big step forward. One last question: is it possible to somehow add a .csv extension to that file?

I think something strange is going on; it should save the file in .csv format. Have a look at this example. I use the Spark write methods because they are more flexible, so you can try them as well:

datasink4.write.format("com.databricks.spark.csv").option("header", "true").mode("overwrite").save("s3://your-bucket-name")

Found the solution. I do not know whether this is intended, but it seems to work. I removed the datasink2 line and added the following line after the repartition:

repartition.toDF().write.mode("overwrite").format("csv").save("s3://BUCKETNAME/subfolder")

Now it is a single CSV file.
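
To also address the .csv extension question from the comments: Spark always writes part-file names under the output prefix, so one common workaround (a sketch only, not from the original thread; the bucket name, temporary prefix, and final key below are placeholders) is to write the single-partition result with a header to a temporary prefix and then copy the single part file to a properly named .csv key with boto3:

import boto3

## Write the single-partition result with a header to a temporary prefix
## (continuing from the "repartition" DynamicFrame in the answer above)
repartition.toDF().write.mode("overwrite").format("csv").option("header", "true").save("s3://BUCKETNAME/tmp_output")

## Copy the part file Spark produced to a key with a proper .csv name, then remove the part file
s3 = boto3.client("s3")
bucket = "BUCKETNAME"        ## placeholder bucket name
prefix = "tmp_output/"       ## temporary prefix written above
target_key = "account.csv"   ## placeholder for the desired final file name

part_keys = [obj["Key"]
             for obj in s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", [])
             if "part-" in obj["Key"]]

## With one partition there is exactly one part file, so copy-and-delete effectively renames it
s3.copy_object(Bucket=bucket, CopySource={"Bucket": bucket, "Key": part_keys[0]}, Key=target_key)
s3.delete_object(Bucket=bucket, Key=part_keys[0])

The header option keeps the column names in the first row, matching the sample output shown in the question.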