Python: How do I write multiple partitions from Spark?

python apache-spark pyspark

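I have a small file of about 1.5 KB that gets written to S3 as a single file. I actually want to write it to S3 as multiple part files to test partitioning, but I'm having trouble. How can I set this up to do that? Is there something I should be doing differently here?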

from pyspark.sql import SparkSession
from pyspark.sql.types import LongType, StringType, StructField, StructType, BooleanType, ArrayType, IntegerType, TimestampType

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.ui.enabled", "true") \
    .config("spark.default.parallelism", "4") \
    .config("spark.files.maxPartitionBytes", "500") \
    .master("yarn-client") \
    .getOrCreate()

# Explicit schema for the input CSV
myschema = StructType([
    StructField("field1", TimestampType(), True),
    StructField("field2", TimestampType(), True),
    StructField("field3", StringType(), True),
    StructField("field4", StringType(), True),
    StructField("field5", StringType(), True)
])

# Read the small CSV file from S3
mydf = spark.read.load("s3a://bucket/myfile.csv",
                       format="csv",
                       sep=",",
                       # inferSchema="true",
                       timestampFormat="MM/dd/yyyy HH:mm:ss",
                       header="true",
                       schema=myschema
                       )

mydf.coalesce(5)

# Write the result back to S3
mydf.write.csv(path="s3a://bucket/output",
               header="true"
               )

Replace mydf.coalesce(5) with mydf.repartition(5).

It still doesn't work with repartition.

What do you mean by "it's not working"? What do you see on S3?

I see the same output as before: a folder containing a _SUCCESS file and a single output file.