Apache Spark: concatenating multiple CSV files into a single file with PySpark


apache-spark hadoop pyspark


I need to call the Hadoop FileSystem function concat(Path trg, Path[] psrcs) from PySpark.

My code is:

orig1_fs = spark._jvm.org.apache.hadoop.fs.Path(f'{tmp_path}{filename1}')
orig2_fs = spark._jvm.org.apache.hadoop.fs.Path(f'{tmp_path}{filename2}')
dest_fs = spark._jvm.org.apache.hadoop.fs.Path(dest_path)    
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
fs.concat(dest_fs, list((orig1_fs , orig2_fs)))

But I get an error:

How do I use this function?

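This is because the second parameter of the concat method is a Java array (Path[]), not an ArrayList. Py4J converts a Python list into a java.util.ArrayList, so no matching concat overload is found.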

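# Convert from `ArrayList` to `Path[]`
# (sc is the SparkContext, e.g. sc = spark.sparkContext)
py_paths = [orig1_fs, orig2_fs]
java_paths = sc._gateway.new_array(spark._jvm.org.apache.hadoop.fs.Path, len(py_paths))
for i in range(len(py_paths)):
    java_paths[i] = py_paths[i]
# The new array can now be passed to concat
fs.concat(dest_fs, java_paths)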
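
For reuse, the conversion above can be wrapped in a small helper. The following is only a sketch: the function names `to_java_array` and `concat_files` are hypothetical, and it assumes the default file system supports `concat` (HDFS does; other implementations may raise UnsupportedOperationException). Note that `concat` joins the files byte for byte, so if every CSV has a header row, the headers will presumably be repeated in the result.

```python
def to_java_array(gateway, java_class, items):
    """Copy a Python sequence into a newly allocated Java array via Py4J."""
    items = list(items)
    # new_array allocates a Java array of the given element class and length
    array = gateway.new_array(java_class, len(items))
    for i, item in enumerate(items):
        array[i] = item
    return array


def concat_files(spark, target, sources):
    """Hypothetical wrapper: call FileSystem.concat(target, sources) from PySpark.

    `target` and `sources` are path strings; `spark` is a SparkSession.
    """
    jvm = spark._jvm
    Path = jvm.org.apache.hadoop.fs.Path
    fs = jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
    # Build a real Path[] instead of letting Py4J send an ArrayList
    src_array = to_java_array(
        spark.sparkContext._gateway, Path, [Path(s) for s in sources]
    )
    fs.concat(Path(target), src_array)
```

With the variables from the question, the call would look like concat_files(spark, dest_path, [f'{tmp_path}{filename1}', f'{tmp_path}{filename2}']).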