How to rename the JSON file generated by pyspark?
When I use

dataframe.coalesce(1).write.format('json')

in pyspark, I cannot change the name of the file inside the partition directory. I write my JSON like this:

dataframe.coalesce(1).write.format('json').mode('overwrite').save('path')

but I cannot change the name of the file. I want a path like this:

/folder/my_name.json
where 'my_name.json' is the JSON file.

In Spark we cannot control the name of the file written to the output directory. Write the data to an HDFS directory first, and then rename the file through the HDFS API.
Example, in Pyspark:
l = [("a", 1)]
ll = ["id", "sa"]
df = spark.createDataFrame(l, ll)

hdfs_dir = "/folder/"          # your hdfs directory
new_filename = "my_name.json"  # new filename

df.coalesce(1).write.format("json").mode("overwrite").save(hdfs_dir)

# get the Hadoop FileSystem and Path classes through the JVM gateway
Path = spark._jvm.org.apache.hadoop.fs.Path
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())

# list the files in the directory
list_status = fs.listStatus(Path(hdfs_dir))

# pick the file whose name starts with part-
file_name = [file.getPath().getName() for file in list_status if file.getPath().getName().startswith('part-')][0]

# rename the file
fs.rename(Path(hdfs_dir + file_name), Path(hdfs_dir + new_filename))
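
For reuse, the write-then-rename steps can be folded into a small helper. A minimal sketch, assuming an active spark session; write_single_json is a hypothetical name, not part of any Spark API:

def write_single_json(spark, df, hdfs_dir, filename):
    """Write df as a single JSON file named filename inside hdfs_dir.
    (Hypothetical helper wrapping the pattern shown above.)"""
    Path = spark._jvm.org.apache.hadoop.fs.Path
    fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
    df.coalesce(1).write.format("json").mode("overwrite").save(hdfs_dir)
    for status in fs.listStatus(Path(hdfs_dir)):
        name = status.getPath().getName()
        if name.startswith("part-"):
            fs.rename(Path(hdfs_dir + name), Path(hdfs_dir + filename))
            break

# usage:
# write_single_json(spark, df, "/folder/", "my_name.json")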
In Scala:

val df = Seq(("a", 1)).toDF("id", "sa")
df.show(false)

import org.apache.hadoop.fs._

val hdfs_dir = "/folder/"
val new_filename = "new_json.json"

df.coalesce(1).write.mode("overwrite").format("json").save(hdfs_dir)

val fs = FileSystem.get(sc.hadoopConfiguration)

// find the part- file produced by the save above (coalesce(1) guarantees a single one)
val f = fs.globStatus(new Path(s"${hdfs_dir}" + "*")).filter(x => x.getPath.getName.startsWith("part-")).map(x => x.getPath.getName).mkString

// rename it, then remove the _SUCCESS marker file
fs.rename(new Path(s"${hdfs_dir}${f}"), new Path(s"${hdfs_dir}${new_filename}"))
fs.delete(new Path(s"${hdfs_dir}" + "_SUCCESS"))
If you want to remove the _SUCCESS marker file from the directory, use fs.delete to delete the _SUCCESS file.
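
The same cleanup from Pyspark, as a sketch reusing the fs, Path and hdfs_dir variables from the example above:

success_path = Path(hdfs_dir + "_SUCCESS")
if fs.exists(success_path):         # guard in case the marker was never written
    fs.delete(success_path, False)  # False = non-recursive delete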
I don't think you can control the name of the output file; you can only provide the folder name.

Not sure how to use the Java Path class correctly from Python.
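
On the Path question: from Python the class has to be fetched through Py4J's JVM gateway before it can be used. A minimal sketch, assuming an active spark session and the variables from the Pyspark example above:

# fetch the Hadoop Path class through the JVM gateway
Path = spark._jvm.org.apache.hadoop.fs.Path
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
# Path(...) now builds Java Path objects that the FileSystem methods accept
fs.rename(Path(hdfs_dir + file_name), Path(hdfs_dir + new_filename))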