Reading part files from HDFS into a DataFrame using PySpark (pyspark, apache-spark-sql, hdfs, partitioning)

I have multiple files stored in an HDFS location, as follows:

/user/project/202005/part-01798

/user/project/202005/part-01799

There are 2000 such part files. Each file has the same format:

{'Name':'abc','Age':28,'Marks':[20,25,30]} 
{'Name':...}

and so on. I have two questions:

1) How can I check whether these are multiple files or multiple partitions of the same file?
2) How can I read these into a DataFrame using PySpark?

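Since these files sit in a single directory and are named part-xxxxx, you can safely assume they are multiple part files of the same dataset. If they were partitions, they would be saved under a path like /user/project/date=202005/*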
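As a rough illustration of that naming convention, here is a minimal sketch that looks for key=value directory components in a path. Note that partition_columns is a hypothetical helper written for this example, not part of PySpark, and it only inspects the path string; it does not contact HDFS.

```python
from pathlib import PurePosixPath


def partition_columns(path: str) -> dict:
    """Collect key=value directory components (partition-style folders)."""
    columns = {}
    # Only look at the directories, not at the part file name itself.
    for part in PurePosixPath(path).parent.parts:
        key, sep, value = part.partition("=")
        if sep and key and value:
            columns[key] = value
    return columns


# Layout from the question: plain part files, no key=value directories.
print(partition_columns("/user/project/202005/part-01798"))
# Layout a partitioned dataset would have (this part file name is illustrative).
print(partition_columns("/user/project/date=202005/part-00000"))
```

The first call prints {} and the second prints {'date': '202005'}, which matches the distinction drawn above: the files in the question are simply pieces of one dataset.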

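You can pass the directory "/user/project/202005" to Spark as the input, as shown below, assuming these are CSV files:

df = spark.read.csv('/user/project/202005/*', header=True, inferSchema=True)

spark.read.json worked for me. But thanks for pointing me in the right direction and clarifying the first question.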