Python 3.x Spark: Read CSV files from a list of paths in a DataFrame row


python-3.x apache-spark pyspark


I have a Spark DataFrame that looks like this:

# ---------------------------------
# - column 1 - ...  -   column 5  -
# ---------------------------------
# - ...             - Array of paths

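Columns 1 to 4 contain strings, and the fifth column contains a list of strings that are actually paths to CSV files I wish to read as Spark DataFrames. I cannot find any way to read them. Here is a simplified version with just a single column and the column holding the list of paths: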

from pyspark.sql import SparkSession,Row

spark = SparkSession \
        .builder \
        .appName('test') \
        .getOrCreate()

simpleRDD = spark.sparkContext.parallelize(range(10))
simpleRDD = simpleRDD.map(lambda x: Row(**{'a':x,'paths':['{}_{}.csv'.format(y**2,y+1) for y in range(x+1)]}))

simpleDF = spark.createDataFrame(simpleRDD)
print(simpleDF.head(5))

This gives:

[Row(a=0, paths=['0_1.csv']),  
 Row(a=1, paths=['0_1.csv', '1_2.csv']),  
 Row(a=2, paths=['0_1.csv', '1_2.csv', '4_3.csv']),  
 Row(a=3, paths=['0_1.csv', '1_2.csv', '4_3.csv', '9_4.csv']),  
 Row(a=4, paths=['0_1.csv', '1_2.csv', '4_3.csv', '9_4.csv', '16_5.csv'])]

Then I would like to do something like this:

simpleDF = simpleDF.withColumn('data',spark.read.csv(simpleDF.paths))

...but of course this does not work.

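I'm not sure how you intend to store the DataFrame objects once you have read them in from their paths, but if it is a matter of accessing the values in your DataFrame column, you can use the .collect() method to return your DataFrame as a list of Row objects (just like an RDD).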

from pyspark.sql import SparkSession,Row

from pyspark.sql.types import *

spark = SparkSession \
        .builder \
        .appName('test') \
        .getOrCreate()

inp=[['a','b','c','d',['abc\t1.txt','abc\t2.txt','abc\t3.txt','abc\t4.txt','abc\t5.txt',]],
            ['f','g','h','i',['def\t1.txt','def\t2.txt','def\t3.txt','def\t4.txt','def\t5.txt',]],
            ['k','l','m','n',['ghi\t1.txt','ghi\t2.txt','ghi\t3.txt','ghi\t4.txt','ghi\t5.txt',]]
           ]

inp_data=spark.sparkContext.parallelize(inp)

# Define the schema

schema = StructType([StructField('field1',StringType(),True),
                      StructField('field2',StringType(),True),
                      StructField('field3',StringType(),True),
                      StructField('field4',StringType(),True),
                      StructField('field5',ArrayType(StringType(),True))
                     ])

# Create the DataFrame

dataframe=spark.createDataFrame(inp_data,schema)
dataframe.createOrReplaceTempView("dataframe")
dataframe.select("field5").filter("field1='a'").show()

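Each Row object has an .asDict() method that converts it to a Python dictionary object. Once you have that, you can access the values by indexing the dictionary with its keys.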

Assuming you are content to store the returned DataFrames in a dictionary keyed by path, you could try the following:

# collect the DataFrame into a list of Rows
rows = simpleDF.collect()

# collect all the values in your `paths` column
# (note that this yields one list of paths per row)
paths = map(lambda row: row.asDict().get('paths'), rows)

# flatten the list of lists
paths_flat = [path for path_list in paths for path in path_list]

# get the unique set of paths
paths_unique = list(set(paths_flat))

# instantiate an empty dictionary in which to collect DataFrames
dfs_dict = {}
for path in paths_unique:
    dfs_dict[path] = spark.read.csv(path)

Your dfs_dict will now contain all of the DataFrames. To get the DataFrame for a specific path, you can access it using the path as the dictionary key:
df_0_01 = dfs_dict['0_1.csv']
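Since the list of paths differs from row to row, it may also help to keep the grouping by row rather than only by path. The following is a minimal sketch under stated assumptions: `frames_by_key` and `fake_load` are hypothetical names introduced here, the rows are plain dictionaries standing in for `row.asDict()` output, and the loader is passed in as a parameter so the logic can be tried without Spark. With a live session you would pass `spark.read.csv` as the loader.

```python
from typing import Any, Callable, Dict, Iterable, List, Mapping


def frames_by_key(
    rows: Iterable[Mapping[str, Any]],
    load: Callable[[str], Any],
    key: str = "a",
    paths_col: str = "paths",
) -> Dict[Any, List[Any]]:
    """Map each row's key value to the objects loaded from its paths.

    Each distinct path is loaded only once and reused across rows.
    """
    cache: Dict[str, Any] = {}
    result: Dict[Any, List[Any]] = {}
    for row in rows:
        loaded = []
        for path in row[paths_col]:
            if path not in cache:
                cache[path] = load(path)
            loaded.append(cache[path])
        result[row[key]] = loaded
    return result


# Stand-in data shaped like [row.asDict() for row in simpleDF.collect()].
rows = [
    {"a": 0, "paths": ["0_1.csv"]},
    {"a": 1, "paths": ["0_1.csv", "1_2.csv"]},
]

# Stub loader that records which paths were requested.
calls = []


def fake_load(path):
    calls.append(path)
    return "frame:" + path


grouped = frames_by_key(rows, fake_load)
```

If all files belonging to one row share a schema, another option is to pass that row's whole list to `spark.read.csv`, which accepts a list of paths and returns a single DataFrame.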

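First, please format your input DataFrame properly. Second, add the list of paths to the DataFrame row.
There is a list of paths in the DataFrame row ('paths': ...). The point is that this list of paths depends on what is in another column. The DataFrame format I gave is just an example and is not relevant to the goal of the question.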