Merging multiple DataFrames in PySpark

apache-spark pyspark

I need to merge 20 DataFrames, each with many thousands of records.

Each DataFrame has 2 columns:

df1:

root
  |-- id: string (nullable = true)
  |-- col1: string (nullable = true)

df2:

root
  |-- id: string (nullable = true)
  |-- col2: string (nullable = true)

Final df:

root
  |-- id: string (nullable = true)
  |-- col1: string (nullable = true)
  |-- col2: string (nullable = true)
  .
  .
  |-- col19: string (nullable = true)

I tried:

df = df1 \
        .join(df2, 'ID', 'full') \
        .join(df3, 'ID', 'full') \
        .join(df4, 'ID', 'full') \
        .join(df5, 'ID', 'full') \
        .
        .
        .
        .join(df19, 'ID', 'full')

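It fails after 30-40 minutes with an out-of-memory error. I tried 4-16 executors with 8 GB of memory.
The DataFrames contain duplicate IDs, which makes things worse.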
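
To make the cost of duplicate IDs concrete, here is a small plain-Python sketch (no Spark needed) that counts how many rows a single id produces after a chain of full outer joins. The function name `rows_after_full_join` and its input format are hypothetical, chosen only for illustration; it assumes the id occurs in at least one DataFrame.

```python
from math import prod


def rows_after_full_join(dup_counts):
    """Rows produced for ONE id after chaining full outer joins on that id.

    dup_counts[i] is the number of rows DataFrame i holds for the id.
    A DataFrame that lacks the id contributes a factor of 1, because a
    full outer join keeps the existing rows and pads them with nulls.
    """
    return prod(max(n, 1) for n in dup_counts)


# 10 DataFrames, each with 2 rows for the same id
print(rows_after_full_join([2] * 10))  # 1024

# 20 DataFrames, each with exactly 1 row for the id
print(rows_after_full_join([1] * 20))  # 1
```

With one row per id in every DataFrame the result stays at one row per id, which is why deduplicating before the join matters so much here.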

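Is there another way to merge these DataFrames?

Would sorting and removing duplicates before joining help?

Does the join order matter as well, for example where the DataFrames with the most records are placed?

Would it help to split the 20 joins into batches (say, batches of 5), run an action on each batch (such as a count), and then join the batches?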

What if you normalize the df column names, for example like this:

df1:

root
  |-- id: string (nullable = true)
  |-- col1: string (nullable = true)

df2:

root
  |-- id: string (nullable = true)
  |-- col1: string (nullable = true)

After that you can do a union:

df1.union(df2).dropDuplicates(subset=["id"])

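Removing duplicates will definitely help; otherwise you may see exponential growth of the data: if each of 10 DataFrames has 2 rows with the same id, then after joining them you get pow(2, 10) = 1024 rows with that id in the result.

Is your goal to join these DataFrames or to union them?

@moriarty007 Every DataFrame has one distinct column, so the final df should have 1 + (number of DataFrames) columns. I believe this merge has to be done with a join.

Try breaking the join into multiple statements:

df1to5 = df1.join(df2, 'ID', 'full') \
        .join(df3, 'ID', 'full') \
        .join(df4, 'ID', 'full') \
        .join(df5, 'ID', 'full')
df1to5.join(df6to10, 'ID', 'full')

and so on. See if that works.

Union appends data row-wise. I want to merge all the columns from all the DataFrames. Apparently join is my only option.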
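
The two suggestions from the comments (drop duplicates first, then join in batches) could be combined roughly as follows. This is a minimal sketch, not a tested fix for the memory error: the helper names `dedupe_on_key`, `join_all` and `join_in_batches` are hypothetical, and it assumes every DataFrame shares the key column `id` and that keeping one arbitrary row per id is acceptable. It only relies on the PySpark `DataFrame.dropDuplicates` and `DataFrame.join` methods.

```python
from functools import reduce


def dedupe_on_key(df, key="id"):
    # Keep one (arbitrary) row per key so later joins cannot multiply rows.
    return df.dropDuplicates([key])


def join_all(dfs, key="id", how="full"):
    # Fold a non-empty list of DataFrames into one with repeated joins on `key`.
    return reduce(lambda left, right: left.join(right, key, how), dfs)


def join_in_batches(dfs, key="id", how="full", batch_size=5):
    # 1. Deduplicate every input on the join key.
    deduped = [dedupe_on_key(df, key) for df in dfs]
    # 2. Split the inputs into consecutive batches of `batch_size`.
    batches = [deduped[i:i + batch_size]
               for i in range(0, len(deduped), batch_size)]
    # 3. Join within each batch, then join the partial results together.
    partials = [join_all(batch, key, how) for batch in batches]
    return join_all(partials, key, how)


# Hypothetical usage, where `all_dfs` is a Python list holding the DataFrames:
# wide_df = join_in_batches(all_dfs, key="id", how="full", batch_size=5)
```

Because Spark evaluates lazily, batching on its own builds the same overall plan as one long chain. Materializing each partial result (for example with `DataFrame.checkpoint()` after setting a checkpoint directory, or by writing it out and reading it back) is what would actually cut the lineage; whether that resolves the memory error here is untested.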