What's the most efficient way to accumulate DataFrames in PySpark?


python apache-spark dataframe pyspark


I have a DataFrame (or it could be any RDD) containing several million rows in a well-known schema like this:

Key | FeatureA | FeatureB
--------------------------
U1  |        0 |         1
U2  |        1 |         1


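I need to load a dozen other datasets from disk that contain different features for the same set of keys. Some datasets are up to a dozen or so columns wide. Imagine: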

Key | FeatureC | FeatureD |  FeatureE
-------------------------------------
U1  |        0 |        0 |         1

Key | FeatureF
--------------
U2  |        1

This feels like a fold or an accumulation, where I just want to iterate over all the datasets and get back something like this:


Key | FeatureA | FeatureB | FeatureC | FeatureD | FeatureE | FeatureF 
---------------------------------------------------------------------
U1  |        0 |        1 |        0 |        0 |        1 |        0
U2  |        1 |        1 |        0 |        0 |        0 |        1

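I've tried loading each DataFrame and then joining, but that takes forever once I get past a handful of datasets. Am I missing a common pattern or an efficient way of accomplishing this task?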

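Assuming there is at most one row per key in each DataFrame and all keys are of primitive types, you can try a union followed by an aggregation. Let's start with some imports and example data (`sc` is the SparkContext that the PySpark shell provides):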

from itertools import chain
from functools import reduce
from pyspark.sql.types import StructType
from pyspark.sql.functions import col, lit, max  # note: this max shadows Python's built-in max
from pyspark.sql import DataFrame

df1 = sc.parallelize([
    ("U1", 0, 1), ("U2", 1, 1)
]).toDF(["Key", "FeatureA", "FeatureB"])

df2 = sc.parallelize([
  ("U1", 0, 0, 1)
]).toDF(["Key", "FeatureC", "FeatureD", "FeatureE"])

df3 = sc.parallelize([("U2", 1)]).toDF(["Key", "FeatureF"])

dfs = [df1, df2, df3]

Next we can extract the common schema:

output_schema = StructType(
  [df1.schema.fields[0]] + list(chain(*[df.schema.fields[1:] for df in dfs]))
)

and transform all the DataFrames, padding each one with null columns for the features it lacks:

transformed_dfs = [df.select(*[
  lit(None).cast(c.dataType).alias(c.name) if c.name not in df.columns 
  else col(c.name)
  for c in output_schema.fields
]) for df in dfs]

Finally, a union and a dummy aggregation:

combined = reduce(DataFrame.unionAll, transformed_dfs)
exprs = [max(c).alias(c) for c in combined.columns[1:]]
result = combined.repartition(col("Key")).groupBy(col("Key")).agg(*exprs)
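
To see why the dummy `max` acts as a merge, here is a minimal plain-Python sketch of the same union-then-aggregate logic, with no Spark involved. It assumes, as above, at most one row per key in each input, and it models the way Spark's `max` skips nulls by ignoring `None`. The helper name `merge_by_key` and the dict-based rows are purely illustrative and not part of any Spark API.

```python
import builtins
from collections import defaultdict

def merge_by_key(datasets, key="Key"):
    """Union all rows, then keep the max non-None value per key and column."""
    # Common schema: every feature column seen in any dataset, in first-seen order.
    columns = []
    for rows in datasets:
        for row in rows:
            for name in row:
                if name != key and name not in columns:
                    columns.append(name)

    # "Union" + "groupBy(key)": gather the values observed for each (key, column).
    seen = defaultdict(lambda: defaultdict(list))
    for rows in datasets:
        for row in rows:
            per_key = seen[row[key]]  # keeps the key even if every value is None
            for name in columns:
                value = row.get(name)  # a missing column plays the role of lit(None)
                if value is not None:
                    per_key[name].append(value)

    # "agg(max)": builtins.max is spelled out because the answer's import
    # rebinds the bare name `max` to the Spark column function.
    return {
        k: {n: builtins.max(vals[n]) if vals[n] else None for n in columns}
        for k, vals in seen.items()
    }

df1 = [{"Key": "U1", "FeatureA": 0, "FeatureB": 1},
       {"Key": "U2", "FeatureA": 1, "FeatureB": 1}]
df2 = [{"Key": "U1", "FeatureC": 0, "FeatureD": 0, "FeatureE": 1}]
df3 = [{"Key": "U2", "FeatureF": 1}]

result = merge_by_key([df1, df2, df3])
```

In this sketch the cells that no dataset supplies come back as `None` (null in Spark), not as the `0` shown in the desired table in the question. If zeros are wanted, filling the nulls afterwards, for example with `result.na.fill(0)` on the Spark DataFrame, would be a separate step.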

If there is more than one row per key but the individual columns are still atomic, you can try replacing max with collect_list or collect_set, followed by explode.
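
With several rows per key, `max` would silently keep only the largest value in each column. As a rough plain-Python illustration of what a `collect_set`-style aggregation gathers instead (the distinct non-null values per key and column), here is a small sketch. The helper `collect_sets` and the sample rows are hypothetical and only model the idea.

```python
from collections import defaultdict

def collect_sets(rows, key="Key"):
    """Group rows by key and gather the distinct non-None values of each column."""
    grouped = defaultdict(lambda: defaultdict(set))
    for row in rows:
        for name, value in row.items():
            if name != key and value is not None:
                grouped[row[key]][name].add(value)
    # Convert the nested defaultdicts to plain dicts for easier inspection.
    return {k: dict(cols) for k, cols in grouped.items()}

rows = [
    {"Key": "U1", "FeatureA": 0, "FeatureB": None},
    {"Key": "U1", "FeatureA": 1, "FeatureB": 1},
    {"Key": "U2", "FeatureA": 1, "FeatureB": None},
]

sets = collect_sets(rows)
```

Here `U1` keeps both values of `FeatureA`, which `max` would have reduced to one. In this sketch a column with no non-null values for a key is simply absent from that key's dict. Exploding the collected values would then turn each of them back into its own row.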

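For a bit of context, I've tried sorting the DataFrames by key in the hope of taking advantage of some partitioning, but did not see much change in execution time. I also tried treating the data as strings, reducing by key and concatenating the values, which actually works fairly well.

@zero323 AFAIK there is no way to explicitly partition a DataFrame other than orderBy, which should be used in the logical plan (I did try that). If I convert it to an RDD, then I might be able to aggregateByKey? Let's see.

Why not use the "merge" functionality introduced in Spark?

Awesome. I'm looking into this approach now and will report back today with questions, comments, and so on.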