PySpark: selecting data from multiple rows into a single row


I am very new to functional programming and PySpark, and I am currently struggling to condense the data I need out of the source data.

Say I have two tables as DataFrames:

# if not already created automatically, instantiate the SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

columns = ['Id', 'JoinId', 'Name']
vals = [(1, 11, 'FirstName'), (2, 12, 'SecondName'), (3, 13, 'ThirdName')]
persons = spark.createDataFrame(vals, columns)

columns = ['Id', 'JoinId', 'Specification', 'Date', 'Destination']
vals = [(1, 10, 'I', '20051205', 'New York City'), (2, 11, 'I', '19991112', 'Berlin'), (3, 11, 'O', '20030101', 'Madrid'), (4, 13, 'I', '20200113', 'Paris'), (5, 11, 'U', '20070806', 'Lissabon')]
movements = spark.createDataFrame(vals, columns)

persons.show()
+---+------+----------+
| Id|JoinId|      Name|
+---+------+----------+
|  1|    11| FirstName|
|  2|    12|SecondName|
|  3|    13| ThirdName|
+---+------+----------+

movements.show()
+---+------+-------------+--------+-------------+
| Id|JoinId|Specification|    Date|  Destination|
+---+------+-------------+--------+-------------+
|  1|    10|            I|20051205|New York City|
|  2|    11|            I|19991112|       Berlin|
|  3|    11|            O|20030101|       Madrid|
|  4|    13|            I|20200113|        Paris|
|  5|    11|            U|20070806|     Lissabon|
+---+------+-------------+--------+-------------+
What I would like to create is:

+--------+----------+---------+---------+-----------+
|PersonId|PersonName|    IDate|    ODate|Destination|
+--------+----------+---------+---------+-----------+
|       1| FirstName| 19991112| 20030101|     Berlin|
|       3| ThirdName| 20200113|         |      Paris|
+--------+----------+---------+---------+-----------+

The rules would be:

  • PersonId is the Id of the person
  • IDate is the Date held in the movements DataFrame where the Specification is I
  • ODate is the Date held in the movements DataFrame where the Specification is O
  • Destination is the Destination of the joined row whose Specification is I
  • I have already joined the DataFrames on JoinId:

    from pyspark.sql.functions import col

    joined = persons.withColumnRenamed('JoinId', 'P_JoinId').join(movements, col('P_JoinId') == movements.JoinId, how='inner')
    
    joined.show()
    +---+--------+---------+---+------+-------------+--------+-----------+
    | Id|P_JoinId|     Name| Id|JoinId|Specification|    Date|Destination|
    +---+--------+---------+---+------+-------------+--------+-----------+
    |  1|      11|FirstName|  2|    11|            I|19991112|     Berlin|
    |  1|      11|FirstName|  3|    11|            O|20030101|     Madrid|
    |  1|      11|FirstName|  5|    11|            U|20070806|   Lissabon|
    |  3|      13|ThirdName|  4|    13|            I|20200113|      Paris|
    +---+--------+---------+---+------+-------------+--------+-----------+
    
    But I am struggling to select the data from these multiple rows and put it into a single row according to the given rules.


    Thanks for your help.

    Note: I have renamed the Id in movements to Id_movements to avoid confusion when grouping later.

    You can pivot the joined data on the Specification and aggregate the Date and Destination. You will then get the date and destination per specification.

    import pyspark.sql.functions as F

    persons = spark.createDataFrame(
        [(1, 11, 'FirstName'), (2, 12, 'SecondName'), (3, 13, 'ThirdName')],
        schema=['Id', 'JoinId', 'Name'])
    movements = spark.createDataFrame(
        [(1, 10, 'I', '20051205', 'New York City'), (2, 11, 'I', '19991112', 'Berlin'),
         (3, 11, 'O', '20030101', 'Madrid'), (4, 13, 'I', '20200113', 'Paris'),
         (5, 11, 'U', '20070806', 'Lissabon')],
        schema=['Id_movements', 'JoinId', 'Specification', 'Date', 'Destination'])
    df_joined = persons.withColumnRenamed('JoinId', 'P_JoinId').join(
        movements, F.col('P_JoinId') == movements.JoinId, how='inner')
    df_pivot = df_joined.groupby(['Id', 'Name']).pivot('Specification').agg(
        F.min('Date').alias('date'), F.min('Destination').alias('destination'))
    
    Here I have chosen the min aggregation, but you can pick whichever one you need, and then drop the irrelevant columns.

    The result:

    +---+---------+--------+-------------+--------+-------------+--------+-------------+
    | Id|     Name|  I_date|I_destination|  O_date|O_destination|  U_date|U_destination|
    +---+---------+--------+-------------+--------+-------------+--------+-------------+
    |  1|FirstName|19991112|       Berlin|20030101|       Madrid|20070806|     Lissabon|
    |  3|ThirdName|20200113|        Paris|    null|         null|    null|         null|
    +---+---------+--------+-------------+--------+-------------+--------+-------------+
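    To go from this pivoted result to the desired output, a final select with renames should be enough. This is a minimal sketch assuming the pivoted column names I_date, O_date and I_destination shown in the result above (it rebuilds the answer's DataFrames so it runs on its own):

    ```python
    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()

    persons = spark.createDataFrame(
        [(1, 11, 'FirstName'), (2, 12, 'SecondName'), (3, 13, 'ThirdName')],
        schema=['Id', 'JoinId', 'Name'])
    movements = spark.createDataFrame(
        [(1, 10, 'I', '20051205', 'New York City'), (2, 11, 'I', '19991112', 'Berlin'),
         (3, 11, 'O', '20030101', 'Madrid'), (4, 13, 'I', '20200113', 'Paris'),
         (5, 11, 'U', '20070806', 'Lissabon')],
        schema=['Id_movements', 'JoinId', 'Specification', 'Date', 'Destination'])

    df_joined = persons.withColumnRenamed('JoinId', 'P_JoinId').join(
        movements, F.col('P_JoinId') == movements.JoinId, how='inner')
    df_pivot = df_joined.groupby(['Id', 'Name']).pivot('Specification').agg(
        F.min('Date').alias('date'), F.min('Destination').alias('destination'))

    # keep only the columns of the desired output and rename them
    result = df_pivot.select(
        F.col('Id').alias('PersonId'),
        F.col('Name').alias('PersonName'),
        F.col('I_date').alias('IDate'),
        F.col('O_date').alias('ODate'),
        F.col('I_destination').alias('Destination'))
    result.show()
    ```

    ODate for person 3 comes out as null rather than empty, since no O movement exists for that person.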
    

    Comments:

      • Could you add the desired output?
      • It is mentioned after "What I would like to create is…" :)
      • Thanks for the quick response, I will look into it! That is interesting, I have never used pivot before. How would I condense the three result rows for person 1 into one row in a next step?
      • I just added the result to the answer. You don't have to condense anything further: because of the grouping it is already condensed according to the aggregation you chose. You may need to do some column dropping and renaming though.
      • You're welcome. And thank you for structuring the question so well, with all the code needed to reproduce the example.
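    For completeness, the same one-row-per-person shape can also be built without pivot, using conditional aggregation. This is a sketch, not from the original answer, reusing the same join and the same column names:

    ```python
    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()

    persons = spark.createDataFrame(
        [(1, 11, 'FirstName'), (2, 12, 'SecondName'), (3, 13, 'ThirdName')],
        schema=['Id', 'JoinId', 'Name'])
    movements = spark.createDataFrame(
        [(1, 10, 'I', '20051205', 'New York City'), (2, 11, 'I', '19991112', 'Berlin'),
         (3, 11, 'O', '20030101', 'Madrid'), (4, 13, 'I', '20200113', 'Paris'),
         (5, 11, 'U', '20070806', 'Lissabon')],
        schema=['Id_movements', 'JoinId', 'Specification', 'Date', 'Destination'])

    df_joined = persons.withColumnRenamed('JoinId', 'P_JoinId').join(
        movements, F.col('P_JoinId') == movements.JoinId, how='inner')

    # F.when(...) without otherwise() yields null for non-matching rows,
    # and F.max skips nulls, so each aggregate picks the value from the
    # row with the wanted Specification
    result = df_joined.groupBy('Id', 'Name').agg(
        F.max(F.when(F.col('Specification') == 'I', F.col('Date'))).alias('IDate'),
        F.max(F.when(F.col('Specification') == 'O', F.col('Date'))).alias('ODate'),
        F.max(F.when(F.col('Specification') == 'I', F.col('Destination'))).alias('Destination'))
    ```

    If a person could have several I or O movements, swap F.max for the aggregation that matches your rule, just as with the pivot variant.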