PySpark: selecting data from multiple rows into one row
I am very new to functional programming and PySpark, and I am currently struggling to condense the data I need out of the source data. Suppose I have two tables as DataFrames:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# if not already created automatically, instantiate a SparkSession
spark = SparkSession.builder.getOrCreate()
columns = ['Id', 'JoinId', 'Name']
vals = [(1, 11, 'FirstName'), (2, 12, 'SecondName'), (3, 13, 'ThirdName')]
persons = spark.createDataFrame(vals, columns)
columns = ['Id', 'JoinId', 'Specification', 'Date', 'Destination']
vals = [(1, 10, 'I', '20051205', 'New York City'), (2, 11, 'I', '19991112', 'Berlin'),
        (3, 11, 'O', '20030101', 'Madrid'), (4, 13, 'I', '20200113', 'Paris'),
        (5, 11, 'U', '20070806', 'Lissabon')]
movements = spark.createDataFrame(vals, columns)
persons.show()
+---+------+----------+
| Id|JoinId| Name|
+---+------+----------+
| 1| 11| FirstName|
| 2| 12|SecondName|
| 3| 13| ThirdName|
+---+------+----------+
movements.show()
+---+------+-------------+--------+-------------+
| Id|JoinId|Specification| Date| Destination|
+---+------+-------------+--------+-------------+
| 1| 10| I|20051205|New York City|
| 2| 11| I|19991112| Berlin|
| 3| 11| O|20030101| Madrid|
| 4| 13| I|20200113| Paris|
| 5| 11| U|20070806| Lissabon|
+---+------+-------------+--------+-------------+
What I want to create is:
+--------+----------+--------+--------+-----------+
|PersonId|PersonName|   IDate|   ODate|Destination|
+--------+----------+--------+--------+-----------+
|       1| FirstName|19991112|20030101|     Berlin|
|       3| ThirdName|20200113|        |      Paris|
+--------+----------+--------+--------+-----------+
The rule, as the sample output shows, would be: join movements to persons via JoinId, take the Date of the movement with Specification 'I' as IDate, the Date of the 'O' movement as ODate, and keep the Destination of the 'I' movement. So far I have the inner join:
joined = persons.withColumnRenamed('JoinId', 'P_JoinId') \
    .join(movements, col('P_JoinId') == movements.JoinId, how='inner')
joined.show()
+---+--------+---------+---+------+-------------+--------+-----------+
| Id|P_JoinId| Name| Id|JoinId|Specification| Date|Destination|
+---+--------+---------+---+------+-------------+--------+-----------+
| 1| 11|FirstName| 2| 11| I|19991112| Berlin|
| 1| 11|FirstName| 3| 11| O|20030101| Madrid|
| 1| 11|FirstName| 5| 11| U|20070806| Lissabon|
| 3| 13|ThirdName| 4| 13| I|20200113| Paris|
+---+--------+---------+---+------+-------------+--------+-----------+
But I am struggling to select the data from those multiple rows and put it into one row according to the given rules. Any help is appreciated.

Note: I have renamed the Id in movements to Id_movements to avoid confusion when grouping later on. You can pivot the joined data on Specification and aggregate the Date and Destination. You then get the Date and Destination per Specification:
import pyspark.sql.functions as F
persons = spark.createDataFrame([(1, 11, 'FirstName'), (2, 12, 'SecondName'),
    (3, 13, 'ThirdName')], schema=['Id', 'JoinId', 'Name'])
movements = spark.createDataFrame([(1, 10, 'I', '20051205', 'New York City'),
    (2, 11, 'I', '19991112', 'Berlin'), (3, 11, 'O', '20030101', 'Madrid'),
    (4, 13, 'I', '20200113', 'Paris'), (5, 11, 'U', '20070806', 'Lissabon')],
    schema=['Id_movements', 'JoinId', 'Specification', 'Date', 'Destination'])
df_joined = persons.withColumnRenamed('JoinId', 'P_JoinId') \
    .join(movements, F.col('P_JoinId') == movements.JoinId, how='inner')
# pivot on Specification: one (date, destination) column pair per I/O/U value
df_pivot = df_joined.groupBy(['Id', 'Name']).pivot('Specification') \
    .agg(F.min('Date').alias('date'), F.min('Destination').alias('destination'))
I have chosen the min aggregation here, but you can pick whichever one you need and drop the irrelevant columns (a renaming sketch follows after the result below).
Result of df_pivot.show():
+---+---------+--------+-------------+--------+-------------+--------+-------------+
| Id| Name| I_date|I_destination| O_date|O_destination| U_date|U_destination|
+---+---------+--------+-------------+--------+-------------+--------+-------------+
| 1|FirstName|19991112| Berlin|20030101| Madrid|20070806| Lissabon|
| 3|ThirdName|20200113| Paris| null| null| null| null|
+---+---------+--------+-------------+--------+-------------+--------+-------------+
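To get from df_pivot to the exact output asked for in the question, a minimal follow-up sketch (my own addition, not part of the original answer; it assumes from the sample output that the 'I' movement's Destination is the one to keep) could select and rename the pivot columns:

# hypothetical follow-up: trim and rename the pivot columns to the desired layout
result = df_pivot.select(
    F.col('Id').alias('PersonId'),
    F.col('Name').alias('PersonName'),
    F.col('I_date').alias('IDate'),
    F.col('O_date').alias('ODate'),
    F.col('I_destination').alias('Destination'))
result.show()

On the sample data this yields the two rows from the question, with null in ODate where no 'O' movement exists.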
"Could you add the desired output?"

"It is mentioned right after 'What I want to create is…' :)"

"Thank you for the quick response, I will look into it! That is interesting; I had never used pivot before. As a next step, how would I condense the three rows for person 1 in the result into one row?"

"I just added the result to the answer. You do not have to condense anything further: through the grouping it is already condensed according to the aggregation you chose. You may just need to do some column dropping and renaming."

"You are welcome. And thank you for structuring the question so well, with all the code needed to reproduce the example."