PySpark DataFrame: auto-increment column
I have a PySpark DataFrame in the following format. Table A:
+----+--------+------+-------------+
| ID | date | type | description |
+----+--------+------+-------------+
| 1 | 201905 | A | descA |
| 2 | 202006 | B | descB |
| 3 | 201503 | C | descC |
| 4 | 201507 | D | descD |
| 5 | 201601 | E | descE |
| 6 | 201809 | F | descF |
| 7 | 201011 | G | descG |
+----+--------+------+-------------+
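I have another table, B, that needs to be appended to table A. Table B has no ID column.
Table B:
Table B must be appended to table A, and the ID column must auto-increment by 1 for each appended row, as shown below.
Output table: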
+----+--------+------+-------------+
| ID | date | type | description |
+----+--------+------+-------------+
| 1 | 201905 | A | descA |
| 2 | 202006 | B | descB |
| 3 | 201503 | C | descC |
| 4 | 201507 | D | descD |
| 5 | 201601 | E | descE |
| 6 | 201809 | F | descF |
| 7 | 201011 | G | descG |
| 8 | 201001 | H | descH |
| 9 | 201507 | I | descI |
| 10 | 201907 | J | descJ |
+----+--------+------+-------------+
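Could you tell me how to do this with PySpark?
Thanks.
You can assign row numbers to table B starting from 1, then add the maximum value of the ID column in table A to each of them. The first row of table B therefore becomes 8 (1 + 7).
For the union, use unionByName, because the column order of the two DataFrames differs: it unions DataFrames by column name rather than by position.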
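from pyspark.sql import functions as F
from pyspark.sql.window import Window
# Largest existing ID in table A (7 in this example)
max_id = df_a.agg({"ID": "max"}).collect()[0][0]
# Number the rows of table B from 1, ordered by date and then type
w = Window.orderBy("date", "type")
# Offset each row number by max_id, then union the two DataFrames by column name
df_a.unionByName(df_b.withColumn("ID", (max_id + F.row_number().over(w)).cast("int"))).show()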
+---+------+----+-----------+
| ID|  date|type|description|
+---+------+----+-----------+
|  1|201905|   A|      descA|
|  2|202006|   B|      descB|
|  3|201503|   C|      descC|
|  4|201507|   D|      descD|
|  5|201601|   E|      descE|
|  6|201809|   F|      descF|
|  7|201011|   G|      descG|
|  8|201001|   H|      descH|
|  9|201507|   I|      descI|
| 10|201907|   J|      descJ|
+---+------+----+-----------+