Auto-incrementing an ID column in a PySpark dataframe


I have a PySpark dataframe in the following format.

Table A:

+----+--------+------+-------------+
| ID |  date  | type | description |
+----+--------+------+-------------+
|  1 | 201905 | A    | descA       |
|  2 | 202006 | B    | descB       |
|  3 | 201503 | C    | descC       |
|  4 | 201507 | D    | descD       |
|  5 | 201601 | E    | descE       |
|  6 | 201809 | F    | descF       |
|  7 | 201011 | G    | descG       |
+----+--------+------+-------------+
I have another table, Table B, which needs to be appended to Table A. This table has no ID column:

Table B:

+--------+------+-------------+
|  date  | type | description |
+--------+------+-------------+
| 201001 | H    | descH       |
| 201507 | I    | descI       |
| 201907 | J    | descJ       |
+--------+------+-------------+

Table B needs to be appended to Table A, and the ID column must auto-increment by 1 for each appended row, as shown below.

Output table:

+----+--------+------+-------------+
| ID |  date  | type | description |
+----+--------+------+-------------+
|  1 | 201905 | A    | descA       |
|  2 | 202006 | B    | descB       |
|  3 | 201503 | C    | descC       |
|  4 | 201507 | D    | descD       |
|  5 | 201601 | E    | descE       |
|  6 | 201809 | F    | descF       |
|  7 | 201011 | G    | descG       |
|  8 | 201001 | H    | descH       |
|  9 | 201507 | I    | descI       |
| 10 | 201907 | J    | descJ       |
+----+--------+------+-------------+
Could you tell me how to accomplish this in PySpark?


Thanks.

You can assign row numbers to Table B starting from 1, and then add the maximum value of Table A's ID column to them. The first row of Table B therefore becomes 8 (1 + 7).

For the union, use unionByName, since the column order may differ: it unions the dataframes by column name rather than by position.


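A minimal, self-contained sketch first recreates the two tables as dataframes; the names df_a and df_b, and the string typing of the date column, are assumptions for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Table A: already carries an ID column
df_a = spark.createDataFrame(
    [(1, "201905", "A", "descA"), (2, "202006", "B", "descB"),
     (3, "201503", "C", "descC"), (4, "201507", "D", "descD"),
     (5, "201601", "E", "descE"), (6, "201809", "F", "descF"),
     (7, "201011", "G", "descG")],
    ["ID", "date", "type", "description"],
)

# Table B: same columns, but no ID
df_b = spark.createDataFrame(
    [("201001", "H", "descH"), ("201507", "I", "descI"),
     ("201907", "J", "descJ")],
    ["date", "type", "description"],
)

With df_a and df_b in place, the append looks like this:
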
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Highest existing ID in Table A (7 in this example)
max_id = df_a.agg(F.max("ID")).collect()[0][0]

# Number Table B's rows 1, 2, 3, ... ordered by date and type
w = Window.orderBy("date", "type")

# Shift the row numbers past max_id and union by column name
df_a.unionByName(
    df_b.withColumn("ID", (max_id + F.row_number().over(w)).cast("int"))
).show()

+---+------+----+-----------+
| ID|  date|type|description|
+---+------+----+-----------+
|  1|201905|   A|      descA|
|  2|202006|   B|      descB|
|  3|201503|   C|      descC|
|  4|201507|   D|      descD|
|  5|201601|   E|      descE|
|  6|201809|   F|      descF|
|  7|201011|   G|      descG|
|  8|201001|   H|      descH|
|  9|201507|   I|      descI|
| 10|201907|   J|      descJ|
+---+------+----+-----------+
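
One caveat with this approach: a window with no partitionBy pulls all of Table B's rows into a single partition to compute row_number, which is harmless here but can become a bottleneck for large appends. As an alternative sketch (under the same df_a, df_b, and max_id assumptions as above), zipWithIndex on the underlying RDD assigns indexes without a window:

# zipWithIndex pairs each row with a 0-based index, so add 1
with_idx = df_b.rdd.zipWithIndex()
df_b_with_id = with_idx.map(
    lambda pair: (max_id + pair[1] + 1, *pair[0])
).toDF(["ID", "date", "type", "description"])

df_a.unionByName(df_b_with_id).show()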