Create another column to check for different values in PySpark


I would like to get the following expected output:

   id  values  sign  numbering
0   0      10     1          1
1   1       5     1          1
2   2       3     1          1
3   3      -1    -1          2
4   4       0     0          3
5   5     -10    -1          4
6   6      -4    -1          4
7   7      10     1          5
8   8       0     0          6
9   9      10     1          7

My code:

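import numpy as np
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType

# assumes an active SparkSession named `spark`
pd_dataframe = pd.DataFrame({'id': [i for i in range(10)],
                             'values': [10,5,3,-1,0,-10,-4,10,0,10]})

sp_dataframe = spark.createDataFrame(pd_dataframe)
# UDF that maps each value to its sign (-1, 0 or 1)
sign_acc_row = F.udf(lambda x: int(np.sign(x)), IntegerType())
sp_dataframe = sp_dataframe.withColumn('sign', sign_acc_row('values'))
sp_dataframe.show()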

I want to create another column that increases by 1 whenever the sign value differs from the one in the previous row.


Here is one way to do it with a custom function:

import pandas as pd
import pyspark.sql.functions as F

# compare each value with the previous one; the counter increases on every change
def f(x):
    c = 1
    l = [c]
    last_value = [x[0]]
    for i in x[1:]:
        if i == last_value[-1]:
            l.append(c)
        else:
            c += 1
            l.append(c)
        last_value.append(i)
    return l

# take sign column as a list
sign_list = sp_dataframe.select('sign').rdd.map(lambda x: x.sign).collect()

# create a new dataframe using the output
sp = spark.createDataFrame(pd.DataFrame(f(sign_list), columns=['numbering']))
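
The numbering logic in f can also be expressed with the standard library. The following is only a sketch of the same idea in plain Python, using itertools.groupby to split the list into runs of equal consecutive values; the name number_runs is hypothetical and does not appear in the answer.

```python
from itertools import groupby


def number_runs(values):
    """Return 1-based run numbers for a list.

    The counter increases whenever a value differs from the one before it.
    An empty input yields an empty list.
    """
    numbering = []
    # groupby yields one group per run of equal consecutive values
    for run_number, (_, run) in enumerate(groupby(values), start=1):
        numbering.extend(run_number for _ in run)
    return numbering


signs = [1, 1, 1, -1, 0, -1, -1, 1, 0, 1]
print(number_runs(signs))  # [1, 1, 1, 2, 3, 4, 4, 5, 6, 7]
```

Unlike f, this version does not index x[0], so it also accepts an empty list.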

In PySpark, attaching a list to a DataFrame as a column is a bit tricky. To do it, we need to create a dummy row_idx column to join the DataFrames on:

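# create dummy indexes
# note: monotonically_increasing_id() guarantees unique, increasing ids, not consecutive ones;
# the two row_idx columns only line up if both DataFrames are partitioned the same way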
sp_dataframe = sp_dataframe.withColumn("row_idx", F.monotonically_increasing_id())
sp = sp.withColumn("row_idx", F.monotonically_increasing_id())

# join the dataframes
final_df = (sp_dataframe
            .join(sp, sp_dataframe.row_idx == sp.row_idx)
            .orderBy('id')
            .drop("row_idx"))

final_df.show()

+---+------+----+---------+
| id|values|sign|numbering|
+---+------+----+---------+
|  0|    10|   1|        1|
|  1|     5|   1|        1|
|  2|     3|   1|        1|
|  3|    -1|  -1|        2|
|  4|     0|   0|        3|
|  5|   -10|  -1|        4|
|  6|    -4|  -1|        4|
|  7|    10|   1|        5|
|  8|     0|   0|        6|
|  9|    10|   1|        7|
+---+------+----+---------+

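Thanks a lot. I will mark it as the answer once I have gone through it and understood it.

Hi, could you explain why we need to create the dummy row_idx?

Because we want to attach the new column side by side to sp_dataframe. To do that, we create an index (or key) in both DataFrames so that we can join on those keys. It is similar to creating a new id column in the sp DataFrame and joining on that column.

So "row_idx" serves as the index, and the code "sp_dataframe.row_idx == sp.row_idx" is there for that purpose, right? And since you drop it at the end, only the numbering column is left from "sp", right?

I am curious, can't we just use F.udf directly and assign the function? Something like -- increasing_number = F.udf(lambda x: int(f(x)), IntegerType()) -- sp_dataframe = sp_dataframe.withColumn('numbering', increasing_number('sign')) --

That was my first attempt, but it turns out that a udf takes each value of the column as input rather than the whole column at once. In your case the input to the function is a list, whereas the udf would receive an integer (each value in the sign column) as input.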
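
The difference described in the last comment can be illustrated without Spark. The following is only a plain-Python analogy: it models a row-wise udf as a function that is called once per value, which is an illustration and not how Spark is implemented. The names number_runs and per_row are hypothetical.

```python
def number_runs(column):
    """Whole-column function: can compare each value with the previous one."""
    numbering = []
    counter = 0
    previous = object()  # sentinel equal to nothing, so the first value starts run 1
    for value in column:
        if value != previous:
            counter += 1
        numbering.append(counter)
        previous = value
    return numbering


def per_row(value):
    """Row-wise function: sees a single value and nothing about its neighbours."""
    return number_runs([value])[0]


signs = [1, 1, 1, -1, 0, -1, -1, 1, 0, 1]

whole_column = number_runs(signs)          # called once with the full list
row_by_row = [per_row(v) for v in signs]   # called once per value

print(whole_column)  # [1, 1, 1, 2, 3, 4, 4, 5, 6, 7]
print(row_by_row)    # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```

Because the row-wise call never sees the previous value, every row looks like the start of a new sequence, which is why the numbering cannot be produced by a plain per-value function.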