Python groupby and create a new column in a PySpark dataframe
I have a PySpark dataframe like this:
+----------+--------+
|id_       | p      |
+----------+--------+
| 1        | A      |
| 1        | B      |
| 1        | B      |
| 1        | A      |
| 1        | A      |
| 1        | B      |
| 2        | C      |
| 2        | C      |
| 2        | C      |
| 2        | A      |
| 2        | A      |
| 2        | C      |
+----------+--------+
I want to create another column for each group of id_. Currently I build the column with pandas using this code:

sample.groupby(by=['id_'], group_keys=False).apply(lambda grp: grp['p'].ne(grp['p'].shift()).cumsum())

How can I do this on a PySpark dataframe? At the moment I am doing it with a pandas UDF, which runs very slowly. Is there an alternative?

The expected column will look like this:
+----------+--------+---------+
|id_       | p      | new_col |
+----------+--------+---------+
| 1        | A      | 1       |
| 1        | B      | 2       |
| 1        | B      | 2       |
| 1        | A      | 3       |
| 1        | A      | 3       |
| 1        | B      | 4       |
| 2        | C      | 1       |
| 2        | C      | 1       |
| 2        | C      | 1       |
| 2        | A      | 2       |
| 2        | A      | 2       |
| 2        | C      | 3       |
+----------+--------+---------+
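For reference, the pandas expression above can be verified on a small in-memory copy of the data; this sketch simply reproduces the question's groupby/shift/cumsum logic (the `sample` dataframe is built here for illustration):

```python
import pandas as pd

# Sample data matching the question's dataframe (columns id_ and p)
sample = pd.DataFrame({
    "id_": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    "p":   ["A", "B", "B", "A", "A", "B", "C", "C", "C", "A", "A", "C"],
})

# Within each id_ group, flag rows where p differs from the previous row
# (shift gives the lagged value), then take a cumulative sum of the flags;
# the counter restarts at 1 for every group because shift starts with NaN.
new_col = sample.groupby(by=["id_"], group_keys=False).apply(
    lambda grp: grp["p"].ne(grp["p"].shift()).cumsum()
)
print(new_col.tolist())  # [1, 2, 2, 3, 3, 4, 1, 1, 1, 2, 2, 3]
```

This matches the expected column shown above.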
You can combine a UDF and window functions to achieve your result:
# required imports
from pyspark.sql.window import Window
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType

# define a window, which we will use to calculate lag values
w = Window().partitionBy().orderBy(F.col('id_'))

# define a user defined function (udf) to perform the calculation on each row:
# emit 1 when the value changed relative to the previous row, else 0
def f(lag_val, current_val):
    if lag_val != current_val:
        return 1
    return 0

# register the udf so we can use it with our dataframe
func_udf = F.udf(f, IntegerType())

# read csv file
df = spark.read.csv('/path/to/file.csv', header=True)

# create a new column with lag over the window we created earlier, apply the
# udf to the lagged and current values, then apply a window function again to
# calculate the cumulative sum per id_
df = df.withColumn("new_column", func_udf(F.lag("p").over(w), df['p']))
df = df.withColumn(
    'cumsum',
    F.sum('new_column').over(
        w.partitionBy(F.col('id_')).rowsBetween(Window.unboundedPreceding, 0)
    )
)
df.show()
+---+---+----------+------+
|id_| p|new_column|cumsum|
+---+---+----------+------+
| 1| A| 1| 1|
| 1| B| 1| 2|
| 1| B| 0| 2|
| 1| A| 1| 3|
| 1| A| 0| 3|
| 1| B| 1| 4|
| 2| C| 1| 1|
| 2| C| 0| 1|
| 2| C| 0| 1|
| 2| A| 1| 2|
| 2| A| 0| 2|
| 2| C| 1| 3|
+---+---+----------+------+
# where:
# w.partitionBy : to partition by id_ column
# w.rowsBetween : to specify frame boundaries
# ref https://spark.apache.org/docs/2.2.1/api/java/org/apache/spark/sql/expressions/Window.html#rowsBetween-long-long-
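As a side note, the lag/flag/running-sum logic that the window expression computes can be checked in plain Python; this is only a sketch of the algorithm, and `change_counter` is a hypothetical helper, not part of any Spark API:

```python
from itertools import groupby

def change_counter(rows):
    """Hypothetical helper mimicking the window logic: within each id_
    partition (rows assumed pre-sorted by id_), increment a counter
    whenever p differs from the previous row's p."""
    out = []
    for _, part in groupby(rows, key=lambda r: r[0]):  # partition by id_
        prev, count = None, 0
        for id_, p in part:
            if p != prev:      # lag(p) != p -> a new run starts
                count += 1
            prev = p
            out.append((id_, p, count))
    return out

rows = [(1, "A"), (1, "B"), (1, "B"), (1, "A"), (1, "A"), (1, "B"),
        (2, "C"), (2, "C"), (2, "C"), (2, "A"), (2, "A"), (2, "C")]
print([c for _, _, c in change_counter(rows)])
# [1, 2, 2, 3, 3, 4, 1, 1, 1, 2, 2, 3]
```

In PySpark itself the same flag column can likely be built without a Python UDF, e.g. by comparing `F.lag('p').over(w)` to `F.col('p')` with `F.when(...)` and summing the result, which avoids the UDF serialization overhead.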
Comments:
- Have you tried pyspark.sql.udf?
- I couldn't get the desired result with it; I can't compare the lag value with the current value. Maybe I'm missing something.
- Could you share the expected result?
- I almost posted the same answer, until I saw you had already posted it. Works as expected.
- Could you point me to resources on window functions besides the documentation?
- Can you suggest a solution without a UDF?