Python: Reshape a pyspark dataframe to show a moving window of item interactions


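I have a large pyspark dataframe of subject-item interactions in long format: each row describes a subject interacting with an item of interest, together with the timestamp of that interaction and its rank order (the first interaction is 1, the second is 2, and so on). Here are a few rows: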

+----------+---------+----------------------+--------------------+
|      date|itemId   |interaction_date_order|              userId|
+----------+---------+----------------------+--------------------+
|2019-07-23| 10005880|                     1|37                  |
|2019-07-23| 10005903|                     2|37                  |
|2019-07-23| 10005903|                     3|37                  |
|2019-07-23| 12458442|                     4|37                  |
|2019-07-26| 10005903|                     5|37                  |
|2019-07-26| 12632813|                     6|37                  |
|2019-07-26| 12632813|                     7|37                  |
|2019-07-26| 12634497|                     8|37                  |
|2018-11-24| 12245677|                     1|5                   |
|2018-11-24| 12245677|                     1|5                   |
|2019-07-29| 12541871|                     2|5                   |
|2019-07-29| 12541871|                     3|5                   |
|2019-07-30| 12626854|                     4|5                   |
|2019-08-31| 12776880|                     5|5                   |
|2019-08-31| 12776880|                     6|5                   |
+----------+---------+----------------------+--------------------+

I need to reshape this data so that, for each subject, each row holds a moving window of 5 consecutive interactions. Something like this:

+------+--------+--------+--------+--------+--------+
|userId| i-2    |  i-1   |   i    |    i+1 |     i+2|
+------+--------+--------+--------+--------+--------+
|37    |10005880|10005903|10005903|12458442|10005903|
|37    |10005903|10005903|12458442|10005903|12632813|

Does anyone have a suggestion for how I could do this?

Import Spark and everything else you need

from pyspark.sql import Row, SparkSession

# Start (or reuse) a local Spark session
spark = SparkSession.builder.master('local').getOrCreate()

Create your dataframe

columns = '|      date|itemId   |interaction_date_order|              userId|'.split('|')
lines = '''2019-07-23| 10005880|                     1|37                  |
2019-07-23| 10005903|                     2|37                  |
2019-07-23| 10005903|                     3|37                  |
2019-07-23| 12458442|                     4|37                  |
2019-07-26| 10005903|                     5|37                  |
2019-07-26| 12632813|                     6|37                  |
2019-07-26| 12632813|                     7|37                  |
2019-07-26| 12634497|                     8|37                  |
2018-11-24| 12245677|                     1|5                   |
2018-11-24| 12245677|                     2|5                   |
2019-07-29| 12541871|                     3|5                   |
2019-07-29| 12541871|                     4|5                   |
2019-07-30| 12626854|                     5|5                   |
2019-08-31| 12776880|                     6|5                   |
2019-08-31| 12776880|                     7|5                   |'''

# Parse each pipe-delimited line into a Row; int() tolerates the padding spaces
Interaction = Row("date", "itemId", "interaction_date_order", "userId")
interactions = []
for line in lines.split('\n'):
    column_values = line.split('|')
    interaction = Interaction(column_values[0], int(column_values[1]), int(column_values[2]), int(column_values[3]))
    interactions.append(interaction)

df = spark.createDataFrame(interactions)

Now we have

df.show()

+----------+--------+----------------------+------+
|      date|  itemId|interaction_date_order|userId|
+----------+--------+----------------------+------+
|2019-07-23|10005880|                     1|    37|
|2019-07-23|10005903|                     2|    37|
|2019-07-23|10005903|                     3|    37|
|2019-07-23|12458442|                     4|    37|
|2019-07-26|10005903|                     5|    37|
|2019-07-26|12632813|                     6|    37|
|2019-07-26|12632813|                     7|    37|
|2019-07-26|12634497|                     8|    37|
|2018-11-24|12245677|                     1|     5|
|2018-11-24|12245677|                     2|     5|
|2019-07-29|12541871|                     3|     5|
|2019-07-29|12541871|                     4|     5|
|2019-07-30|12626854|                     5|     5|
|2019-08-31|12776880|                     6|     5|
|2019-08-31|12776880|                     7|     5|
+----------+--------+----------------------+------+

Create a window and collect the itemId values together with their count

from pyspark.sql.window import Window
import pyspark.sql.functions as F

# Per user, ordered by interaction rank: the frame is the current row plus the next 4 rows
window = (Window
          .partitionBy('userId')
          .orderBy('interaction_date_order')
          .rowsBetween(Window.currentRow, 4))

# Collect the itemIds in each frame and count how many rows the frame holds
df2 = df.withColumn("itemId_list", F.collect_list('itemId').over(window))
df2 = df2.withColumn("itemId_count", F.count('itemId').over(window))
# Keep only complete windows of 5 interactions
df_final = df2.where(df2['itemId_count'] == 5)
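
To make the frame semantics concrete, here is a minimal plain-Python sketch (no Spark required) of what the window above computes for a single user. The helper name `forward_windows` is hypothetical and exists only for illustration; it assumes the items are already sorted by interaction_date_order.

```python
def forward_windows(items, size=5):
    """Mimic a frame of (current row .. current row + size - 1) plus the count filter."""
    windows = []
    for start in range(len(items)):
        frame = items[start:start + size]  # what collect_list gathers for this row
        if len(frame) == size:             # keep only complete frames (itemId_count == 5)
            windows.append(frame)
    return windows


# Items for userId 37, in interaction_date_order
user_37 = [10005880, 10005903, 10005903, 12458442,
           10005903, 12632813, 12632813, 12634497]

for w in forward_windows(user_37):
    print(w)
```

Frames that start too close to the end of a user's history contain fewer than 5 items, which is why the count filter is needed.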

Now we have

df_final.show()
+----------+--------+----------------------+------+--------------------+------------+
|      date|  itemId|interaction_date_order|userId|         itemId_list|itemId_count|
+----------+--------+----------------------+------+--------------------+------------+
|2018-11-24|12245677|                     1|     5|[12245677, 122456...|           5|
|2018-11-24|12245677|                     2|     5|[12245677, 125418...|           5|
|2019-07-29|12541871|                     3|     5|[12541871, 125418...|           5|
|2019-07-23|10005880|                     1|    37|[10005880, 100059...|           5|
|2019-07-23|10005903|                     2|    37|[10005903, 100059...|           5|
|2019-07-23|10005903|                     3|    37|[10005903, 124584...|           5|
|2019-07-23|12458442|                     4|    37|[12458442, 100059...|           5|
+----------+--------+----------------------+------+--------------------+------------+

The final step

df_final2 = (df_final
             .withColumn('i-2', df_final['itemId_list'][0])
             .withColumn('i-1', df_final['itemId_list'][1])
             .withColumn('i', df_final['itemId_list'][2])
             .withColumn('i+1', df_final['itemId_list'][3])
             .withColumn('i+2', df_final['itemId_list'][4])
             .select('userId', 'i-2', 'i-1', 'i', 'i+1', 'i+2')
            )
df_final2.show()
+------+--------+--------+--------+--------+--------+
|userId|     i-2|     i-1|       i|     i+1|     i+2|
+------+--------+--------+--------+--------+--------+
|     5|12245677|12245677|12541871|12541871|12626854|
|     5|12245677|12541871|12541871|12626854|12776880|
|     5|12541871|12541871|12626854|12776880|12776880|
|    37|10005880|10005903|10005903|12458442|10005903|
|    37|10005903|10005903|12458442|10005903|12632813|
|    37|10005903|12458442|10005903|12632813|12632813|
|    37|12458442|10005903|12632813|12632813|12634497|
+------+--------+--------+--------+--------+--------+
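
As a quick sanity check on the row counts: a user with n interactions yields n - 4 complete windows of length 5, or none if n < 5. The sketch below uses a hypothetical helper, `expected_rows`, to check this against the table above, where userId 37 has 8 interactions and userId 5 has 7.

```python
def expected_rows(n_interactions, size=5):
    """Number of complete sliding windows of the given size."""
    return max(n_interactions - size + 1, 0)


# Interaction counts taken from the sample dataframe
interaction_counts = {37: 8, 5: 7}

for user_id, n in interaction_counts.items():
    print(user_id, expected_rows(n))
```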

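After reshaping, how many rows do you expect for userId 37? 4?

Yes, exactly!

Then the answer below should be correct. Please mark it as accepted and upvote it :)

Yes, checking it now! I had to start my cluster first!

Great answer! Thanks for the help and for walking through the solution step by step!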