Python: getting an array of the items between two specific items in PySpark

I'm facing the problem of getting an array of all entries that lie between two specific entries. I have a table that looks roughly like this:

| Type | State       | Domain | Time     |
|------|-------------|--------|----------|
| A    | eating      | Fruit  | 12:33:11 |
| A    | working     | day    | 12:35:12 |
| A    | working     | day    | 12:44:55 |
| A    | sleep       | day    | 12:59:53 |
| A    | enjoying    | Fruit  | 08:12:04 |
| A    | thinking    | day    | 09:16:32 |
| A    | eating      | Fruit  | 10:44:31 |
| A    | daydreaming | day    | 10:44:33 |
| A    | calling     | day    | 10:59:01 |
| B    | wondering   | Fruit  | 10:00:01 |
| B    | digesting   | day    | 10:49:09 |
| B    | cleaning    | day    | 12:00:27 |
| B    | eating      | Fruit  | 04:03:22 |

I would like to get the following output:

| Type | State       | Domain     | Time     | Intermediate Output             | Array Count | Mode Array                 |
|------|-------------|------------|----------|---------------------------------|-------------|----------------------------|
| A    | eating      | Fruit      | 12:33:11 | ['working', 'working', 'sleep'] | 3           | working                    |
| A    | working     | day        | 12:35:12 | None                            | 0           | None                       |
| A    | working     | day        | 12:44:55 | None                            | 0           | None                       |
| A    | sleep       | day        | 12:59:53 | None                            | 0           | None                       |
| A    | enjoying    | Fruit      | 08:12:04 | ['thinking']                    | 1           | thinking                   |
| A    | thinking    | day        | 09:16:32 | None                            | 0           | None                       |
| A    | eating      | Fruit      | 10:44:31 | ['daydreaming', 'calling']      | 2           | ['daydreaming', 'calling'] |
| A    | daydreaming | day        | 10:44:33 | None                            | 0           | None                       |
| A    | calling     | day        | 10:59:01 | None                            | 0           | None                       |
| B    | wondering   | Fruit      | 10:00:01 | ['digesting','cleaning']        | 2           | ['digesting','cleaning']   |
| B    | digesting   | day        | 10:49:09 | None                            | 0           | None                       |
| B    | cleaning    | day        | 12:00:27 | None                            | 0           | None                       |
| B    | eating      | Fruit      | 04:03:22 | []                              | 0           | []                         |
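Essentially, this means partitioning by Type and Domain to work out what lies between two distinct Domain values. The Domain column can only take two values: [Fruit, day].

I basically want to collect all State names into an array, starting from the first row where Domain is Fruit and ending when the next Fruit entry appears. The other two columns should be derived from this intermediate array output: the length of the array and its mode.

Any number of Fruit entries can occur within a Type. The whole dataset is ordered chronologically by the Time column.

Unfortunately, the infrastructure only allows PySpark, so I can't use pandas.

I would really appreciate any help and hints, as I'm a PySpark noob!
Many thanks in advance.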

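I'm using two window functions to solve this. (I'm also assuming that your records are ordered by the "Time" column.)

1. Since your "Domain" can take only two values, I encode "Fruit" as 1 and "day" as 0. We then compute an incremental (running) sum over this new domain column and use it as the key for grouping "State".
2. We need to remove the first element from the output of the `collect_list` function and keep the rest. I'm using the `remove_first_element` UDF to do this.
3. You don't need "array_output" when "Domain" is "day", so replace it with `None` whenever "Domain" is "day".
4. The array count comes from `F.size()`.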
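The grouping idea described above can be tried out without a Spark session. The following is only a plain-Python sketch of the same logic, not the PySpark code itself; the sample `rows` list and the variable names are assumptions made for illustration.

```python
from itertools import accumulate

# Hypothetical sample: (State, Domain) pairs for a single Type, sorted by Time.
rows = [
    ("eating", "Fruit"), ("working", "day"), ("working", "day"),
    ("sleep", "day"), ("enjoying", "Fruit"), ("thinking", "day"),
]

# Encode Fruit as 1 and day as 0, then take the running sum.
# Every Fruit row bumps the sum, so it starts a new group key.
flags = [1 if domain == "Fruit" else 0 for _, domain in rows]
group_keys = list(accumulate(flags))

# Collect the states per group key (the analogue of collect_list over a window),
# then drop the first element, which is the Fruit row's own state.
groups = {}
for (state, _), key in zip(rows, group_keys):
    groups.setdefault(key, []).append(state)
between = {key: states[1:] for key, states in groups.items()}

# Only Fruit rows keep the array; day rows get None and a count of 0.
result = []
for (state, domain), key in zip(rows, group_keys):
    array_output = between[key] if domain == "Fruit" else None
    array_count = len(array_output) if array_output is not None else 0
    result.append((state, array_output, array_count))

for row in result:
    print(row)
```

The running sum is what makes this work: because it only increases on Fruit rows, every day row inherits the key of the Fruit row that precedes it.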

Building the DataFrame:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import col, collect_list
from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType
from pyspark.sql.window import Window

# Reuse the active session (the pyspark shell already provides `spark`).
spark = SparkSession.builder.getOrCreate()

# Time is simplified to an increasing integer so the ordering is unambiguous.
schema = StructType([
    StructField("Type", StringType()), StructField("State", StringType()),
    StructField("Domain", StringType()), StructField("Time", IntegerType()),
])

data = [
    ['A', 'eating', 'Fruit', 1], ['A', 'working', 'day', 2], ['A', 'working', 'day', 3],
    ['A', 'sleep', 'day', 4], ['A', 'enjoying', 'Fruit', 5], ['A', 'thinking', 'day', 6],
    ['A', 'eating', 'Fruit', 7], ['A', 'daydreaming', 'day', 8], ['A', 'calling', 'day', 9],
    ['B', 'wondering', 'Fruit', 10], ['B', 'digesting', 'day', 11], ['B', 'cleaning', 'day', 12],
    ['B', 'eating', 'Fruit', 13],
]

df = spark.createDataFrame(data, schema=schema)
df.show()

The actual operations:

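# Encode the domain: 1 marks a "Fruit" row (start of a group), 0 a "day" row.
df1 = df.withColumn("Domain_num", F.when(col("Domain") == "Fruit", 1).otherwise(0))

# w1 drives the running sum. w2 spans one whole group and is ordered by Time,
# so collect_list returns the states in a deterministic order.
w1 = Window.partitionBy("Type").orderBy("Time")
w2 = Window.partitionBy("Type", "incremental_sum").orderBy("Time")\
        .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

def remove_first_element(values):
    # Drop the leading "Fruit" row's own state and keep the states that follow it.
    return values[1:]

remove_first_element_udf = F.udf(remove_first_element, ArrayType(StringType()))

df1 = df1.withColumn("incremental_sum", F.sum("Domain_num").over(w1))\
        .withColumn("array_output", collect_list(col("State")).over(w2))\
        .withColumn("array_output", remove_first_element_udf(col("array_output")))\
        .withColumn("array_output", F.when(col("Domain_num") == 0, None).otherwise(col("array_output")))\
        .withColumn("array_count", F.size(col("array_output")))\
        .withColumn("array_count", F.when(col("Domain_num") == 0, 0).otherwise(col("array_count")))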

Finding the mode:

from collections import Counter

def get_multi_mode_list(input_array):
    # Return every value tied for the highest frequency (all modes).
    # "day" rows pass None and the last "Fruit" row may pass []; both yield [].
    if not input_array:
        return []
    counts = Counter(input_array)
    highest = max(counts.values())
    # Counter preserves first-seen order, so the modes come back in input order.
    return [value for value, count in counts.items() if count == highest]

get_multi_mode_list_udf = F.udf(get_multi_mode_list, ArrayType(StringType()))

df1 = df1.withColumn("multi_mode", get_multi_mode_list_udf(col("array_output")))\
        .withColumn("multi_mode", F.when(col("Domain_num") == 0, None).otherwise(col("multi_mode")))\
        .drop("Domain_num", "incremental_sum")

Output:

df1.orderBy("Time").show(truncate=False)

+----+-----------+------+----+-------------------------+-----------+----------------------+
|Type|State      |Domain|Time|array_output             |array_count|multi_mode            |
+----+-----------+------+----+-------------------------+-----------+----------------------+
|A   |eating     |Fruit |1   |[working, working, sleep]|3          |[working]             |
|A   |working    |day   |2   |null                     |0          |null                  |
|A   |working    |day   |3   |null                     |0          |null                  |
|A   |sleep      |day   |4   |null                     |0          |null                  |
|A   |enjoying   |Fruit |5   |[thinking]               |1          |[thinking]            |
|A   |thinking   |day   |6   |null                     |0          |null                  |
|A   |eating     |Fruit |7   |[daydreaming, calling]   |2          |[daydreaming, calling]|
|A   |daydreaming|day   |8   |null                     |0          |null                  |
|A   |calling    |day   |9   |null                     |0          |null                  |
|B   |wondering  |Fruit |10  |[digesting, cleaning]    |2          |[digesting, cleaning] |
|B   |digesting  |day   |11  |null                     |0          |null                  |
|B   |cleaning   |day   |12  |null                     |0          |null                  |
|B   |eating     |Fruit |13  |[]                       |0          |[]                    |
+----+-----------+------+----+-------------------------+-----------+----------------------+
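The `multi_mode` column in the output above lists every value tied for the highest frequency. On Python 3.8 or later, the standard library's `statistics.multimode` computes the same thing, so the hand-written counting could be replaced by it. This is a minimal sketch, assuming such a Python version is available on the executors; `modes_in_order` is a hypothetical helper name, not part of the answer.

```python
from statistics import multimode

def modes_in_order(values):
    """Return all most-frequent values in first-seen order; [] for None or empty input."""
    # multimode already returns [] for an empty list; the guard also covers None,
    # which is what the "day" rows would pass in.
    return multimode(values) if values else []

print(modes_in_order(["working", "working", "sleep"]))
print(modes_in_order(["daydreaming", "calling"]))
print(modes_in_order(None))
```

A function like this could be wrapped with `F.udf(..., ArrayType(StringType()))` in the same way as the UDF shown earlier.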


This is awesome! Thank you very, very much! This helped me a lot.

Glad to help! I've added a code snippet for finding the mode.