Python: get an array of the items between two specific items in PySpark
I am facing the problem of getting an array containing all entries that lie between two dedicated entries. I have a table, roughly like this:
| Type | State | Domain | Time |
|------|-------------|--------|----------|
| A | eating | Fruit | 12:33:11 |
| A | working | day | 12:35:12 |
| A | working | day | 12:44:55 |
| A | sleep | day | 12:59:53 |
| A | enjoying | Fruit | 08:12:04 |
| A | thinking | day | 09:16:32 |
| A | eating | Fruit | 10:44:31 |
| A | daydreaming | day | 10:44:33 |
| A | calling | day | 10:59:01 |
| B | wondering | Fruit | 10:00:01 |
| B | digesting | day | 10:49:09 |
| B | cleaning | day | 12:00:27 |
| B | eating | Fruit | 04:03:22 |
I would like to get an output like this:
| Type | State | Domain | Time | Intermediate Output | Array Count | Mode Array |
|------|-------------|------------|----------|---------------------------------|-------------|----------------------------|
| A | eating | Fruit | 12:33:11 | ['working', 'working', 'sleep'] | 3 | working |
| A | working | day | 12:35:12 | None | 0 | None |
| A | working | day | 12:44:55 | None | 0 | None |
| A | sleep | day | 12:59:53 | None | 0 | None |
| A | enjoying | Fruit | 08:12:04 | ['day'] | 1 | day |
| A | thinking | day | 09:16:32 | None | 0 | None |
| A | eating | Fruit | 10:44:31 | ['daydreaming', 'calling'] | 2 | ['daydreaming', 'calling'] |
| A | daydreaming | day | 10:44:33 | None | 0 | None |
| A | calling | day | 10:59:01 | None | 0 | None |
| B | wondering | Fruit | 10:00:01 | ['digesting','cleaning'] | 2 | ['digesting','cleaning'] |
| B | digesting | day | 10:49:09 | None | 0 | None |
| B | cleaning | day | 12:00:27 | None | 0 | None |
| B | eating | Fruit | 04:03:22 | [] | 0 | [] |
Basically, this partitions by Type and Domain to get everything between two distinct Domain values. The Domain column can only take two values: [Fruit, day].

I essentially want to collect all State names into an array, starting from one entry where Domain is Fruit until the next Fruit entry appears. The other two columns should be derived from this intermediate array output: the length of the array and its mode.

Any number of Fruit entries can occur within one Type.

The whole dataset is ordered chronologically by the Time column.

Unfortunately, the infrastructure only allows PySpark, so I cannot use pandas.

I would really appreciate any help and hints, since I am a PySpark noob!
Thanks a lot in advance!

I solved this using two window functions (and assuming your records are ordered by the "Time" column). Since your "Domain" can only take two values, I encode "Fruit" as 1 and "day" as 0, then:

1. Perform an incremental sum over this new column and use it as the grouping key for "State".
2. Remove the first element from the output of the collect_list function and keep the rest; I use a remove_first_element UDF for this.
3. You don't need "array_output" when "Domain" is "day", so replace it with None whenever "Domain" is "day".
4. Use F.size() to get the length of each array.
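To see why the incremental sum works as a grouping key, here is a minimal pure-Python sketch of the same idea (the variable names are mine; plain tuples stand in for the DataFrame rows):

```python
from itertools import accumulate

# Each row: (State, Domain). Encode "Fruit" as 1 and "day" as 0, as in the answer.
rows = [("eating", "Fruit"), ("working", "day"), ("working", "day"),
        ("sleep", "day"), ("enjoying", "Fruit"), ("thinking", "day")]
flags = [1 if domain == "Fruit" else 0 for _, domain in rows]

# Running sum: the key increases exactly at every "Fruit" row,
# so each Fruit-to-Fruit segment shares one key.
group_keys = list(accumulate(flags))
print(group_keys)  # [1, 1, 1, 1, 2, 2]

# Grouping by that key mimics collect_list per segment; dropping the
# first element leaves only the States *between* the Fruit rows.
groups = {}
for (state, _), key in zip(rows, group_keys):
    groups.setdefault(key, []).append(state)
between = {key: states[1:] for key, states in groups.items()}
print(between)  # {1: ['working', 'working', 'sleep'], 2: ['thinking']}
```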
Building the DataFrame (times are simplified to integers here so that ordering by "Time" is unambiguous):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType
from pyspark.sql.window import Window
from pyspark.sql import functions as F
from pyspark.sql.functions import col, collect_list

schema = StructType([StructField("Type", StringType()), StructField("State", StringType()),
                     StructField("Domain", StringType()), StructField("Time", IntegerType())])
data = [['A', 'eating', 'Fruit', 1], ['A', 'working', 'day', 2], ['A', 'working', 'day', 3], ['A', 'sleep', 'day', 4], ['A', 'enjoying', 'Fruit', 5], ['A', 'thinking', 'day', 6], ['A', 'eating', 'Fruit', 7], ['A', 'daydreaming', 'day', 8], ['A', 'calling', 'day', 9], ['B', 'wondering', 'Fruit', 10], ['B', 'digesting', 'day', 11], ['B', 'cleaning', 'day', 12], ['B', 'eating', 'Fruit', 13]]
df = spark.createDataFrame(data, schema=schema)
df.show()
The actual operation:

df1 = df.withColumn("Domain_num", F.when(col("Domain") == "Fruit", 1).otherwise(0))

w1 = Window().partitionBy("Type").orderBy("Time")
w2 = Window().partitionBy("Type", "incremental_sum")

def remove_first_element(arr):
    # The first element of each window's collect_list is the "Fruit" row itself.
    return arr[1:]

remove_first_element_udf = F.udf(remove_first_element, ArrayType(StringType()))

df1 = df1.withColumn("incremental_sum", F.sum("Domain_num").over(w1))\
    .withColumn("array_output", F.collect_list(col("State")).over(w2))\
    .withColumn("array_output", remove_first_element_udf(col("array_output")))\
    .withColumn("array_output", F.when(col("Domain_num") == 0, None).otherwise(col("array_output")))\
    .withColumn("array_count", F.size(col("array_output")))\
    .withColumn("array_count", F.when(col("Domain_num") == 0, 0).otherwise(col("array_count")))
Finding the mode:

from collections import Counter

def get_multi_mode_list(input_array):
    # Collect every element whose count equals the highest count (handles ties).
    multi_mode = []
    counter_var = Counter(input_array)
    try:
        temp = counter_var.most_common(1)[0][1]
    except IndexError:  # empty array: no mode
        temp = 0
    for i in counter_var:
        if input_array.count(i) == temp:
            multi_mode.append(i)
    return list(set(multi_mode))

get_multi_mode_list_udf = F.udf(get_multi_mode_list, ArrayType(StringType()))

df1 = df1.withColumn("multi_mode", get_multi_mode_list_udf(col("array_output")))\
    .withColumn("multi_mode", F.when(col("Domain_num") == 0, None).otherwise(col("multi_mode")))\
    .drop("Domain_num", "incremental_sum")
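Before wrapping it in a UDF, the mode helper can be sanity-checked in plain Python; note that when several States tie for the highest count, it returns all of them, in no guaranteed order, since the result passes through a set. This standalone sketch restates the helper:

```python
from collections import Counter

def get_multi_mode_list(input_array):
    # Same helper as above, repeated so this snippet runs on its own.
    multi_mode = []
    counter_var = Counter(input_array)
    try:
        temp = counter_var.most_common(1)[0][1]
    except IndexError:  # empty input: nothing to count
        temp = 0
    for i in counter_var:
        if input_array.count(i) == temp:
            multi_mode.append(i)
    return list(set(multi_mode))

print(sorted(get_multi_mode_list(["working", "working", "sleep"])))  # ['working']
print(sorted(get_multi_mode_list(["daydreaming", "calling"])))       # ['calling', 'daydreaming']
print(get_multi_mode_list([]))                                       # []
```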
Output:
df1.orderBy("Time").show(truncate=False)
+----+-----------+------+----+-------------------------+-----------+----------------------+
|Type|State |Domain|Time|array_output |array_count|multi_mode |
+----+-----------+------+----+-------------------------+-----------+----------------------+
|A |eating |Fruit |1 |[working, working, sleep]|3 |[working] |
|A |working |day |2 |null |0 |null |
|A |working |day |3 |null |0 |null |
|A |sleep |day |4 |null |0 |null |
|A |enjoying |Fruit |5 |[thinking] |1 |[thinking] |
|A |thinking |day |6 |null |0 |null |
|A |eating |Fruit |7 |[daydreaming, calling] |2 |[daydreaming, calling]|
|A |daydreaming|day |8 |null |0 |null |
|A |calling |day |9 |null |0 |null |
|B |wondering |Fruit |10 |[digesting, cleaning] |2 |[digesting, cleaning] |
|B |digesting |day |11 |null |0 |null |
|B |cleaning |day |12 |null |0 |null |
|B |eating |Fruit |13 |[] |0 |[] |
+----+-----------+------+----+-------------------------+-----------+----------------------+
This is awesome! Thank you very, very much, this helped me a lot.

Glad to help! I have added the code snippet for finding the mode.