Sum over an array of dicts depending on a value condition in PySpark (Spark Structured Streaming)


I have the following schema:

        tick_by_tick_schema = StructType([
            StructField('localSymbol', StringType()),
            StructField('time', StringType()),
            StructField('open', StringType()),
            StructField('previous_price', StringType()),
            StructField('tickByTicks', ArrayType(StructType([
                StructField('price', StringType()),
                StructField('size', StringType()),
                StructField('specialConditions', StringType()),
            ])))
        ])
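For context, here is a minimal sketch of how a streaming dataframe with this schema could be parsed from a Kafka source with from_json. The broker address, topic name, and the variable name kafka_df_structured_with_tick_by_tick_data_values are assumptions for illustration only:

        import pyspark.sql.functions as f

        # assumed Kafka source; broker address and topic name are placeholders
        kafka_df = (
            spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("subscribe", "tick_by_tick")
            .load()
        )

        # parse the JSON payload in the Kafka value column into the schema above
        kafka_df_structured_with_tick_by_tick_data_values = (
            kafka_df
            .select(f.from_json(f.col("value").cast("string"), tick_by_tick_schema).alias("data"))
            .select("data.*")
        )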
I have the following dataframe (in Spark Structured Streaming):

I want to create two columns based on the following logic:

Column_low: WHEN tickByTicks.price < previous_price THEN sum(tickByTicks.size)
Column_high: WHEN tickByTicks.price > previous_price THEN sum(tickByTicks.size)

For example, with previous_price = 213.76 and tickByTicks = [(213.75, 100), (213.78, 100), (213.78, 200)], Column_low should be 100 and Column_high should be 300.
I have also tried something along these lines, but it does not produce the expected result:

        tick_by_tick_data_processed = kafka_df_structured_with_tick_by_tick_data_values.select(
            f.col('localSymbol'),
            f.col('time'),
            f.col('previous_price'),
            f.col('tickByTicks'),
            f.expr("aggregate(filter(tickByTicks.size, x -> x > previous_price), 0D, (x, acc) -> acc + x)")
        ).show(30,False)

I can't test my solution, but I think something like this might work:

tick_by_tick_data_processed = kafka_df_structured_with_tick_by_tick_data_values.select(
    f.col('localSymbol'),
    f.col('time'),
    f.col('previous_price'),
    f.col('tickByTicks'),
    f.expr("aggregate(tickByTicks, 0D, (acc, tick) -> IF(tick.price < previous_price, acc + tick.size, acc))").alias("Column_low"),
    f.expr("aggregate(tickByTicks, 0D, (acc, tick) -> IF(tick.price > previous_price, acc + tick.size, acc))").alias("Column_high")
)
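Along the same lines, an untested variant that filters the array first and then sums it (with explicit casts added here, since price and size are strings in this schema) could look like:

tick_by_tick_data_processed = kafka_df_structured_with_tick_by_tick_data_values.select(
    f.col('localSymbol'),
    f.col('time'),
    f.col('previous_price'),
    f.col('tickByTicks'),
    # keep only ticks priced below previous_price, then add up their sizes
    f.expr("""aggregate(
                filter(tickByTicks, t -> CAST(t.price AS DOUBLE) < CAST(previous_price AS DOUBLE)),
                0D,
                (acc, t) -> acc + CAST(t.size AS DOUBLE))""").alias("Column_low"),
    # same idea for ticks priced above previous_price
    f.expr("""aggregate(
                filter(tickByTicks, t -> CAST(t.price AS DOUBLE) > CAST(previous_price AS DOUBLE)),
                0D,
                (acc, t) -> acc + CAST(t.size AS DOUBLE))""").alias("Column_high")
)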

This solution uses the explode and sum functions:

from pyspark.sql.window import Window
import pyspark.sql.functions as f
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.functions import explode

data = [
           ("BABA", "2021-06-10 19:25:38.154245+00:00" ,"213.76" ,[("213.75", "100")] ),
           ("BABA", "2021-06-10 19:25:38.155229+00:00" ,"213.76" ,[("213.75", "100"),("213.78", "100"),("213.78", "200")] ),
           ("BABA", "2021-06-10 19:25:39.662033+00:00" ,"213.73" ,[("213.72", "100")]  ),
           ("BABA", "2021-06-10 19:25:39.662655+00:00" ,"213.72" ,[("213.72", "100"),("213.73", "100")]  ),
            ]

tick_by_tick_schema = StructType([
    StructField('localSymbol', StringType()),
    StructField('time', StringType()),
    StructField('previous_price', StringType()),
    StructField('tickByTicks', ArrayType(StructType([
        StructField('price', StringType()),
        StructField('size', StringType())
    ])))
])

df = spark.createDataFrame(data=data, schema=tick_by_tick_schema)
df = df.withColumn("idx", monotonically_increasing_id())
df = df.withColumn("col3", explode(df.tickByTicks))
df.createOrReplaceTempView("calc")
spark.sql("""
    select localSymbol, time, previous_price, idx, tickByTicks,
           sum(case when col3.price < previous_price then col3.size else 0 end) as Column_low,
           sum(case when col3.price > previous_price then col3.size else 0 end) as Column_high
    from calc
    group by localSymbol, time, previous_price, idx, tickByTicks
""").drop("idx").show(truncate=0)
+-----------+--------------------------------+--------------+---------------------------------------------+----------+-----------+
|localSymbol|time                            |previous_price|tickByTicks                                  |Column_low|Column_high|
+-----------+--------------------------------+--------------+---------------------------------------------+----------+-----------+
|BABA       |2021-06-10 19:25:39.662033+00:00|213.73        |[[213.72, 100]]                              |100.0     |0.0        |
|BABA       |2021-06-10 19:25:38.154245+00:00|213.76        |[[213.75, 100]]                              |100.0     |0.0        |
|BABA       |2021-06-10 19:25:39.662655+00:00|213.72        |[[213.72, 100], [213.73, 100]]               |0.0       |100.0      |
|BABA       |2021-06-10 19:25:38.155229+00:00|213.76        |[[213.75, 100], [213.78, 100], [213.78, 200]]|100.0     |300.0      |
+-----------+--------------------------------+--------------+---------------------------------------------+----------+-----------+
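For completeness, a DataFrame-API sketch of the same explode/group-by approach, using the data and tick_by_tick_schema defined above; the casts to double are added here because price and size are strings in the schema, and the column names simply follow the example:

import pyspark.sql.functions as F

# rebuild the base dataframe so the earlier explode into "col3" does not interfere
base = spark.createDataFrame(data=data, schema=tick_by_tick_schema)
base = base.withColumn("idx", F.monotonically_increasing_id())

result = (
    base.withColumn("tick", F.explode("tickByTicks"))
    .groupBy("localSymbol", "time", "previous_price", "idx", "tickByTicks")
    .agg(
        # total size of ticks priced below previous_price
        F.sum(F.when(F.col("tick.price").cast("double") < F.col("previous_price").cast("double"),
                     F.col("tick.size").cast("double")).otherwise(0.0)).alias("Column_low"),
        # total size of ticks priced above previous_price
        F.sum(F.when(F.col("tick.price").cast("double") > F.col("previous_price").cast("double"),
                     F.col("tick.size").cast("double")).otherwise(0.0)).alias("Column_high"),
    )
    .drop("idx")
)

result.show(truncate=False)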
