
Python: How to run an exponentially weighted moving average in PySpark


I'm trying to compute an exponentially weighted moving average in PySpark using a grouped-map Pandas UDF, but it doesn't work:

def ExpMA(myData):

    import pandas as pd  # missing in the original; pd is used below
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    df = myData
    group_col = 'Name'
    sort_col = 'Date'

    schema = df.select(group_col, sort_col, 'count').schema
    print(schema)

    @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
    def ema(pdf):
        # this apply call is the part that fails -- see the answer below
        Model = pd.DataFrame(pdf.apply(lambda x: x['count'].ewm(span=5, min_periods=1).mean()))
        return Model

    data = df.groupby('Name').apply(ema)

    return data

I also tried writing the EWMA equation directly in PySpark without a Pandas UDF, but the problem is that the EWMA equation contains a lag of the current EWMA value.
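For reference, the lag mentioned above is visible in the recursive form of the EWMA. A minimal pure-pandas sketch (the toy values and `span=5` are assumptions for illustration; `adjust=False` selects the plain recursive form that matches the hand-written recursion):

```python
import pandas as pd

def ewma_recursive(values, span=5):
    # EWMA recursion: ewma[t] = alpha * x[t] + (1 - alpha) * ewma[t-1],
    # with alpha = 2 / (span + 1). Each output depends on the previous
    # output, which is why an ordinary window expression cannot state it.
    alpha = 2.0 / (span + 1)
    out = []
    prev = None
    for x in values:
        prev = x if prev is None else alpha * x + (1 - alpha) * prev
        out.append(prev)
    return out

s = pd.Series([1.0, 3.0, 3.0])
manual = ewma_recursive(s.tolist())
# pandas implements the same recursion when adjust=False
reference = s.ewm(span=5, adjust=False).mean().tolist()
```

This is why the grouped-map UDF route is attractive: inside each group you get an ordinary pandas Series and can let `ewm` handle the recursion.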

First of all, your Pandas code is incorrect. This won't work, Spark or no Spark:

pdf.apply(lambda x: x['count'].ewm(span=5, min_periods=1).mean())
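To see the failure concretely: `DataFrame.apply` iterates over whole columns by default, so the lambda receives a column `Series`, not a row with a `count` field. A small standalone sketch (the toy data is assumed for illustration):

```python
import pandas as pd

pdf = pd.DataFrame({'count': [1, 3, 3]})

# x inside the lambda is a column Series, so x['count'] is a label
# lookup in the row index and the call blows up.
failed = False
try:
    pdf.apply(lambda x: x['count'].ewm(span=5, min_periods=1).mean())
except Exception:
    failed = True

# The intended computation addresses the column directly:
result = pdf['count'].ewm(span=5, min_periods=1).mean()
```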
The other problem is the output schema, which, depending on your data, cannot really accommodate the result:


  • If you want to add the ewm column, the schema should be extended
  • If you only want to return the ewm result, the schema is too large
  • If you just want to replace a column, the types may not match
Let's assume the first scenario (I've allowed myself to rewrite the code a bit):


I don't use Pandas much, so there may be a more elegant way to do this.

Comment: the code reports that the Spark DataFrame has no `apply` function
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import DoubleType, StructField

def exp_ma(df, group_col='Name', sort_col='Date'):
    # Extend the input schema with the result column.
    schema = (df.select(group_col, sort_col, 'count')
        .schema.add(StructField('ewma', DoubleType())))

    @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
    def ema(pdf):
        # Assign to 'ewma' to match the declared schema
        # (the original assigned to 'ewm', which didn't match).
        pdf['ewma'] = pdf['count'].ewm(span=5, min_periods=1).mean()
        return pdf

    return df.groupby(group_col).apply(ema)

df = spark.createDataFrame(
    [("a", 1, 1), ("a", 2, 3), ("a", 3, 3), ("b", 1, 10), ("b", 8, 3), ("b", 9, 0)], 
    ("name", "date", "count")
)

exp_ma(df).show()
# +----+----+-----+------------------+                                            
# |Name|Date|count|              ewma|
# +----+----+-----+------------------+
# |   b|   1|   10|              10.0|
# |   b|   8|    3| 5.800000000000001|
# |   b|   9|    0|3.0526315789473686|
# |   a|   1|    1|               1.0|
# |   a|   2|    3|               2.2|
# |   a|   3|    3| 2.578947368421052|
# +----+----+-----+------------------+
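As a Spark-free sanity check, the same numbers fall out of plain pandas with a grouped `ewm` (column names follow the example above; `transform` here plays the role of the grouped-map UDF applied per group):

```python
import pandas as pd

pdf = pd.DataFrame(
    [("a", 1, 1), ("a", 2, 3), ("a", 3, 3), ("b", 1, 10), ("b", 8, 3), ("b", 9, 0)],
    columns=["name", "date", "count"],
)

# Apply the ewm independently within each name group, as the
# grouped-map UDF does inside each Spark group.
pdf["ewma"] = pdf.groupby("name")["count"].transform(
    lambda s: s.ewm(span=5, min_periods=1).mean()
)
```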