Recursive calculation over a DataFrame with .NET for Spark


I want to compute the RSI (Relative Strength Index) with .NET for Spark.

The formula for the RSI is:

RSI = 100 - 100 / (1 + RS)

RS = Average Gain / Average Loss

The first average gain and average loss are 14-period averages:

First Average Gain = Sum of Gains over the past 14 periods / 14
First Average Loss = Sum of Losses over the past 14 periods / 14

All subsequent calculations are based on the previous averages and the current gain/loss:

Average Gain = [(previous Average Gain) x 13 + current Gain] / 14.

Average Loss = [(previous Average Loss) x 13 + current Loss] / 14.
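For reference, this recursion (Wilder's smoothing) can be sanity-checked outside Spark with a few lines of plain Python; the function name and the sample gains below are illustrative, not part of the original code:

```python
def wilder_averages(values, period=14):
    """Wilder's smoothing: seed with a simple mean of the first
    `period` values, then average recursively."""
    if len(values) < period:
        return []
    averages = [sum(values[:period]) / period]  # first average: plain mean
    for v in values[period:]:
        # (previous average * (period - 1) + current value) / period
        averages.append((averages[-1] * (period - 1) + v) / period)
    return averages

gains = [0.0] * 13 + [14.0, 7.0, 0.0]
print(wilder_averages(gains))
```

Each output row depends on the previous one, which is exactly what makes a pure window-function formulation hard.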
The data is in a DataFrame named rsiCalcPos5 and looks like this:

+--------------------+-----+------+----+-----+------------------+-------------------+----------+------------------+--------------------+-------------------+-------------------+------------------+
|      TimeSeriesType|Year0|Month0|Day0|Hour0|        avg(Value)|          Timestamp|  UnixTime|         nextValue|          deltaValue|               gain|               loss|             gain1|
+--------------------+-----+------+----+-----+------------------+-------------------+----------+------------------+--------------------+-------------------+-------------------+------------------+
|Current Available...| 2021|     3|   3|    9| 219.8235294117647|2021-03-03 09:00:00|1614758400|218.59733449857987| -1.2261949131848269|                0.0| 1.2261949131848269|               0.0|
|Current Available...| 2021|     3|   3|   10|218.59733449857987|2021-03-03 10:00:00|1614762000|185.59442632671212| -33.002908171867745|                0.0| 33.002908171867745|               0.0|
|Current Available...| 2021|     3|   3|   11|185.59442632671212|2021-03-03 11:00:00|1614765600| 190.5523781944545|   4.957951867742366|  4.957951867742366|                0.0|1.6526506225807889|
|Current Available...| 2021|     3|   3|   12| 190.5523781944545|2021-03-03 12:00:00|1614769200|187.88173813444055| -2.6706400600139375|                0.0| 2.6706400600139375|1.2394879669355916|
|Current Available...| 2021|     3|   3|   13|187.88173813444055|2021-03-03 13:00:00|1614772800| 187.6245558053521|-0.25718232908846517|                0.0|0.25718232908846517|0.9915903735484732|
|Current Available...| 2021|     3|   3|   14| 187.6245558053521|2021-03-03 14:00:00|1614776400|186.56644553819817| -1.0581102671539213|                0.0| 1.0581102671539213|0.8263253112903944|
|Current Available...| 2021|     3|   3|   15|186.56644553819817|2021-03-03 15:00:00|1614780000|186.66761484852796| 0.10116931032979437|0.10116931032979437|                0.0|0.7227315968674516|
|Current Available...| 2021|     3|   3|   16|186.66761484852796|2021-03-03 16:00:00|1614783600|165.79466929911155| -20.872945549416414|                0.0| 20.872945549416414|0.6323901472590201|
|Current Available...| 2021|     3|   3|   17|165.79466929911155|2021-03-03 17:00:00|1614787200|178.60478239401849|  12.810113094906939| 12.810113094906939|                0.0|1.9854704747754555|
|Current Available...| 2021|     3|   3|   18|178.60478239401849|2021-03-03 18:00:00|1614790800| 215.3916108565386|  36.786828462520106| 36.786828462520106|                0.0| 5.465606273549921|
|Current Available...| 2021|     3|   3|   19| 215.3916108565386|2021-03-03 19:00:00|1614794400|221.27369459516595|   5.882083738627358|  5.882083738627358|                0.0| 5.503467861284233|
|Current Available...| 2021|     3|   3|   20|221.27369459516595|2021-03-03 20:00:00|1614798000|231.88854705635575|  10.614852461189798| 10.614852461189798|                0.0|  5.92941657794303|
|Current Available...| 2021|     3|   3|   21|231.88854705635575|2021-03-03 21:00:00|1614801600|238.82354991634134|  6.9350028599855875| 6.9350028599855875|                0.0| 6.006769368869381|
|Current Available...| 2021|     3|   3|   22|238.82354991634134|2021-03-03 22:00:00|1614805200|240.02948909258865|  1.2059391762473126| 1.2059391762473126|                0.0| 5.663852926539233|
|Current Available...| 2021|     3|   3|   23|240.02948909258865|2021-03-03 23:00:00|1614808800|240.92351533915001|  0.8940262465613671| 0.8940262465613671|                0.0|              null|
|Current Available...| 2021|     3|   4|    0|240.92351533915001|2021-03-04 00:00:00|1614812400|239.63160854893138| -1.2919067902186328|                0.0| 1.2919067902186328|              null|
|Current Available...| 2021|     3|   4|    1|239.63160854893138|2021-03-04 01:00:00|1614816000|240.48959521094642|  0.8579866620150369| 0.8579866620150369|                0.0|              null|
|Current Available...| 2021|     3|   4|    2|240.48959521094642|2021-03-04 02:00:00|1614819600|192.37784787942516|  -48.11174733152126|                0.0|  48.11174733152126|              null|
|Current Available...| 2021|     3|   4|    3|192.37784787942516|2021-03-04 03:00:00|1614823200|192.96993537510536|  0.5920874956802038| 0.5920874956802038|                0.0|              null|
|Current Available...| 2021|     3|   4|    4|192.96993537510536|2021-03-04 04:00:00|1614826800|193.60104726861024|  0.6311118935048796| 0.6311118935048796|                0.0|              null|
+--------------------+-----+------+----+-----+------------------+-------------------+----------+------------------+--------------------+-------------------+-------------------+------------------+
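As a quick cross-check in plain Python (not part of the original code), the first average gain can be reproduced from the 14 values of the gain column above, rows 09:00 through 22:00:

```python
# Gain values copied from the table, 2021-03-03 09:00 through 22:00 (14 rows)
gains = [
    0.0, 0.0, 4.957951867742366, 0.0, 0.0, 0.0,
    0.10116931032979437, 0.0, 12.810113094906939, 36.786828462520106,
    5.882083738627358, 10.614852461189798, 6.9350028599855875,
    1.2059391762473126,
]
first_average_gain = sum(gains) / 14
print(first_average_gain)  # ≈ 5.66385, matching gain1 in the 22:00 row
```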

I have already calculated gain, loss, and the first average gain (gain1 = 5.663852926539233, since the RSI interval is 14 periods).

Now I am stuck computing the remaining average gains from row 15 onward. The formula is recursive and I don't know how to implement it. So far I have tried window functions, but I don't get the correct result:

WindowSpec windowRSI3 = Microsoft.Spark.Sql.Expressions.Window
     .PartitionBy("TimeSeriesType")
     .OrderBy("Year0", "Month0", "Day0", "Hour0");
DataFrame rsiCalcPos6 = rsiCalcPos5.WithColumn("avgGainj", When(Col("gain1").IsNull(),
      (Lag(Col("gain1"), 1, 0).Multiply(13 / 14).Minus((Col("gain").Multiply(-1 / 14))
      .Over(windowRSI3)))).Otherwise(Col("gain1")));

Here I get an exception:

org.apache.spark.sql.AnalysisException: Expression '(gain#175 * cast(0 as double))' not supported within a window function.

The recursive formula I want to use computes avgGainj one row at a time and uses that result when computing avgGain(j+1).

Any suggestions would be appreciated. Thanks.

I'm not sure my formula is exactly right, but I would approach it like this:

using System;
using System.Collections.Generic;
using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Expressions;
using Microsoft.Spark.Sql.Types;

namespace StackOverflow
{
    class Program
    {
        static void Main(string[] args)
        {
            var spark = SparkSession.Builder().GetOrCreate();

            var df = spark.CreateDataFrame(new List<GenericRow>()
            {
                new GenericRow(new object[]{1, 0.0, 1.226}),
                new GenericRow(new object[]{2, 0.0, 33.09}),
                new GenericRow(new object[]{3, 3.3, 0.0}),
                new GenericRow(new object[]{4, 0.0, 2.67}),
                new GenericRow(new object[]{5, 0.0, 2.67}),
                new GenericRow(new object[]{6, 0.0, 2.67}),
                new GenericRow(new object[]{7, 7.7, 0.0}),
                new GenericRow(new object[]{8, 0.0, 2.67}),
                new GenericRow(new object[]{9, 9.9, 0.0}),
                new GenericRow(new object[]{10, 10.1, 0.0}),
                new GenericRow(new object[]{11, 11.11, 0.0}),
                new GenericRow(new object[]{12, 12.12, 0.0}),
                new GenericRow(new object[]{13, 13.13, 0.0}),
                new GenericRow(new object[]{14, 14.14, 0.0}),
                new GenericRow(new object[]{15, 15.15, 0.0}),
                new GenericRow(new object[]{16, 16.16, 0.0}),
                new GenericRow(new object[]{17, 17.17, 0.0}),
                new GenericRow(new object[]{18, 18.18, 0.0}),
                new GenericRow(new object[]{19, 19.19, 0.0})
            }, new StructType(new List<StructField>()
            {
                new StructField("Row", new IntegerType()),
                new StructField("Gain", new DoubleType()),
                new StructField("Loss", new DoubleType()),
            }));

            df.Show();
            
            // First, use a window over the last 14 rows
            var lastFourteenRowsWindow = Window.OrderBy(Functions.Desc("Row")).RowsBetween(0, 14);

            // Save the sum of the gains over those rows
            var lastFourteenGains = df.WithColumn("LastFourteenGains", Functions.Sum("Gain").Over(lastFourteenRowsWindow));

            // Calculate the average of those (there is also an Avg function you could use instead of Sum / 14)
            var averageGain =
                lastFourteenGains.WithColumn("AverageGain", Functions.Col("LastFourteenGains") / 14);

            // Create a second window without the 14-row frame
            var rowWindow = Window.OrderBy(Functions.Desc("Row"));

            // Use the new window to retrieve the previous average gain
            var previousGains = averageGain.WithColumn("PreviousAverageGain",
                Functions.Lead("AverageGain", 1).Over(rowWindow));

            // (PreviousAverageGain / 13 + AverageGain) / 14
            var result = previousGains.WithColumn("CurrentAverageGains",
                ((Functions.Col("PreviousAverageGain") / 13) + Functions.Col("AverageGain")) / 14);

            result.Show();
        }
    }
}

If you call .Show() between each stage, you can verify that it is correct.


ed

ed, thanks for your answer. The first average gain and loss should come from the first 14 rows, not the last 14. In the end I computed the RSI in a for loop. I know that is far from efficient in Spark, but so far I haven't found a better solution. Thanks for your help.
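The for-loop approach mentioned in this comment can be sketched in plain Python (a driver-side computation over the ordered gain/loss values, e.g. after a Collect() in .NET for Spark); the function name and sample data below are illustrative, not the actual code used:

```python
def rsi_from_gains_losses(gains, losses, period=14):
    """Compute RSI row by row: seed the averages with the mean of the
    first `period` gains/losses, then apply Wilder's recursive smoothing."""
    avg_gain = sum(gains[:period]) / period
    avg_loss = sum(losses[:period]) / period
    rsi = []
    for g, l in zip(gains[period:], losses[period:]):
        avg_gain = (avg_gain * (period - 1) + g) / period
        avg_loss = (avg_loss * (period - 1) + l) / period
        rs = avg_gain / avg_loss if avg_loss else float("inf")
        rsi.append(100 - 100 / (1 + rs))
    return rsi

# Equal gains and losses every period should give an RSI of 50
print(rsi_from_gains_losses([1.0] * 15, [1.0] * 15))
```

This trades Spark parallelism for correctness of the recursion, which matches the trade-off described in the comment.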