Regression in R using vectorization and matrices


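I have a vectorization question about using matrices in R. I have 2 columns that need to be regressed against each other, separately for each group given by a vector of indices. The data are: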

matrix_senttoR = [ ...
                  0.11 0.95
                  0.23 0.34
                  0.67 0.54
                  0.65 0.95
                  0.12 0.54
                  0.45 0.43 ] ;
indices_forR = [ ...
            1
            1
            1
            2
            2
            2 ] ;

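Col1 of the matrix holds the data for MSFT and GOOG (3 rows each), and Col2 holds the returns of the benchmark StkIndex on the corresponding dates. The data is sent from Matlab in matrix format.

I currently use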

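library(zyp)   # provides zyp.sen

# One Theil-Sen fit per group; keep the slope (second coefficient) of each fit.
slope <- by(data.frame(matrix_senttoR), indices_forR,
            FUN = function(x) zyp.sen(X1 ~ X2, data = x)$coefficients[2])
betasFac <- sapply(slope, function(x) x + 0)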

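This could easily have been a comment, but:
Some things to consider: I tend to avoid the by() function because its return value is a funky object. Instead, why not add the indices to the data.frame: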
df <- data.frame(matrix_senttoR) 
df$indices_forR <- indices_forR
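
As an illustration of the split-and-fit pattern without by(), here is a minimal base R sketch. It assumes the matrix has arrived in R as a 6 x 2 numeric matrix, and it uses lm() as a stand-in for zyp.sen(), so the slopes are ordinary least squares slopes rather than Theil-Sen slopes.

```r
# Data from the question, assumed to arrive in R as a numeric matrix.
matrix_senttoR <- matrix(c(0.11, 0.95,
                           0.23, 0.34,
                           0.67, 0.54,
                           0.65, 0.95,
                           0.12, 0.54,
                           0.45, 0.43),
                         ncol = 2, byrow = TRUE)
indices_forR <- c(1, 1, 1, 2, 2, 2)

df <- data.frame(matrix_senttoR)   # columns get the default names X1 and X2
df$indices_forR <- indices_forR

# Split the rows by group, fit one model per group, keep the slope on X2.
# lm() is a stand-in; zyp.sen() could be swapped in if the zyp package is loaded.
betasFac <- sapply(split(df, df$indices_forR),
                   function(g) unname(coef(lm(X1 ~ X2, data = g))[2]))
print(betasFac)
```

The result is a plain named numeric vector (one slope per group), which is easier to pass back to Matlab than the object returned by by().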

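You can do this easily with ddply from the plyr package, and run the groups in parallel by using doMC or doSnow together with ddply's .parallel=TRUE argument.
If speed is the goal, I would also look into a faster package that wraps data.frame. Also, I assume the slow part is the zyp.sen() call rather than the by() call. Running on multiple cores will speed that part up.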
> dput(df)
structure(list(X1 = c(0.11, 0.23, 0.67, 0.65, 0.12, 0.45), X2 = c(0.95, 
0.34, 0.54, 0.95, 0.54, 0.43), indices_forR = c(1, 1, 1, 2, 2, 
2)), .Names = c("X1", "X2", "indices_forR"), row.names = c(NA, 
-6L), class = "data.frame")

> library(plyr)
> ddply(df,.(indices_forR),function(x) lm(X1~X2,data=x)$coeff[2])
  indices_forR         X2
1            1 -0.3702172
2            2  0.6324900
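
The ddply call above runs serially. To show the multi-core idea in a self-contained way, the following sketch uses mclapply() from the base parallel package instead of doMC or doSnow (a swapped-in technique, not the setup named in the answer), again with lm() standing in for zyp.sen(). The choice of two cores is an assumption.

```r
library(parallel)

df <- data.frame(X1 = c(0.11, 0.23, 0.67, 0.65, 0.12, 0.45),
                 X2 = c(0.95, 0.34, 0.54, 0.95, 0.54, 0.43),
                 indices_forR = c(1, 1, 1, 2, 2, 2))

# mclapply() forks worker processes, which Windows does not support,
# so fall back to a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# One data frame per group, then one fit per group, spread over the cores.
groups <- split(df, df$indices_forR)
slopes <- mclapply(groups,
                   function(g) unname(coef(lm(X1 ~ X2, data = g))[2]),
                   mc.cores = n_cores)
slopes <- unlist(slopes)
print(slopes)
```

With only two tiny groups the forking overhead outweighs any gain; the pattern only pays off when each group's fit is expensive.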

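I still think that by going from MATLAB to R and back again, you are overcomplicating things. And passing 150,000 rows of data across is bound to slow things down a lot.
zyp.sen is actually really simple to port to MATLAB. Here you go: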
function [intercept, slope, intercepts, slopes, rank, residuals] = ZypSen(x, y)
% Computes a Theil-Sen estimate of slope for a vector of data.
% x and y are expected to be row vectors of equal length.

n = length(x);

slopes = arrayfun(@(i) ZypSlopediff(i, x, y, n), 1:(n - 1), ...
    'UniformOutput', false);
slopes =  [slopes{:}];
sni = isfinite(slopes);
slope = median(slopes(sni));

intercepts = y - slope * x;
intercept = median(intercepts);

rank = 2;
% Residuals are observed values minus fitted values.
residuals = y - slope * x - intercept;

end


function z = ZypSlopediff(i, x, y, n)

z = (y(1:(n - i)) - y((i + 1):n)) ./ ...
    (x(1:(n - i)) - x((i + 1):n));

end

I checked this against R's zyp.sen using the following example, and it gives the same answer:
x = [0 1 2 4 5]
y = [6 4 1 8 7]
[int, sl, ints, sls, ra, res] = ZypSen(x, y)

You really should do some further checks to make sure, though.
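
One way to do such a check is to compare against an independent implementation. The sketch below is a minimal base R Theil-Sen estimator written for this purpose; theil_sen is a hypothetical helper name, not part of zyp, and it only covers the slope and intercept, not the other outputs of zyp.sen.

```r
# Hypothetical helper: Theil-Sen slope and intercept from all point pairs.
theil_sen <- function(x, y) {
  idx <- combn(length(x), 2)   # all index pairs i < j, one pair per column
  slopes <- (y[idx[2, ]] - y[idx[1, ]]) / (x[idx[2, ]] - x[idx[1, ]])
  slope <- median(slopes[is.finite(slopes)])   # drop pairs with equal x
  intercept <- median(y - slope * x)
  c(intercept = intercept, slope = slope)
}

x <- c(0, 1, 2, 4, 5)
y <- c(6, 4, 1, 8, 7)
fit <- theil_sen(x, y)
print(fit)   # intercept 5.25, slope 0.35
```

For this example the ten pairwise slopes have median 0.35, and the median of y - 0.35 * x is 5.25, so those are the values the MATLAB port should reproduce for the same x and y.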
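If you are just running a regression, why bother passing the data from MATLAB to R at all? The regress function in MATLAB's Statistics Toolbox will do this. It is also a good idea to profile the code to find out what is causing the slowdown. You need to know how much time the grouping and modelling functions take, and how much time is spent passing data between MATLAB and R.

@Richie -> That is because I am trying to do a nonparametric regression, specifically with the zyp package. All my data is in Matlab. My only other option would be to write a Theil-Sen regressor in Matlab myself!

-> The ddply step in something like evalR('slope

@Maddy, that sounds like a Matlab error rather than an R error. I am not sure which package zyp.sen() comes from, but with lm() it works fine in R on its own.

-> Thanks Justin. But I think I will stick with zyp for now. It is nonparametric, and I specifically have to use it. I know this is a Matlab-side error. I am not very proficient in R, so I posted this Q here hoping to find a solution.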
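
The profiling suggestion from the comments can be tried on the R side with system.time(). This is only a sketch: the data is synthetic (150,000 random rows in 1,500 groups of 100, an assumed layout), lm() stands in for zyp.sen(), and the time spent moving data between MATLAB and R is not measured here.

```r
set.seed(1)

# Synthetic stand-in for the real data: 1,500 groups of 100 rows each.
n_groups <- 1500
rows_per_group <- 100
n <- n_groups * rows_per_group
df <- data.frame(X1 = runif(n),
                 X2 = runif(n),
                 indices_forR = rep(seq_len(n_groups), each = rows_per_group))

# Time the grouping step and the model-fitting step separately.
t_split <- system.time(groups <- split(df, df$indices_forR))
t_fit <- system.time(
  slopes <- sapply(groups, function(g) coef(lm(X1 ~ X2, data = g))[[2]])
)

timings <- c(split = t_split[["elapsed"]], fit = t_fit[["elapsed"]])
print(timings)   # elapsed seconds for each step
```

If the fit step dominates, a faster estimator or multiple cores will help; if neither step is slow, the transfer between MATLAB and R is the more likely bottleneck.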
