
How to pass multiple columns to a function inside dplyr::summarize


I'm trying to pass all the columns of a data.frame that match a condition to a function inside dplyr's summarize(), like this:

df %>% group_by(Version, Type) %>%
  summarize(mcll(TrueClass, starts_with("pred")))

Error: argument is of length zero
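Is there a way to do this? A working example follows.

Build a simulated data frame of sample predictions. These are interpreted as the output of a classification algorithm: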

library(dplyr)
nrow <- 40
ncol <- 4
set.seed(567879)

getProbs <- function(i) {
  p <- runif(i)
  return(p / sum(p))
}
df <- data.frame(matrix(NA, nrow, ncol))
for (i in seq(nrow)) df[i, ] <- getProbs(ncol)
names(df) <- paste0("pred.", seq(ncol))
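The rest of the example needs the grouping columns, the true labels, and `mcll()` itself. A minimal sketch of how it could be completed is below. The column layout (four versions, two types, a four-level `TrueClass` factor) and the body of `mcll()` are assumptions made for illustration, not the original code; `n_obs` and `n_class` are hypothetical names used to avoid masking `nrow()` and `ncol()`.

```r
library(dplyr)

set.seed(567879)
n_obs   <- 40  # number of simulated predictions
n_class <- 4   # number of classes, one "pred" column each

# Each row is a vector of class probabilities that sums to 1.
getProbs <- function(i) {
  p <- runif(i)
  p / sum(p)
}
df <- as.data.frame(t(replicate(n_obs, getProbs(n_class))))
names(df) <- paste0("pred.", seq_len(n_class))

# Assumed grouping columns and true labels (not taken from the original post).
df$Version   <- rep(1:4, each = 10)
df$Type      <- rep(c("a", "b"), times = 20)
df$TrueClass <- factor(sample(seq_len(n_class), n_obs, replace = TRUE),
                       levels = seq_len(n_class))

# Assumed multiclass log loss: `act` is a factor of true classes, `pred` is a
# matrix or data frame with one probability column per class level.
# Returns a single number.
mcll <- function(act, pred) {
  stopifnot(is.factor(act))
  pred <- as.matrix(pred)
  stopifnot(nrow(pred) == length(act), ncol(pred) == nlevels(act))
  pred <- pmin(pmax(pred, 1e-15), 1 - 1e-15)  # keep log() finite
  dummies <- model.matrix(~ act - 1)          # one indicator column per level
  -sum(dummies * log(pred)) / length(act)
}

# Log loss over the whole data frame, ignoring the groups.
overall <- mcll(df$TrueClass, select(df, starts_with("pred")))
```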
Ideally I could do this without changing the mcll() function, but I'm open to changing it if that simplifies the rest of the code.

Thanks

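Edit: note that the inputs to mcll are a vector of true values and a matrix of probabilities, with one column per "pred" column. For each subset of the data, mcll should return a scalar. I can get what I want with the code below, but I was hoping for something within the dplyr context.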

mcll_df <- data.frame(matrix(ncol = 3, nrow = 8))
names(mcll_df) <- c("Type", "Version", "mcll")
count = 1
for (ver in unique(df$Version)) {
  for (type in unique(df$Type)) {
    subdat <- df %>% filter(Type == type & Version == ver)
    val <- mcll(subdat$TrueClass, subdat %>% select(starts_with("pred")))
    mcll_df[count, ] <- c(Type = type, Version = ver, mcll = val)
    count = count + 1
  }
}
head(mcll_df)
  Type Version             mcll
1    a       1 1.42972507510096
2    b       1 1.97189000832723
3    a       2 1.97988830406062
4    b       2 1.21387875938737
5    a       3 1.30629638026735
6    b       3 1.48799237895462

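I had to change the mcll function slightly, but then it worked. The problem was in the second if statement. You told the function to take nrow(pred), but when you summarise over multiple columns you are actually supplying a single vector each time (because each column is processed separately). I also swapped the order of the function's arguments.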

mcll <- function (pred, act) 
{
  if (!is.factor(act)) {  # is.factor() also handles ordered factors
    stop("act must be a factor")
  }
   pred[pred == 0] <- 1e-15
   pred[pred == 1] <- 1 - 1e-15

  dummies <- model.matrix(~act - 1)
  if (nrow(dummies) != length(pred)) { # the main change is here
    return(0)
  }
  return(-1 * (sum(dummies * log(pred)))/length(act))
}
df %>% group_by(Version,Type) %>% summarise_each(funs(mcll(., TrueClass)), matches("pred"))

  Version  Type   pred.1   pred.2   pred.3   pred.4
    (int) (chr)    (dbl)    (dbl)    (dbl)    (dbl)
1       1     a 1.475232 1.972779 1.743491 1.161984
2       1     b 2.030829 1.331629 1.397577 1.484865
3       2     a 1.589256 1.740858 1.898906 2.005511
I checked this against a subset of the data, and it appears to work:

mcll(df$pred.1[which(df$Type=="a" & df$Version==1)],
 df$TrueClass[which(df$Type=="a" & df$Version==1)])

[1] 1.475232 #pred.1 mcll when Version equals 1 and Type equals a.
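The point made in this answer, that a per-column summary hands the function one vector at a time and never the whole block of "pred" columns, can be checked on a toy data frame. The sketch below uses `across()` (dplyr 1.0 or later) in place of the older `summarise_each()`; the `toy` data and the column `g` are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data: one grouping column and two "pred" columns.
toy <- data.frame(
  g      = c("x", "x", "y", "y"),
  pred.1 = c(0.2, 0.7, 0.5, 0.1),
  pred.2 = c(0.8, 0.3, 0.5, 0.9)
)

# For each group and each pred column, report whether the function received
# a plain vector (no dim attribute) and how long it was.
shape <- toy %>%
  group_by(g) %>%
  summarise(
    across(starts_with("pred"), ~ is.null(dim(.x)), .names = "{.col}_is_vector"),
    across(starts_with("pred"), ~ length(.x),       .names = "{.col}_length")
  )

print(shape)
```

Every `_is_vector` column comes back `TRUE`, so a function that expects a probability matrix cannot be applied this way without being rewritten.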

This is easy to do with data.table:

library(data.table)

setDT(df)[, mcll(TrueClass, .SD), by = .(Version, Type), .SDcols = grep("^pred", names(df))] 
#   Version Type       V1
#1:       1    a 1.429725
#2:       2    a 1.979888
#3:       3    a 1.306296
#4:       4    a 1.668330
#5:       1    b 1.971890
#6:       2    b 1.213879
#7:       3    b 1.487992
#8:       4    b 1.171286

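As far as I know, this doesn't seem possible. You need the proper context to call helpers such as starts_with(), and I don't think that is available inside summarize() (or at least it wasn't when I looked).

In theory, df %>% group_by(Version, Type) %>% summarise_at(vars(starts_with("pred")), funs(mcll(TrueClass, .))) should do this.

@Luca, that was my first guess, but it doesn't work...

Nice, but not quite what I wanted. I've edited the question above to make it clearer. The pred columns should be bound together in a single data frame and supplied to mcll as the pred argument, and mcll should return a scalar for each subset of the data.

Good to know, so maybe I can figure it out myself now.

I was hoping for a dplyr way, but this does work. Thanks!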
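For the request in these comments (bind the pred columns together per group and get one scalar back), one dplyr-only route is `group_modify()`, which hands each group's rows to a function as a data frame. The sketch below is a hedged illustration, not the accepted answer: the simulated data and the body of `mcll()` are assumptions standing in for the original ones. On dplyr 1.1.0 or later, `summarise(mcll = mcll(TrueClass, pick(starts_with("pred"))))` should give the same result.

```r
library(dplyr)

set.seed(1)
n <- 40
k <- 4

# Assumed stand-in data: k probability columns that sum to 1 per row,
# two grouping columns, and a factor of true classes.
probs <- matrix(runif(n * k), n, k)
probs <- probs / rowSums(probs)
df <- as.data.frame(probs)
names(df) <- paste0("pred.", seq_len(k))
df$Version   <- rep(1:4, each = 10)
df$Type      <- rep(c("a", "b"), times = 20)
df$TrueClass <- factor(sample(seq_len(k), n, replace = TRUE), levels = seq_len(k))

# Assumed multiclass log loss taking a factor and a probability matrix.
mcll <- function(act, pred) {
  stopifnot(is.factor(act))
  pred <- as.matrix(pred)
  stopifnot(nrow(pred) == length(act), ncol(pred) == nlevels(act))
  pred <- pmin(pmax(pred, 1e-15), 1 - 1e-15)
  dummies <- model.matrix(~ act - 1)
  -sum(dummies * log(pred)) / length(act)
}

# .x holds the rows of one group (without the grouping columns), so the
# pred columns can be selected together and passed as a single data frame.
by_group <- df %>%
  group_by(Version, Type) %>%
  group_modify(function(.x, .y) {
    data.frame(mcll = mcll(.x$TrueClass, select(.x, starts_with("pred"))))
  }) %>%
  ungroup()

print(by_group)
```

The result has one row per Version/Type combination and a single `mcll` column, which is the same shape as the output of the nested loop in the question.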