R data.table: counts prior to the current measurement
I have a set of measurements taken over several days. The number of measurements is typically 4. The numbers that can be captured in any measurement range from 1 to 5 (in real life, depending on the test set, the range could be as high as 100 or as low as 20).

For each day, I want to count how many times each value occurred before the current date.

Let me explain with some sample data:
# test data creation
d1 = list(as.Date("2013-5-4"), 4,2)
d2 = list(as.Date("2013-5-9"), 2,5)
d3 = list(as.Date("2013-5-16"), 3,2)
d4 = list(as.Date("2013-5-30"), 1,4)
d = rbind(d1,d2,d3,d4)
colnames(d) <- c("Date", "V1", "V2")
tt = as.data.table(d)
You probably want the %in% operator:
> foo<-sample(1:10,4)
> bar<-sample(1:10,3)
> foo
[1] 5 3 9 6
> bar
[1] 1 7 2
> bar2<-sample(1:10,5)
> bar2
[1] 2 9 4 8 5
> which(bar2%in%foo)
[1] 2 5 #those are the indices of the values in bar2 which appear in foo
> which(bar%in%foo)
integer(0)
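To connect %in% to the question, here is a minimal base R sketch that counts, for each date, how often each value occurs on strictly earlier dates. The data frame d, the vector vals (an assumed value range of 1 to 5) and the matrix prior_counts are illustrative names, not part of the original answer.

```r
# Sample data from the question, rebuilt as a plain data.frame
d <- data.frame(
  Date = as.Date(c("2013-05-04", "2013-05-09", "2013-05-16", "2013-05-30")),
  V1 = c(4, 2, 3, 1),
  V2 = c(2, 5, 2, 4)
)
vals <- 1:5  # assumed range of possible values

# For each row, gather all values from strictly earlier dates and
# count how many of them match each candidate value
prior_counts <- t(sapply(seq_len(nrow(d)), function(r) {
  earlier <- unlist(d[d$Date < d$Date[r], c("V1", "V2")])
  sapply(vals, function(v) sum(earlier %in% v))
}))
colnames(prior_counts) <- paste0("C", vals)

cbind(d, prior_counts)
```

This loops over rows, so it is only meant to show the logic; the data.table and sqldf answers are better suited to larger data.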
Here's a start. I don't see any reason to do it all in one go, though that is possible. Try it yourself.
library(data.table)
DT = as.data.table(d)
DT[,i:=as.numeric(Date)]
setkey(DT,"i")
uv <- 1:max(unlist(DT[,2:3]))
DT[,paste0("C",uv):=lapply(uv,function(x) x %in% unlist(.SD)),.SDcols=2:3,by=i]
DT[,paste0("C",uv):=lapply(.SD,function(x) c(NA,head(cumsum(x),-1))),.SDcols=paste0("C",uv)]
DT[,paste0("PC",uv):=lapply(.SD,function(x) x/(2*.I-2)),.SDcols=paste0("C",uv)]
#          Date V1 V2     i C1 C2 C3 C4 C5 PC1 PC2       PC3       PC4       PC5
# 1: 2013-05-04  4  2 15829 NA NA NA NA NA  NA  NA        NA        NA        NA
# 2: 2013-05-09  2  5 15834  0  1  0  1  0   0 0.5 0.0000000 0.5000000 0.0000000
# 3: 2013-05-16  3  2 15841  0  2  0  1  1   0 0.5 0.0000000 0.2500000 0.2500000
# 4: 2013-05-30  1  4 15855  0  3  1  1  1   0 0.5 0.1666667 0.1666667 0.1666667
1. data.table
First, replace the odd structure of the data in the question with a more conventional one:
library(data.table)
t <- data.table(
Date = as.Date(c("2013-5-4", "2013-5-9", "2013-5-16", "2013-5-30")),
V1 = c(4, 2, 3, 1),
V2 = c(2, 5, 2, 4)
)
The result is:
         Date V1 V2 C1 PC1 C2 PC2 C3       PC3 C4       PC4 C5       PC5
1: 2013-05-04  4  2  0 NaN  0 NaN  0       NaN  0       NaN  0       NaN
2: 2013-05-09  2  5  0   0  1 0.5  0 0.0000000  1 0.5000000  0 0.0000000
3: 2013-05-16  3  2  0   0  2 0.5  0 0.0000000  1 0.2500000  1 0.2500000
4: 2013-05-30  1  4  0   0  3 0.5  1 0.1666667  1 0.1666667  1 0.1666667
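1a. data.table alternative
If it is acceptable to omit the row for the first date (it is not very useful, since no dates precede it), we can perform the following tedious but straightforward self-join: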
t <- data.table(
Date = as.Date(c("2013-5-4", "2013-5-9", "2013-5-16", "2013-5-30")),
V1 = c(4, 2, 3, 1),
V2 = c(2, 5, 2, 4)
)
tt <- t[, one := 1]
setkey(tt, one)
tt[tt,,allow.cartesian=TRUE][Date > Date.1, list(
C1 = sum(.SD == 1), PC1 = mean(.SD == 1),
C2 = sum(.SD == 2), PC2 = mean(.SD == 2),
C3 = sum(.SD == 3), PC3 = mean(.SD == 3),
C4 = sum(.SD == 4), PC4 = mean(.SD == 4),
C5 = sum(.SD == 5), PC5 = mean(.SD == 5)
), by = list(Date, V1, V2), .SDcols = c("V1.1", "V2.1")]
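The self-join idea can also be sketched in base R, which may make the intermediate table easier to inspect. This is an illustration under stated assumptions, not the answer's method: d, pairs, hits and c2 are hypothetical names, and only the count for the value 2 is computed.

```r
d <- data.frame(
  Date = as.Date(c("2013-05-04", "2013-05-09", "2013-05-16", "2013-05-30")),
  V1 = c(4, 2, 3, 1),
  V2 = c(2, 5, 2, 4)
)

# by = NULL asks merge() for the Cartesian product of d with itself;
# columns of the second copy get the ".1" suffix
pairs <- merge(d, d, by = NULL, suffixes = c("", ".1"))

# Keep only pairings where the second copy is strictly earlier
pairs <- pairs[pairs$Date > pairs$Date.1, ]

# Count prior occurrences of the value 2 for each current date
hits <- rowSums(pairs[, c("V1.1", "V2.1")] == 2)
c2 <- tapply(hits, pairs$Date, sum)
c2
```

As in the self-join above, the first date drops out because no earlier row pairs with it.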
2. sqldf
Another alternative to data.table is to use SQL to perform a similarly tedious but straightforward self-join:
library(sqldf)
sqldf("select a.Date, a.V1, a.V2,
sum(((b.V1 = 1) + (b.V2 = 1)) * (a.Date > b.Date)) C1,
sum(((b.V1 = 1) + (b.V2 = 1)) * (a.Date > b.Date)) /
cast (2 * count(*) - 2 as real) PC1,
sum(((b.V1 = 2) + (b.V2 = 2)) * (a.Date > b.Date)) C2,
sum(((b.V1 = 2) + (b.V2 = 2)) * (a.Date > b.Date)) /
cast (2 * count(*) - 2 as real) PC2,
sum(((b.V1 = 3) + (b.V2 = 3)) * (a.Date > b.Date)) C3,
sum(((b.V1 = 3) + (b.V2 = 3)) * (a.Date > b.Date)) /
cast (2 * count(*) - 2 as real) PC3,
sum(((b.V1 = 4) + (b.V2 = 4)) * (a.Date > b.Date)) C4,
sum(((b.V1 = 4) + (b.V2 = 4)) * (a.Date > b.Date)) /
cast (2 * count(*) - 2 as real) PC4,
sum(((b.V1 = 5) + (b.V2 = 5)) * (a.Date > b.Date)) C5,
sum(((b.V1 = 5) + (b.V2 = 5)) * (a.Date > b.Date)) /
cast (2 * count(*) - 2 as real) PC5
from t a, t b where a.Date >= b.Date
group by a.Date")
2a. sqldf alternative
Another approach is to build the above SQL string using string manipulation, like this:
f <- function(i) {
s <- fn$identity("sum(((b.V1 = $i) + (b.V2 = $i)) * (a.Date > b.Date))")
fn$identity("$s C$i,\n $s /\ncast (2 * count(*) - 2 as real) PC$i")
}
s <- fn$identity("select a.Date, a.V1, a.V2, `toString(sapply(1:5, f))`
from t a, t b where a.Date >= b.Date
group by a.Date")
sqldf(s)
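To see what the generated query looks like without running it, the same string can be assembled with base R's sprintf and printed. This is a sketch only; make_terms and sql are hypothetical names introduced here, and the query text mirrors the one above.

```r
# Hypothetical helper: builds the C<i> and PC<i> select terms for value i
make_terms <- function(i) {
  s <- sprintf("sum(((b.V1 = %d) + (b.V2 = %d)) * (a.Date > b.Date))", i, i)
  sprintf("%s C%d,\n %s /\ncast (2 * count(*) - 2 as real) PC%d", s, i, s, i)
}

# Join the per-value terms with commas and wrap them in the select statement
sql <- paste0(
  "select a.Date, a.V1, a.V2, ",
  toString(sapply(1:5, make_terms)),
  "\nfrom t a, t b where a.Date >= b.Date\ngroup by a.Date"
)

cat(sql)
```

Printing the string before passing it to sqldf() makes it easier to check the generated terms, and changing 1:5 is enough to cover a wider value range.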
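Likewise, the SQL string could be generated to avoid the repetition, in the same way as in the previous solution.
Update: added the PC columns and some simplifications.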
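Update 2: added further solutions.

I like the initial solution because it gives me all of the original data together with the new analysis data (even if the first row is garbage). I am struggling to understand how it considers only the "current" row and each row "after" it (rather than "before").

The first solution seems to go deep into multi-level assignments, which is an interesting approach. I wanted to try it, but it returns errors. 1. 'd' is not defined in the code above.

DT[,paste0("PC",uv):=lapply(.SD,function(x) x/(2*.I-2)),.SDcols=paste0("C",uv)]
Error in `[.data.frame`(DT, , `:=`(paste0("PC", uv), lapply(.SD, function(x) x/(2 * : unused argument (.SDcols = paste0("C", uv))

head(x,-1) removes the last element of x. I add NA in the first position to get a vector of the correct length.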
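The shift described here can be checked on a single column. A minimal sketch, assuming x flags whether the value 2 appears on each of the four sample dates (TRUE, TRUE, TRUE, FALSE); running and prior are illustrative names:

```r
# Does the value 2 appear on each date of the sample data?
x <- c(TRUE, TRUE, TRUE, FALSE)

running <- cumsum(x)                 # count up to and including each date
prior <- c(NA, head(running, -1))    # drop the last element, pad the front with NA
prior
```

The result lines up with the C2 column of the first solution: NA for the first date, then the count accumulated before each later date.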
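# Variant of the 1a self-join that avoids repeating the C/PC expressions.
# n, Cnames and PCnames are not defined in the original; assumed definitions:
n <- 5
Cnames <- paste0("C", 1:n)
PCnames <- paste0("PC", 1:n)
tt[tt,,allow.cartesian=TRUE][Date > Date.1, setNames(as.list(rbind(
sapply(1:n, function(i, .SD) sum(.SD==i), .SD=.SD),
sapply(1:n, function(i, .SD) mean(.SD==i), .SD=.SD)
)), c(rbind(Cnames, PCnames))),
by = list(Date, V1, V2), .SDcols = c("V1.1", "V2.1")]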
# Variant that omits the first date by joining only on strictly earlier rows.
# Each row contributes two values, so the average match count is halved to give a proportion.
sqldf("select a.Date, a.V1, a.V2,
sum((b.V1 = 1) + (b.V2 = 1)) C1,
avg((b.V1 = 1) + (b.V2 = 1)) / 2 PC1,
sum((b.V1 = 2) + (b.V2 = 2)) C2,
avg((b.V1 = 2) + (b.V2 = 2)) / 2 PC2,
sum((b.V1 = 3) + (b.V2 = 3)) C3,
avg((b.V1 = 3) + (b.V2 = 3)) / 2 PC3,
sum((b.V1 = 4) + (b.V2 = 4)) C4,
avg((b.V1 = 4) + (b.V2 = 4)) / 2 PC4,
sum((b.V1 = 5) + (b.V2 = 5)) C5,
avg((b.V1 = 5) + (b.V2 = 5)) / 2 PC5
from t a, t b where a.Date > b.Date
group by a.Date")