How do I create columns that count another column under specific conditions? R
Below, the data has been reshaped, and the input and expected output are listed.
Data
structure(list(record_id = c(110101, 110101, 110101, 110101,
110101, 110101, 110101, 110101, 110101, 110101, 110101, 110101,
110101, 110101, 110101, 110101, 110101, 110101, 110101, 110101,
110101, 110101, 110101, 110101, 110101, 110101, 110101, 110101,
110101, 110101, 110101, 110101, 110101, 110101, 110101, 110101,
110101, 110101, 110101, 110101, 110101, 110101, 110101, 110101,
110101, 110101, 110101, 110101, 110101, 110101, 110101, 110101,
110101, 110101, 110101, 110101, 110101, 110101, 110101, 110101
), start = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30,
31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46,
47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59), stop = c(1,
2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
52, 53, 54, 55, 56, 57, 58, 59, 60), `treatment (type)` = c(1,
1, 1, 0, 0, 0, 0, 2, 2, 2, 0, 0, 0, 0, 0, 0, 3, 3, 0, 3, 3, 3,
0, 2, 2, 2, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), n_interruption_periods = c(0,
0, 0, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3,
4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5), n_interruption_periods_3days = c(0,
0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3), n_interruption_days_3days = c(0,
0, 0, 0, 0, 1, 2, 2, 2, 2, 2, 2, 3, 4, 5, 6, 6, 6, 6, 6, 6, 6,
6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7)), row.names = c(NA,
-60L), class = c("tbl_df", "tbl", "data.frame"))
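Explanation
Input
start and stop are day counts. The daily treatment is listed in treatment (type): 0 = no treatment, which is an interruption, and 1:3 are treatments A/B/C.
Output
Based on the treatment (type) column, I would like to compute, for each day:
n_interruption_periods: the cumulative number of interruption periods, regardless of how long each interruption lasts
n_interruption_periods_3days: the cumulative number of interruptions, where an interruption is only counted if it lasts >= 3 days. Interruptions shorter than 3 days are not of interest
n_interruption_days_3days: the cumulative number of interruption days, where an interruption is only counted from its 3rd day onwards
How can the output variables above be computed automatically from the treatment variable?
Hope you can help
BW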
Response from OP
Here is part of the data that illustrates the problem:
structure(list(record_id = c(110001, 110002, 110002, 110002,
110001), day_count = c(732, 0, 1, 2, 733), day_count_stop = c(733,
1, 2, 3, 734), oac_class = c(0, 1, 1, 1, 1), n_interruption_periods = c(1,
1, 0, 0, 1), n_interruption_periods_3days = c(1, 1, 0, 0, 1)), row.names = c(NA,
-5L), groups = structure(list(record_id = c(110001, 110002),
.rows = structure(list(c(1L, 5L), 2:4), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -2L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
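With the suggested code there are two problems: the first values of n_interruption_periods and n_interruption_periods_3days are carried over from the results of record 110001.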
BW
EDIT: I scrapped everything and started over. For your sake, I really hope someone can come up with a less messy answer, but these functions should work.
FindFirstVector = function(TreatmentVector){
  # Which entries are equal to 0
  id = which(TreatmentVector == 0)
  # IDs of the first zeros occurring (first day without treatment)
  id1 = id[c(0,diff(id)) != 1]
  # Create vector of zeroes
  temp = rep(0,length(TreatmentVector))
  # Add 1 for the first zero
  temp[id1] = 1
  # Take cumulative sum
  cumsum(temp)
}
FindSecondVector = function(TreatmentVector){
  # Which entries are equal to 0
  id = which(TreatmentVector == 0)
  # IDs of the first zeros occurring (first day without treatment)
  id1 = id[c(0,diff(id)) != 1]
  # IDs of the last zeros (last day without treatment)
  id2 = id[c(diff(id),2) > 1]
  # Number of days without treatment is then:
  d = id2 - id1 + 1
  # id3 is the starting id of a period of no treatment, if the period is longer
  # than 2 days. Then 2 is added, so counting starts from day 3 of the period.
  id3 = id1[id2 - id1 + 1 > 2] + 2
  temp = rep(0,length(TreatmentVector))
  temp[id3] = 1
  cumsum(temp)
}
# Building third vector ---------------------------------------------------
FindThirdVector = function(TreatmentVector){
  # Which entries are equal to 0
  id = which(TreatmentVector == 0)
  # IDs of the first zeros occurring (first day without treatment)
  id1 = id[c(0,diff(id)) != 1]
  # IDs of the last zeros (last day without treatment)
  id2 = id[c(diff(id),2) > 1]
  # Number of days without treatment is then:
  d = id2 - id1 + 1
  # id3 is the starting id of a period of no treatment, if the period is longer
  # than 2 days. Then 2 is added, so counting starts from day 3 of the period.
  id3 = id1[id2 - id1 + 1 > 2] + 2
  # The id of the ending day of a period without treatment longer than 2 days.
  id4 = id2[id2 - id1 + 1 > 2]
  # d is the number of days for which to add 1's
  d = id4-id3
  temp = rep(0,length(TreatmentVector))
  while(any(d!=0)){
    temp[id3 + d] = 1
    d = d - 1
    d[d<0] = 0
  }
  temp[id3 + d] = 1
  cumsum(temp)
}
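To make the index arithmetic in these functions easier to follow, here is a minimal sketch on a toy vector. The vector `x` and the names `starts`, `ends`, `run_length` and `day3` are illustrative stand-ins for `id1`, `id2`, `d` and `id3`; they are not part of the answer's code.

```r
# Toy treatment vector (hypothetical): 0 = no treatment
x <- c(1, 0, 0, 0, 2, 0, 3)

# Positions of all days without treatment
id <- which(x == 0)

# A zero starts a run when it does not directly follow another zero
starts <- id[c(0, diff(id)) != 1]

# A zero ends a run when the next zero is more than one position away
ends <- id[c(diff(id), 2) > 1]

# Length of each interruption in days
run_length <- ends - starts + 1

# Position of day 3 of every interruption lasting at least 3 days
day3 <- starts[run_length > 2] + 2

print(data.frame(starts, ends, run_length))
print(day3)
```

Only the first run (positions 2 to 4) is long enough to reach a third day, so `day3` holds a single position.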
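Here is a shorter (and, in my opinion, cleaner) solution using dplyr. I'm not sure what errors you ran into with the other solution, but this may work better for you.
library(dplyr)
library(purrr)  # accumulate() comes from purrr
# Group by record id
data = data %>% group_by(record_id)
# Define the helper column
count_streak = function(v) accumulate(v, ~ if_else(.y, .x + 1, 0), .init = 0)[-1]
data = data %>% mutate(interruption_streak = count_streak(`treatment (type)` == 0))
data = data %>%
  mutate(n_interruption_periods = cumsum(interruption_streak == 1),
         n_interruption_periods_3days = cumsum(interruption_streak == 3),
         n_interruption_days_3days = cumsum(interruption_streak >= 3))
We define a helper column interruption_streak that counts up with each day of the current interruption period. So on the first day of each interruption period it is 1, on the second day it is 2, and so on.
From this we can compute the other columns:
n_interruption_periods is simply the cumulative count of days on which an interruption period starts
n_interruption_periods_3days is the cumulative count of days that are the third day of an interruption period
n_interruption_days_3days is the cumulative count of days that are the third or a later day of an interruption period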
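As a quick sanity check of the streak idea, here is a small base R sketch on a made-up vector. It builds the streak with `ave()` and `cumsum()` rather than the `accumulate()` helper from the answer, so treat it as an equivalent illustration under that assumption, not as the answer's own code; the vector `x` is hypothetical.

```r
# Hypothetical treatment vector: 0 = interruption
x <- c(1, 0, 0, 0, 0, 2, 0, 0, 1)
is_zero <- x == 0

# Every treated day opens a new block; within a block, count the zeros so far.
# This yields 1, 2, 3, ... on consecutive interruption days and 0 otherwise.
streak <- ave(as.integer(is_zero), cumsum(!is_zero), FUN = cumsum)

n_interruption_periods       <- cumsum(streak == 1)  # a period starts on streak day 1
n_interruption_periods_3days <- cumsum(streak == 3)  # a period qualifies on its 3rd day
n_interruption_days_3days    <- cumsum(streak >= 3)  # every day from the 3rd day on

print(rbind(x, streak, n_interruption_periods,
            n_interruption_periods_3days, n_interruption_days_3days))
```

On grouped data the same expressions would have to be evaluated per record_id, otherwise a streak could run over from one record into the next.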
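I hope this explanation makes sense; otherwise, feel free to ask.
I think we can modify the function I wrote last time to solve all of your problems. Consider the following function.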
conditional_count <- function(x, n, pfill = function(p0) integer(length(p0)), ifill = seq_along, iend = 30L) {
  len <- length(x); out <- integer(len)
  p0 <- which(x == 0L)
  if (n > 1L)
    p0 <- Reduce(function(idx, i) {
      lidx <- idx - i + 1L
      idx <- idx[lidx > 0L]; lidx <- lidx[lidx > 0L]
      idx[x[lidx] == 0L]
    }, seq_len(n)[-1L], p0)
  if (length(p0) < 1L)
    return(out)
  ub <- pmin(c(tail(p0, -1L), len), p0 + iend - 1L)
  rl <- ub - p0 + 1L
  pfill <- pfill(p0)
  res <- unlist(lapply(seq_along(rl), function(i) ifill(integer(rl[[i]])) + pfill[[i]]))
  pos <- inverse.rle(list(lengths = rl, values = p0)) + unlist(lapply(rl, seq_len)) - 1L
  `[<-`(out, pos, res)
}
Taking n = 1 as an example, the previous question reduces to
conditional_count(x, 1L, function(p0) integer(length(p0)), seq_along, 30L)
ifill + pfill : 1 2 3 4 ... 1 2 3 4 ...
ifill is a sequence along the gap positions: 1 2 3 4 ... 1 2 3 4 ...
pfill is always 0 at all positions of p0 : 0 0
p0 identifies : v v
x looks like : 1 2 0 ........ 0
This question reduces to
conditional_count(x, 1L, function(p0) cumsum(p0 - head(c(-1L, p0), -1L) > 1L), function(x) integer(length(x)), Inf)
ifill + pfill : 1 1 1 ... 2 2 ...
ifill is always 0 along the gap positions : 0 0 0 ... 0 0 ... (iend = Inf means filling in a sequence until the end of the gap)
pfill increases 1 at each starting streak of 0s: 1 2
p0 identifies : v v v v v
x looks like : 1 2 0 0 0 ....... 0 0 ...
conditional_count(x, 1L, seq_along, function(x) integer(length(x)), Inf)
ifill + pfill : 1 2 3 ... 4 5 ...
ifill is always 0 along the gap positions: 0 0 0 ... 0 0 ... (iend = Inf means filling in a sequence until the end of the gap)
pfill increases 1 at each 0 : 1 2 3 4 5
p0 identifies : v v v v v
x looks like : 1 2 0 0 0 ....... 0 0 ...
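The two `pfill` choices shown in the diagrams can be checked in isolation. In this small sketch the positions in `p0` are made up for illustration; `count_streak` is the same one-line helper that appears in the complete script.

```r
# Helper from the complete script: adds 1 whenever a position does not
# directly follow the previous one, i.e. whenever a new streak of 0s starts.
count_streak <- function(p0) cumsum(p0 - head(c(-1L, p0), -1L) > 1L)

# Hypothetical positions of zeros: one streak at 3..5, another at 9..10
p0 <- c(3L, 4L, 5L, 9L, 10L)

period_index <- count_streak(p0)  # same value within a streak, +1 per new streak
day_index    <- seq_along(p0)     # +1 at every zero

print(period_index)
print(day_index)
```

With `count_streak` as `pfill` the filled value only changes when a new interruption begins, whereas `seq_along` increases on every interruption day.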
The complete script for this question is
library(dplyr)  # provides %>% and mutate()
conditional_count <- function(x, n, pfill = function(p0) integer(length(p0)), ifill = seq_along, iend = 30L) {
  len <- length(x); out <- integer(len)
  p0 <- which(x == 0L)
  if (n > 1L)
    p0 <- Reduce(function(idx, i) {
      lidx <- idx - i + 1L
      idx <- idx[lidx > 0L]; lidx <- lidx[lidx > 0L]
      idx[x[lidx] == 0L]
    }, seq_len(n)[-1L], p0)
  if (length(p0) < 1L)
    return(out)
  ub <- pmin(c(tail(p0, -1L), len), p0 + iend - 1L)
  rl <- ub - p0 + 1L
  pfill <- pfill(p0)
  res <- unlist(lapply(seq_along(rl), function(i) ifill(integer(rl[[i]])) + pfill[[i]]))
  pos <- inverse.rle(list(lengths = rl, values = p0)) + unlist(lapply(rl, seq_len)) - 1L
  `[<-`(out, pos, res)
}
count_streak <- function(p0) cumsum(p0 - head(c(-1L, p0), -1L) > 1L)
integer_along <- function(x) integer(length(x))
df %>%
  mutate(
    n_interruption_periods = conditional_count(`treatment (type)`, 1L, count_streak, integer_along, Inf),
    n_interruption_periods_3days = conditional_count(`treatment (type)`, 3L, count_streak, integer_along, Inf),
    n_interruption_days_3days = conditional_count(`treatment (type)`, 3L, seq_along, integer_along, Inf)
  )
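Thanks. To test whether this works, I tried dat$test <- cumsum(diff(so_interruption_df$`treatment (type)`))
Oh, right, sorry. I misremembered diff() as always starting with 0. It is a function that outputs the differences between consecutive items of a vector. It would also cause problems if the first item in treatment is 0, so on second thought this approach is not recommended.
@KBChu I have updated my answer. It turned out much longer than I expected, though. It can almost certainly be shortened, even using base R.
@KBChu I assume the different persons are defined by their Id? In that case, something like: n_interruption_periods = unname(unlist(tapply(dat$`treatment (type)`, dat$record_id, FindFirstVector))) should work. tapply will in this case output a list of named vectors, one per Id, hence the unlist and unname. If any part of the code is not explained well enough, feel free to ask what it does.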
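The per-record application suggested in the comment above can be sketched as follows. The data frame `dat`, its plain `treatment` column and the helper `count_periods` (a compact stand-in for FindFirstVector) are hypothetical and only serve the illustration; the `ave()` variant is an alternative added here, not something proposed in the thread.

```r
# Hypothetical toy data: two records, rows already sorted by record_id
dat <- data.frame(
  record_id = c(1, 1, 1, 1, 2, 2, 2),
  treatment = c(1, 0, 0, 1, 0, 2, 0)
)

# Stand-in for FindFirstVector: cumulative count of interruption starts
count_periods <- function(v) {
  is_zero <- v == 0
  # A start is a zero whose previous day (if any) was not a zero
  starts <- is_zero & !c(FALSE, head(is_zero, -1))
  cumsum(starts)
}

# tapply() returns one vector per record_id; flatten and drop the names
per_id <- tapply(dat$treatment, dat$record_id, count_periods)
dat$n_interruption_periods <- unname(unlist(per_id))

# ave() gives the same values and writes them back in the original row order
dat$n_interruption_periods_ave <- ave(dat$treatment, dat$record_id, FUN = count_periods)

print(dat)
```

Because `tapply()` returns its results ordered by group, the flattened vector lines up with the rows only when the data is sorted by record_id; `ave()` does not depend on that ordering. In both cases the count restarts for each record, so nothing is carried over from the previous record.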
record_id start stop treatment (type) n_interruption_periods n_interruption_periods_3days n_interruption_days_3days
1 110101 0 1 1 0 0 0
2 110101 1 2 1 0 0 0
3 110101 2 3 1 0 0 0
4 110101 3 4 0 1 0 0
5 110101 4 5 0 1 0 0
6 110101 5 6 0 1 1 1
7 110101 6 7 0 1 1 2
8 110101 7 8 2 1 1 2
9 110101 8 9 2 1 1 2
10 110101 9 10 2 1 1 2
11 110101 10 11 0 2 1 2
12 110101 11 12 0 2 1 2
13 110101 12 13 0 2 2 3
14 110101 13 14 0 2 2 4
15 110101 14 15 0 2 2 5
16 110101 15 16 0 2 2 6
17 110101 16 17 3 2 2 6
18 110101 17 18 3 2 2 6
19 110101 18 19 0 3 2 6
20 110101 19 20 3 3 2 6
21 110101 20 21 3 3 2 6
22 110101 21 22 3 3 2 6
23 110101 22 23 0 4 2 6
24 110101 23 24 2 4 2 6
25 110101 24 25 2 4 2 6
26 110101 25 26 2 4 2 6
27 110101 26 27 0 5 2 6
28 110101 27 28 0 5 2 6
29 110101 28 29 0 5 3 7
30 110101 29 30 1 5 3 7
31 110101 30 31 1 5 3 7
32 110101 31 32 1 5 3 7
33 110101 32 33 1 5 3 7
34 110101 33 34 1 5 3 7
35 110101 34 35 1 5 3 7
36 110101 35 36 1 5 3 7
37 110101 36 37 1 5 3 7
38 110101 37 38 1 5 3 7
39 110101 38 39 1 5 3 7
40 110101 39 40 1 5 3 7
41 110101 40 41 1 5 3 7
42 110101 41 42 1 5 3 7
43 110101 42 43 1 5 3 7
44 110101 43 44 1 5 3 7
45 110101 44 45 1 5 3 7
46 110101 45 46 1 5 3 7
47 110101 46 47 1 5 3 7
48 110101 47 48 1 5 3 7
49 110101 48 49 1 5 3 7
50 110101 49 50 1 5 3 7
51 110101 50 51 1 5 3 7
52 110101 51 52 1 5 3 7
53 110101 52 53 1 5 3 7
54 110101 53 54 1 5 3 7
55 110101 54 55 1 5 3 7
56 110101 55 56 1 5 3 7
57 110101 56 57 1 5 3 7
58 110101 57 58 1 5 3 7
59 110101 58 59 1 5 3 7
60 110101 59 60 1 5 3 7