R: How to conditionally compute a consecutive count on a column, by group

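I am trying to get a consecutive count from the Noshow column, grouped by the PatientID column. The code I am using below comes very close to the result I want. However, the sum function returns the sum of the whole group. I would like sum to add up only the current row and the rows above it that have a "1". Basically, I am counting, for each row, the number of consecutive times a patient has failed to show up for an appointment, and then resetting to 0 when they do show. It seems that only a small tweak to my code below is needed, but I cannot find an answer anywhere on this site.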

transform(df, ConsecNoshows = ifelse(Noshow == 0, 0, ave(Noshow, PatientID, FUN = sum)))

The code above produces the following output:

#Source: local data frame [12 x 3]
#Groups: ID [2]
#
#   PatientID Noshow ConsecNoshows
#       <int>  <int>         <int>   
#1          1      0             0
#2          1      1             4
#3          1      0             0
#4          1      1             4
#5          1      1             4
#6          1      1             4
#7          2      0             0
#8          2      0             0
#9          2      1             3
#10         2      1             3
#11         2      0             0
#12         2      1             3
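
For anyone who wants to reproduce this, the input can be rebuilt from the printed output above. This is only a sketch: the name df matches the call in the question, and the integer column types are assumed from the <int> headers.

```r
# Example data copied from the printed output (column types assumed to be integer)
df <- data.frame(
  PatientID = rep(1:2, each = 6),
  Noshow    = c(0L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 0L, 1L)
)

# The original attempt: ave() with sum puts the whole group's total on every row
res <- transform(df, ConsecNoshows = ifelse(Noshow == 0, 0, ave(Noshow, PatientID, FUN = sum)))
res$ConsecNoshows
# [1] 0 4 0 4 4 4 0 0 3 3 0 3
```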


This is what I want:

#Source: local data frame [12 x 3]
#Groups: ID [2]
#
#   PatientID Noshow ConsecNoshows
#       <int>  <int>         <int>   
#1          1      0             0
#2          1      1             0
#3          1      0             1
#4          1      1             0
#5          1      1             1
#6          1      1             2
#7          2      0             0
#8          2      0             0
#9          2      1             0
#10         2      1             1
#11         2      0             2
#12         2      1             0


[Update] I want the consecutive count to be offset down by one row.

Thanks in advance for any help.

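The most straightforward way to group consecutive values is to use rleid from data.table. Here is an option with the data.table package, where you group the data by PatientID as well as by the rleid of the Noshow variable. You also need the cumsum function, rather than sum, to get a cumulative sum of the Noshow variable: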

library(data.table)
setDT(df)[, ConsecNoshows := ifelse(Noshow == 0, 0, cumsum(Noshow)), .(PatientID, rleid(Noshow))]
df
#    PatientID Noshow ConsecNoshows
# 1:         1      0             0
# 2:         1      1             1
# 3:         1      0             0
# 4:         1      1             1
# 5:         1      1             2
# 6:         1      1             3
# 7:         2      0             0
# 8:         2      0             0
# 9:         2      1             1
#10:         2      1             2
#11:         2      0             0
#12:         2      1             1
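
To see what the grouping does, the run identifiers can be reproduced in base R. This is a sketch rather than data.table's own implementation: for a single vector, rle() plus rep() yields the same kind of run index that rleid(Noshow) supplies to the grouping, and the vector below is simply the Noshow column from the question.

```r
# Noshow column from the question
noshow <- c(0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1)

# Number each run of identical consecutive values: 1 for the first run, 2 for the next, ...
run_id <- with(rle(noshow), rep(seq_along(lengths), lengths))
run_id
# [1] 1 2 3 4 4 4 5 5 6 6 7 8

# cumsum() restarted within each run gives the running count; sum() would give the run total
running <- ave(noshow, run_id, FUN = cumsum)
running
# [1] 0 1 0 1 2 3 0 0 1 2 0 1
```

Grouping by PatientID in addition to the run index keeps a run from continuing across two patients when one patient's last row and the next patient's first row happen to hold the same value.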


Here is another (similar) data.table approach:
library(data.table)
setDT(df)[, ConsecNoshows := seq(.N) * Noshow, by = .(PatientID, rleid(Noshow))]
df
#     PatientID Noshow ConsecNoshows
#  1:         1      0             0
#  2:         1      1             1
#  3:         1      0             0
#  4:         1      1             1
#  5:         1      1             2
#  6:         1      1             3
#  7:         2      0             0
#  8:         2      0             0
#  9:         2      1             1
# 10:         2      1             2
# 11:         2      0             0
# 12:         2      1             1

This basically groups by PatientID and by the "run-length encoding" of Noshow, then creates a sequence from each group's size and multiplies it by Noshow, so that values are kept only where Noshow == 1.

We can also use rle from base R (no packages needed). Using ave, we group by 'PatientID', get the rle of 'Noshow', and multiply the sequence of 'lengths' by the 'values' replicated 'lengths' times to get the expected output:
helperfn <- function(x) with(rle(x), sequence(lengths) * rep(values, lengths))
df$ConsecNoshows <- with(df, ave(Noshow, PatientID, FUN = helperfn))
df$ConsecNoshows 
#[1] 0 1 0 1 2 3 0 0 1 2 0 1
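
The intermediate pieces of helperfn are easier to follow when printed one at a time. The sketch below walks through them for the Noshow values of PatientID 1 from the question; the variable names r, position and run_value are only illustrative.

```r
# Noshow values for PatientID 1, taken from the question
x <- c(0, 1, 0, 1, 1, 1)

r <- rle(x)
r$lengths   # length of each run
# [1] 1 1 1 3
r$values    # value repeated in each run
# [1] 0 1 0 1

# Position of each row within its own run
position <- sequence(r$lengths)
position
# [1] 1 1 1 1 2 3

# Run value expanded back to one entry per row
run_value <- rep(r$values, r$lengths)
run_value
# [1] 0 1 0 1 1 1

# Runs of 0 are zeroed out, runs of 1 keep their position within the run
position * run_value
# [1] 0 1 0 1 2 3
```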


I would create a helper function and then use whichever implementation you are most comfortable with:
sum0 <- function(x) {x[x == 1]=sequence(with(rle(x), lengths[values == 1]));x}

#base R
transform(df1, Consec = ave(Noshow, PatientID, FUN=sum0))

#dplyr
library(dplyr)
df1 %>% group_by(PatientID) %>% mutate(Consec=sum0(Noshow))

#data.table
library(data.table)
setDT(df1)[, Consec := sum0(Noshow), by = PatientID]
  #    PatientID Noshow Consec
  #        <int>  <int>  <int>
  # 1          1      0      0
  # 2          1      1      1
  # 3          1      0      0
  # 4          1      1      1
  # 5          1      1      2
  # 6          1      1      3
  # 7          2      0      0
  # 8          2      0      0
  # 9          2      1      1
  # 10         2      1      2
  # 11         2      0      0
  # 12         2      1      1
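
The update in the question asks for the count shifted down one row, which the snippets above do not do directly. One possible way, sketched here in base R, is to lag the result within each patient. The helper lag1 and the construction of df1 are illustrative assumptions, not part of the original answers; sum0 is the helper from above, repeated so the block runs on its own.

```r
# Example data rebuilt from the question's printed output (assumed integer columns)
df1 <- data.frame(
  PatientID = rep(1:2, each = 6),
  Noshow    = c(0L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 0L, 1L)
)

# Helper from the answer above: running count inside each run of 1s
sum0 <- function(x) {
  x[x == 1] <- sequence(with(rle(x), lengths[values == 1]))
  x
}

# Hypothetical helper: shift a vector down one position, padding the first row with 0
lag1 <- function(x) c(0L, head(x, -1))

# Count within each patient, then shift within each patient
df1$ConsecNoshows <- ave(df1$Noshow, df1$PatientID, FUN = function(x) lag1(sum0(x)))
df1$ConsecNoshows
# [1] 0 0 1 0 1 2 0 0 0 1 2 0
```

Doing the shift inside the per-patient function keeps one patient's last count from spilling into the next patient's first row, and the result matches the desired output shown in the question.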

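I wish I could accept all of your solutions, since they all do exactly what I asked for. Thank you very much for your help!

David, I marked your answer as correct because you provided the shortest code needed to get the job done. After doing more work on my model, I found that I actually need to offset the consecutive count results by 1. So if the first two rows are counted as noshows, then the third row s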