R: Creating a function to replace NAs in a column that counts sessions

I have a data frame that looks like the sample below:

#sample data frame
   clientId actual_time           session
1  A        2016-11-01 00:00:00   1             
2  A        2016-11-01 00:05:00   1
3  A        2016-11-01 00:35:01   2
4  A        2016-11-01 00:40:00   NA
5  A        2016-11-01 01:10:01   NA         
6  B        2016-11-01 01:00:00   1
7  B        2016-11-01 01:05:00   1
8  B        2016-11-01 01:30:00   1
9  B        2016-11-01 01:40:00   1
10 B        2016-11-01 01:50:00   NA
11 C        2016-11-01 02:00:00   NA
12 C        2016-11-01 02:35:00   NA
13 C        2016-11-01 04:35:00   NA
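I want to fill the NAs in the "session" column using the following logic:

  • For the same "clientId", if the time difference between two consecutive rows is 30 minutes or more, the later row starts a new session (the earlier row's session plus 1); if the difference is less than 30 minutes, both rows belong to the same session and share its session number
  • Session numbers are cumulative and start from 1, i.e. numbering restarts at 1 for each new clientId
After the NAs are filled, the data frame should look like this: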

#sample data frame (result)
   clientId actual_time           session
1  A        2016-11-01 00:00:00   1             
2  A        2016-11-01 00:05:00   1
3  A        2016-11-01 00:35:01   2
4  A        2016-11-01 00:40:00   2
5  A        2016-11-01 01:10:01   3         
6  B        2016-11-01 01:00:00   1
7  B        2016-11-01 01:05:00   1
8  B        2016-11-01 01:30:00   1
9  B        2016-11-01 01:40:00   1
10 B        2016-11-01 01:50:00   1
11 C        2016-11-01 02:00:00   1
12 C        2016-11-01 02:35:00   2
13 C        2016-11-01 04:35:00   3
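As a worked check of the rule, here is a minimal sketch for client C alone. It assumes the rows are already sorted by time; the variable names `times`, `gap_mins` and `session`, and the choice of `tz = "UTC"`, are only for illustration.

```r
# Timestamps for client C from the sample data.
times <- as.POSIXct(c("2016-11-01 02:00:00",
                      "2016-11-01 02:35:00",
                      "2016-11-01 04:35:00"), tz = "UTC")

# Gap to the previous row in minutes; the first row has no predecessor, so use 0.
gap_mins <- c(0, as.numeric(diff(times), units = "mins"))

# Each gap of 30 minutes or more opens a new session; numbering starts at 1.
session <- cumsum(gap_mins >= 30) + 1
```

The gaps are 35 and 120 minutes, so every row of client C opens its own session, matching rows 11 to 13 of the expected result.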
I tried:

df<-data.frame(clientId=c(rep('A',5),rep('B',5),rep('C',3)),
       actual_time=as.POSIXct(c("2016-11-01 00:00:00","2016-11-01 00:05:00","2016-11-01 00:35:01","2016-11-01 00:40:00","2016-11-01 01:10:01",
                       "2016-11-01 01:00:00","2016-11-01 01:05:00","2016-11-01 01:30:00","2016-11-01 01:40:00","2016-11-01 01:50:00",
                       "2016-11-01 02:00:00","2016-11-01 02:35:00","2016-11-01 04:35:00")),
       session=c(1,1,2,NA,NA,1,1,1,1,NA,NA,NA,NA))  

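my_session <- function(df) {
  # Assumes rows are sorted by clientId, then by actual_time.
  # Row 1 is never visited, so its session must not be NA.
  for (i in seq_len(nrow(df))[-1]) {
    if (is.na(df$session[i])) {
      if (df$clientId[i] == df$clientId[i - 1]) {
        # Gap to the previous row of the same client, in minutes.
        # The time zone must be a quoted string passed as tz.
        gap <- as.numeric(difftime(df$actual_time[i], df$actual_time[i - 1],
                                   tz = "Asia/Taipei", units = "mins"))
        if (gap >= 30) {
          df$session[i] <- df$session[i - 1] + 1  # 30 minutes or more: new session
        } else {
          df$session[i] <- df$session[i - 1]      # under 30 minutes: same session
        }
      } else {
        df$session[i] <- 1  # first row of a new client
      }
    }
  }

  return(df)
}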

df2<-my_session(df)

I'd suggest a data.table approach, which should scale better than your existing function:

library(data.table)
DT <- as.data.table(df) # or setDT(df)
# Recomputes session for every row (not only the NAs), per clientId
DT[, session := cumsum(difftime(actual_time, shift(actual_time, 
               fill = min(actual_time)), units = "mins") >= 30) + 1L, 
    by = clientId]
I used ddply() to solve this:

df$actual_time <- as.POSIXct(df$actual_time)
library(plyr)
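# Convert the gaps to minutes explicitly; diff() otherwise picks its own units
ddply(df, .(clientId), transform, x2 = c(0, cumsum(as.numeric(diff(actual_time), units = "mins") >= 30)) + 1)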

     clientId         actual_time session x2
1         A 2016-11-01 00:00:00       1  1
2         A 2016-11-01 00:05:00       1  1
3         A 2016-11-01 00:35:01       2  2
4         A 2016-11-01 00:40:00      NA  2
5         A 2016-11-01 01:10:01      NA  3
6         B 2016-11-01 01:00:00       1  1
7         B 2016-11-01 01:05:00       1  1
8         B 2016-11-01 01:30:00       1  1
9         B 2016-11-01 01:40:00       1  1
10        B 2016-11-01 01:50:00      NA  1
11        C 2016-11-01 02:00:00      NA  1
12        C 2016-11-01 02:35:00      NA  2
13        C 2016-11-01 04:35:00      NA  3
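
For readers without plyr or data.table installed, the same cumulative-sum idea can be sketched in base R with ave(). This is only an illustration under assumptions: rows are sorted by time within each client, and `tz = "UTC"` and the name `secs` are my own choices, not part of the answers here.

```r
# Sample data from the question (tz = "UTC" is an assumption).
df <- data.frame(
  clientId = c(rep("A", 5), rep("B", 5), rep("C", 3)),
  actual_time = as.POSIXct(c(
    "2016-11-01 00:00:00", "2016-11-01 00:05:00", "2016-11-01 00:35:01",
    "2016-11-01 00:40:00", "2016-11-01 01:10:01",
    "2016-11-01 01:00:00", "2016-11-01 01:05:00", "2016-11-01 01:30:00",
    "2016-11-01 01:40:00", "2016-11-01 01:50:00",
    "2016-11-01 02:00:00", "2016-11-01 02:35:00", "2016-11-01 04:35:00"
  ), tz = "UTC")
)

# Seconds since the epoch, so gaps are plain numbers with known units.
secs <- as.numeric(df$actual_time)

# Within each client: flag gaps of 30 minutes (1800 s) or more, then count them.
df$x2 <- ave(secs, df$clientId,
             FUN = function(t) cumsum(c(0, diff(t)) >= 30 * 60) + 1)
```

Working in seconds avoids relying on the units that diff() chooses for a difftime.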

I suggest using the split function to break the data frame into a list of data frames, one per clientId, and then iterating over that list with lapply:

dat.split <- split(x = sample.data, f = as.factor(sample.data$clientId))
replace.nas <- lapply(dat.split, function(df) {
  # Fix the NA problem here, then return the fixed data frame
  df
})

dat.final <- do.call(rbind.data.frame, replace.nas)
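
One way the placeholder could be filled in is sketched below. The helper name `fill_sessions` is hypothetical, `sample.data` is rebuilt from the question's sample with `tz = "UTC"` as an assumption, and existing session values are trusted as they are. Only NA entries are filled.

```r
# Hypothetical helper: fill NA sessions for one client's rows.
fill_sessions <- function(d) {
  d <- d[order(d$actual_time), ]
  # TRUE where the gap to the previous row is 30 minutes or more.
  new_session <- c(FALSE, as.numeric(diff(d$actual_time), units = "mins") >= 30)
  for (i in seq_len(nrow(d))) {
    if (is.na(d$session[i])) {
      # First row of a client starts at 1; otherwise continue from the row above.
      d$session[i] <- if (i == 1) 1 else d$session[i - 1] + new_session[i]
    }
  }
  d
}

# Sample data from the question (tz = "UTC" is an assumption).
sample.data <- data.frame(
  clientId = c(rep("A", 5), rep("B", 5), rep("C", 3)),
  actual_time = as.POSIXct(c(
    "2016-11-01 00:00:00", "2016-11-01 00:05:00", "2016-11-01 00:35:01",
    "2016-11-01 00:40:00", "2016-11-01 01:10:01",
    "2016-11-01 01:00:00", "2016-11-01 01:05:00", "2016-11-01 01:30:00",
    "2016-11-01 01:40:00", "2016-11-01 01:50:00",
    "2016-11-01 02:00:00", "2016-11-01 02:35:00", "2016-11-01 04:35:00"
  ), tz = "UTC"),
  session = c(1, 1, 2, NA, NA, 1, 1, 1, 1, NA, NA, NA, NA)
)

dat.split <- split(sample.data, sample.data$clientId)
dat.final <- do.call(rbind, lapply(dat.split, fill_sessions))
rownames(dat.final) <- NULL  # drop the "A.1", "A.2", ... names added by rbind
```

Because each NA is filled from the row above it, a run of consecutive NAs is resolved in a single pass over the client's rows.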

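You should also try the "is.na" and "which" functions to find the row numbers of the NAs.

Thank you for the solution. Even though it does not handle the NA problem and recomputes the column directly, it does run fast. In any case, I will use this solution to recompute the session column when new data comes in.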
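
The is.na/which suggestion above might look like this on the question's session column. The vector is copied from the sample data, and `na_rows` is just an illustrative name.

```r
# The session column from the question's sample data.
session <- c(1, 1, 2, NA, NA, 1, 1, 1, 1, NA, NA, NA, NA)

# is.na() marks the missing entries; which() turns the marks into row numbers.
na_rows <- which(is.na(session))
na_rows
# [1]  4  5 10 11 12 13
```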