R: Add missing date values in a data frame with multiple observation periods


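Thanks in advance.

I am trying to add the missing date values that fall inside the observation period but are not in the data, for three different individuals.

My data look like this: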

 IndID       Date Event Number Percent
1   P01 2011-03-04     1      2   0.390
2   P01 2011-03-11     1      2   0.975
3   P01 2011-03-13     0      9   0.795
4   P01 2011-03-14     0     10   0.516
5   P01 2011-03-15     0      1   0.117
6   P01 2011-03-17     0      7   0.093
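IndID is the individual ID (P01, P03, P06). Date is obviously the date. Event is a binary variable indicating whether an event occurred (0 = no, 1 = yes). Number and Percent are not directly relevant but need to be retained, so they are included here.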

My example data frame (PostData) is included below, using dput.

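For each IndID, the first and last Date are the start and end of the observation period, and there are missing dates within that period. My goal here is to add the missing dates for each individual and put a 0 in the Event column. The other columns (Number, Percent) can be left empty.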

A related question has already been helpful, but it lacks information on my main problem: multiple individuals.

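The observation period for each individual runs from min(PostData$Date) to max(PostData$Date), taken over that individual's rows. I have been trying to create a complete sequence of dates for each individual and then merge it with the existing data frame in a for loop. There must be a better way.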

Any suggestions would be greatly appreciated.

PostData <-structure(list(IndID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
  3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
  5L, 5L), .Label = c("P01", "P02", "P03", "P05", "P06", "P07", 
  "P08", "P09", "P10", "P11", "P12", "P13"), class = "factor"), 
  Date = structure(c(1299196800, 1299801600, 1299974400, 1300060800, 
  1300147200, 1300320000, 1300406400, 1310083200, 1310169600, 
  1310515200, 1310774400, 1310947200, 1311033600, 1311292800, 
  1311552000, 1323129600, 1323388800, 1323648000, 1323993600, 
  1324080000, 1324166400, 1324339200, 1327622400, 1327795200, 
  1327881600), class = c("POSIXct", "POSIXt"), tzone = "GMT"), 
  Event = c(1L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 1L, 
  0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 0L), Number = c(2L, 
  2L, 9L, 10L, 1L, 7L, 5L, 9L, 1L, 4L, 5L, 2L, 0L, 1L, 10L, 
  5L, 0L, 6L, 5L, 10L, 9L, 4L, 4L, 8L, 1L), Percent = c(0.39, 
  0.975, 0.795, 0.516, 0.117, 0.093, 0.528, 0.659, 0.308, 0.055, 
  0.185, 0.761, 0.132, 0.676, 0.368, 0.383, 0.272, 0.113, 0.974, 
  0.696, 0.941, 0.751, 0.758, 0.29, 0.15)), .Names = c("IndID", 
  "Date", "Event", "Number", "Percent"), row.names = c(NA, 25L), 
  class = "data.frame")

Try this. It builds the rows for the missing dates with the correct ID and sets the remaining fields to 0.

library(data.table)
library(plyr)
dtPostData = data.table(PostData)
minmaxTab = dtPostData[,list(minDate=min(Date),maxDate=max(Date)),by=IndID]

df = lapply(1:nrow(minmaxTab),function(x) {
  temp = seq(minmaxTab$minDate[x],minmaxTab$maxDate[x],by=24*60*60) 
  temp = temp[!(temp %in% dtPostData[IndID == minmaxTab$IndID[x],]$Date)]
  data.table(IndID = minmaxTab$IndID[x], Date = temp, Event = 0, Number = 0, Percent = 0)
})

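# bind the per-individual tables into one data frame (the list is stored in df; x only exists inside the function)
df <- ldply(df, data.frame)
df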

#Results
   IndID       Date Event Number Percent
1    P01 2011-03-05     0      0       0
2    P01 2011-03-06     0      0       0
3    P01 2011-03-07     0      0       0
4    P01 2011-03-08     0      0       0
5    P01 2011-03-09     0      0       0
6    P01 2011-03-10     0      0       0
7    P01 2011-03-12     0      0       0
8    P01 2011-03-16     0      0       0
9    P03 2011-07-10     0      0       0
A base R version:

do.call(rbind,
  by(
    PostData,
    PostData$IndID,
    function(x) {
      out <- merge(
        data.frame(
          IndID=x$IndID[1],
          Date=seq.POSIXt(min(x$Date),max(x$Date),by="1 day")
        ),
        x,
        all.x=TRUE
      )
      out$Event[is.na(out$Event)] <- 0
      out
    }  
  )
)

Calculate the minimum and maximum time (in seconds since the epoch):
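The code for this step is not shown. A minimal sketch of what it might look like, assuming that min_time and max_time (the names used in the next snippet) are meant to hold plain numeric seconds. The `dates` vector is only a stand-in for PostData$Date:

```r
# Stand-in for PostData$Date: the first, second and last P01 observations from the question
dates <- as.POSIXct(c("2011-03-04", "2011-03-11", "2011-03-18"), tz = "GMT")

# as.numeric() on a POSIXct vector gives seconds since 1970-01-01 00:00:00 UTC
min_time <- min(as.numeric(dates))  # with the real data: min(as.numeric(PostData$Date))
max_time <- max(as.numeric(dates))  # with the real data: max(as.numeric(PostData$Date))
```

Working with plain numbers here is what lets the next step advance in steps of 86400 seconds.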

Use seq to generate the full list of dates in that range:

list_of_dates = seq(min_time, max_time, 86400) # one step per day, since there are 86400 seconds in a day
# convert the epoch seconds back to dates (origin is an argument of as.POSIXct, not of as.Date)
list_of_dates = as.Date(as.POSIXct(list_of_dates, origin = '1970-01-01', tz = 'UTC'))
Build the list of missing IndID and Date combinations:

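temp = merge(unique(PostData$IndID), list_of_dates) # cross join: every IndID paired with every date
names(temp) = c("IndID","Date")
# keep only the IndID/Date pairs that do not already occur together in PostData
existing_pairs = paste(PostData$IndID, as.Date(PostData$Date))
data_missing_indID_date = temp[!(paste(temp$IndID, temp$Date) %in% existing_pairs), ]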
Build the remaining columns:

data_missing_indID_date$Event = 0 
data_missing_indID_date$Number = NA
data_missing_indID_date$Percent = NA
Use rbind to bind it to the original data frame:

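PostData$Date = as.Date(PostData$Date) # POSIXct -> Date, so both data frames use the same date class
final_data = rbind(PostData, data_missing_indID_date)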
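
One caveat: the steps above use a single date range for the whole data frame, so every individual receives every date between the overall minimum and maximum. If the range should instead be each individual's own first and last date, as the question describes, the same idea can be applied per IndID. A minimal base R sketch on a small stand-in data frame (demo, fill_dates and final_demo are hypothetical names, not part of the answer):

```r
# Small stand-in for PostData: two individuals, each with gaps in their dates
demo <- data.frame(
  IndID  = c("A", "A", "B", "B"),
  Date   = as.Date(c("2011-03-04", "2011-03-07", "2011-07-08", "2011-07-10")),
  Event  = c(1L, 0L, 1L, 1L),
  Number = c(2L, 9L, 4L, 5L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: build the rows for the dates missing from one individual's period
fill_dates <- function(x) {
  all_days <- seq(min(x$Date), max(x$Date), by = "day")
  missing_days <- all_days[!(all_days %in% x$Date)]
  data.frame(
    IndID  = rep(x$IndID[1], length(missing_days)),
    Date   = missing_days,
    Event  = rep(0L, length(missing_days)),           # no event on added days
    Number = rep(NA_integer_, length(missing_days)),  # left empty
    stringsAsFactors = FALSE
  )
}

# Build the missing rows per individual, then bind them to the original rows
missing_rows <- do.call(rbind, lapply(split(demo, demo$IndID), fill_dates))
final_demo <- rbind(demo, missing_rows)
final_demo <- final_demo[order(final_demo$IndID, final_demo$Date), ]
rownames(final_demo) <- NULL
```

Here individual A gains two rows (2011-03-05 and 2011-03-06) and individual B gains one (2011-07-09); no dates outside each individual's own period are added.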

Here is a dplyr solution. Based on the sample data, the result is a data.frame with 89 rows, which I hope is the result you are looking for.

require(dplyr)

PostData %>%
  mutate(Date = as.Date(as.character(Date))) %>%
  group_by(IndID) %>%
  do(left_join(data.frame(IndID = .$IndID[1], Date = seq(min(.$Date), max(.$Date), 1)), ., 
                       by=c("IndID", "Date"))) %>%
  mutate(Event = ifelse(is.na(Event), 0, Event))

#   IndID       Date Event Number Percent
#1    P01 2011-03-04     1      2   0.390
#2    P01 2011-03-05     0     NA      NA
#3    P01 2011-03-06     0     NA      NA
#4    P01 2011-03-07     0     NA      NA
#5    P01 2011-03-08     0     NA      NA
#6    P01 2011-03-09     0     NA      NA 
#7    P01 2011-03-10     0     NA      NA
#8    P01 2011-03-11     1      2   0.975
#...
#84   P06 2012-01-25     0     NA      NA
#85   P06 2012-01-26     0     NA      NA
#86   P06 2012-01-27     1      4   0.758
#87   P06 2012-01-28     0     NA      NA
#88   P06 2012-01-29     0      8   0.290
#89   P06 2012-01-30     0      1   0.150

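Thanks for the very helpful solution! The result is very useful. What does the . in do(left_join(data.frame(IndID = .$IndID[1], Date = seq(min(.$Date), max(.$Date), 1)), ., by=c("IndID", "Date"))) do? Also, why choose .$IndID[1]?

In dplyr, the do operator is used to apply arbitrary functions to grouped data. You can learn more by typing ?do at the console. I used .$IndID[1] so that the ID variable is present in the new data frame created for each group. I selected only the first occurrence of each ID so that it is recycled to the length of the date sequence.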
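
The recycling mentioned in the reply can be seen in isolation with base R. A small sketch (the names days and block are only for illustration):

```r
# A length-1 ID next to a 3-element date sequence: data.frame() recycles the ID
days <- seq(as.Date("2011-03-04"), as.Date("2011-03-06"), by = "day")
block <- data.frame(IndID = "P01", Date = days)
block
#   IndID       Date
# 1   P01 2011-03-04
# 2   P01 2011-03-05
# 3   P01 2011-03-06
```

Passing the whole .$IndID column instead would fail whenever the group has fewer rows than the full date sequence, because the column lengths would not match.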