R programming: splitting a set of ID-indexed time series (with irregular observation periods) into regular monthly observations

I have a set of data, held in a data.frame in R, on the amount of something used by users, each of whom has a unique ID:

ID        start date         end date        amount
1         1-15-2012          2-15-2012       6000
1         2-15-2012          3-25-2012       4000
1         3-25-2012          5-26-2012       3000
1         5-26-2012          6-13-2012       1000
2         1-16-2012          2-27-2012       7000
2         2-27-2012          3-18-2012       2000
2         3-18-2012          5-23-2012       3000
 ....
10000     1-12-2012          2-24-2012       12000
10000     2-24-2012          3-11-2012       22000
10000     3-11-2012          5-27-2012       33000
10000     5-27-2012          6-10-2012       5000

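Each ID's time series starts and ends at inconsistent times and contains an inconsistent number of observations. However, they are all formatted as above; the start date and end date are Date objects.

I would like to normalize each ID's breakdown into a monthly time series, with a data point at the start of each month, weighting the amounts observed across two or more months accordingly. In other words, I want to turn this series into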

ID        start date         end date        amount
1         1-1-2012          2-1-2012       3096 = 6000 * 16/31
1         2-1-2012          3-1-2012       4339 = 6000*15/31+4000*14/39
1         3-1-2012          4-1-2012       etc
 ....
1         6-1-2012          7-1-2012       etc
2         1-1-2012          2-1-2012       etc
2         2-1-2012          3-1-2012       etc
2         3-1-2012          4-1-2012       etc
2         4-1-2012          5-1-2012       etc
2         5-1-2012          6-1-2012       etc
 ....
10000     1-1-2012          2-1-2012       etc
 ....
10000     6-1-2012          7-1-2012       etc

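Here the value for ID 1 between 2-1-2012 and 3-1-2012 is computed as follows. Take the number of days of the 1-15-2012 to 2-15-2012 observation that land in February (15 days out of 31) and multiply by that observation's amount (6000). Then take the number of days of the 2-15-2012 to 3-25-2012 observation that land in February (14 days out of 39, because 2012 is a leap year) and multiply by that observation's amount (4000). This gives 6000 × 15/31 + 4000 × 14/39 = 4339. This should be done for each ID's time series. We can ignore the case where an observation period fits entirely within one month; but if a period spreads over more than two months, it should be split across those months with the appropriate weights.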
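The February figure can be checked directly in R. This is only a sketch of the arithmetic described above; the variable names are illustrative and not part of the data set.

```r
# Length of each observation period in days
obs1_days <- as.numeric(as.Date("2012-02-15") - as.Date("2012-01-15"))  # 31
obs2_days <- as.numeric(as.Date("2012-03-25") - as.Date("2012-02-15"))  # 39 (2012 is a leap year)

# 15 days of the first period and 14 days of the second land in February
feb_amount <- 6000 * 15 / obs1_days + 4000 * 14 / obs2_days
round(feb_amount)  # 4339
```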

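I'm fairly new to R and could definitely use some help!

To solve this, I think the simplest approach is to break it into two problems:

How can I get a daily breakdown of the number I'm interested in? (This is my assumption, based on the information you provided above.)

How do I group by date range and summarise what I'm interested in?

For the examples below, I'll use a dataset created with the following code: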

df <- data.frame(
  id=c(1,1,1,1,2,2,2),
  start_date=as.Date(c("1-15-2012",
                       "2-15-2012",
                       "3-25-2012",
                       "5-26-2012",
                       "1-16-2012",
                       "2-27-2012",
                       "3-18-2012"), "%m-%d-%Y"),
  end_date=as.Date(c("2-15-2012",
                     "3-25-2012",
                     "5-26-2012",
                     "6-13-2012",
                     "2-27-2012",
                     "3-18-2012",
                     "5-23-2012"), "%m-%d-%Y"),
  amount=c(6000,
           4000,
           3000,
           1000,
           7000,
           2000,
           3000)  
  )

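Next, we expand each date range using the start and end dates. There are other ways to do this, but seeing that you used the dplyr tag, let's apply the dplyr way:

library(dplyr)
# Assumption: daily_contribution = amount / days in the period, which reproduces the output shown
df <- df %>%
  mutate(daily_contribution = amount / as.numeric(end_date - start_date)) %>%
  rowwise() %>%
  do(data.frame(id=.$id, 
                date=seq(from=.$start_date, to=.$end_date, by="day"), 
                daily_contribution=.$daily_contribution)) %>%
  ungroup() %>%
  # Assumption: mnth and yr (used for grouping later) are derived from the expanded date
  mutate(mnth = as.numeric(format(date, "%m")),
         yr = as.numeric(format(date, "%Y")))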

With all of that in place, we can easily use dplyr to summarise our information by the required dates:

df <- df %>% 
  group_by(id, mnth, yr) %>%
  summarise(amount=sum(daily_contribution))
df

Output:

Source: local data frame [11 x 4]
Groups: id, mnth

   id mnth   yr    amount
1   1    1 2012 3290.3226
2   1    2 2012 4441.6873
3   1    3 2012 2902.8122
4   1    4 2012 1451.6129
5   1    5 2012 1591.3978
6   1    6 2012  722.2222
7   2    1 2012 2666.6667
8   2    2 2012 4800.0000
9   2    3 2012 2436.3636
10  2    4 2012 1363.6364
11  2    5 2012 1045.4545

To get it in exactly the format you specified, do the following:

df %>% rowwise() %>%
  mutate(start_date=as.Date(ISOdate(yr, mnth, 1)),
         end_date=as.Date(ISOdate(yr, mnth+1, 1))) %>%
  select(id, start_date, end_date, amount)

Output:

Source: local data frame [11 x 4]
Groups: <by row>

   id start_date   end_date    amount
1   1 2012-01-01 2012-02-01 3290.3226
2   1 2012-02-01 2012-03-01 4441.6873
3   1 2012-03-01 2012-04-01 2902.8122
4   1 2012-04-01 2012-05-01 1451.6129
5   1 2012-05-01 2012-06-01 1591.3978
6   1 2012-06-01 2012-07-01  722.2222
7   2 2012-01-01 2012-02-01 2666.6667
8   2 2012-02-01 2012-03-01 4800.0000
9   2 2012-03-01 2012-04-01 2436.3636
10  2 2012-04-01 2012-05-01 1363.6364
11  2 2012-05-01 2012-06-01 1045.4545

as required.
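One caveat, offered as a sketch and not as a correction to the answer: `ISOdate(yr, mnth+1, 1)` has no month 13, so a December row would end up with a missing `end_date`. Rolling the year over by hand avoids this. The `yr` and `mnth` vectors below are made-up values for illustration only.

```r
# Made-up example values: one June row and one December row
yr   <- c(2012, 2012)
mnth <- c(6, 12)

# Naive version: month 13 does not exist, so the December row is NA
naive_end <- as.Date(ISOdate(yr, mnth + 1, 1))

# Roll over to January of the following year when mnth is 12
next_yr   <- yr + (mnth == 12)
next_mnth <- mnth %% 12 + 1
end_date  <- as.Date(ISOdate(next_yr, next_mnth, 1))
end_date
```

The sample data here stops in June, so the rollover only matters once a series reaches December.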

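Note: I can see from your example that you have 3096 = 6000*16/31 and 4339 = 6000*15/31 + 4000*14/39, but for the first one the range is January 15 to January 31, which is 17 days if the date range is inclusive. You can easily change this if needed.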
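One way to make that change, sketched in base R. It assumes each period covers the days after `start_date` up to and including `end_date`, which is the convention that reproduces 16/31 and 15/31. The names `n`, `row`, `day`, `daily` and `monthly` are illustrative. Because the expansion uses `rep()` and `sequence()` instead of a per-row call, it may also scale better to large inputs, though that is untested here.

```r
# First two rows of the question's data for ID 1
df <- data.frame(
  id = c(1, 1),
  start_date = as.Date(c("2012-01-15", "2012-02-15")),
  end_date   = as.Date(c("2012-02-15", "2012-03-25")),
  amount = c(6000, 4000)
)

n   <- as.integer(df$end_date - df$start_date)  # days in each period
row <- rep(seq_len(nrow(df)), times = n)        # source row of each day
day <- df$start_date[row] + sequence(n)         # days start_date + 1 .. end_date

daily <- data.frame(
  id = df$id[row],
  month = format(day, "%Y-%m"),
  amount = (df$amount / n)[row]                 # equal share per day
)

# Add up the daily shares per ID and calendar month
monthly <- aggregate(amount ~ id + month, data = daily, FUN = sum)
monthly
```

With this convention the monthly amounts add back up to the original total, since each period contributes exactly `n` days.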

Here is a solution using base R:

#The data
df=read.table(text='ID        start_date         end_date        amount
1         1-15-2012          2-15-2012       6000
1         2-15-2012          3-25-2012       4000
1         3-25-2012          5-26-2012       3000
1         5-26-2012          6-13-2012       1000
2         1-16-2012          2-27-2012       7000
2         2-27-2012          3-18-2012       2000
2         3-18-2012          5-23-2012       3000
10000     1-12-2012          2-24-2012       12000
10000     2-24-2012          3-11-2012       22000
10000     3-11-2012          5-27-2012       33000
10000     5-27-2012          6-10-2012       5000',
              header=T,row.names = NULL,stringsAsFactors =FALSE)

df[,2]=as.Date(df[,2],"%m-%d-%Y")
df[,3]=as.Date(df[,3],"%m-%d-%Y")

df1=data.frame(n=1:length(df$ID),ID=df$ID)
df1$startm=as.Date(levels(cut(df[,2],"month"))[cut(df[,2],"month")],"%Y-%m-%d")
df1$endm=as.Date(levels(cut(df[,3],"month"))[cut(df[,3],"month")],"%Y-%m-%d")
df1=df1[,-1]
#compute days in month and total days
df$dayin=as.numeric((df1$endm-1)-df$start_date)
df$daytot=as.numeric(df$end_date-df$start_date)
#separate amount this month and next month
df$ammt=df$amount*df$dayin/df$daytot
df$ammt.1=df$amount*(df$daytot-df$dayin)/df$daytot

#using by compute new amount
df1$amount=do.call(c,
  by(df[,c("ammt","ammt.1")],df$ID,function(d)d[,1]+c(0,d[-nrow(d),2]))
        )
df1

> df1
      ID     startm       endm    amount
1      1 2012-01-01 2012-02-01  3096.774
2      1 2012-02-01 2012-03-01  4339.123
3      1 2012-03-01 2012-05-01  4306.038
4      1 2012-05-01 2012-06-01  1535.842
5      2 2012-01-01 2012-02-01  2500.000
6      2 2012-02-01 2012-03-01  4700.000
7      2 2012-03-01 2012-05-01  3754.545
8  10000 2012-01-01 2012-02-01  5302.326
9  10000 2012-02-01 2012-03-01 13572.674
10 10000 2012-03-01 2012-05-01 36553.571
11 10000 2012-05-01 2012-06-01 13000.000
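The output above contains rows running from 2012-03-01 to 2012-05-01, because the code splits each observation into at most two pieces. A possible extension, sketched here under the same day-counting convention (days after `start_date` up to and including `end_date`), is to create one piece per calendar month touched by an observation and compute the overlap in days. The helper `month_index` and all variable names are hypothetical.

```r
# Two of the question's rows; the second spans March, April and May
df <- data.frame(
  ID = c(1, 1),
  start_date = as.Date(c("2012-02-15", "2012-03-25")),
  end_date   = as.Date(c("2012-03-25", "2012-05-26")),
  amount = c(4000, 3000)
)

# Months since 1900, so that consecutive months differ by 1
month_index <- function(d) {
  lt <- as.POSIXlt(d)
  lt$year * 12 + lt$mon
}

n_days <- as.numeric(df$end_date - df$start_date)
first  <- month_index(df$start_date + 1)   # month of the first counted day
last   <- month_index(df$end_date)         # month of the last counted day
k      <- last - first + 1                 # months touched by each row

row <- rep(seq_len(nrow(df)), times = k)   # source row of each piece
m   <- first[row] + sequence(k) - 1        # month index of each piece
m_start <- as.Date(ISOdate(1900 + m %/% 12, m %% 12 + 1, 1))
m_next  <- as.Date(ISOdate(1900 + (m + 1) %/% 12, (m + 1) %% 12 + 1, 1))

# Days of (start_date, end_date] that fall inside [m_start, m_next)
days_in <- as.numeric(pmin(df$end_date[row], m_next - 1) -
                      pmax(df$start_date[row], m_start - 1))

pieces <- data.frame(
  ID = df$ID[row],
  month = format(m_start, "%Y-%m"),
  amount = df$amount[row] * days_in / n_days[row]
)
monthly <- aggregate(amount ~ ID + month, data = pieces, FUN = sum)
monthly
```

This creates one row per observation and month, not one per day, which keeps the intermediate table small when periods are long.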

Here is a solution using plyr and reshape. The numbers are not the same as the ones you provided, so I may have misunderstood your intent, although this seems to meet your stated goal (a weighted average of the amounts by month).

#Assumes df has columns ID, start.date, end.date (as m/d/Y text) and amount
df$index <- 1:nrow(df) #Create a unique index number

#Format the dates from factors to dates
df$start.date <- as.Date(df$start.date, format="%m/%d/%Y")
df$end.date <- as.Date(df$end.date, format="%m/%d/%Y")

library(plyr); library(reshape)  #Load the libraries

#dlply = (d)ataframe to (l)ist using (ply)r
#Subset on dataframe by "index" and perform a function on each subset called "X"
#Create a list containing:
#    ID, each day from start to end date, amount recorded over that day
df2 <- dlply(df, .(index), function(X) { 
  ID <- X$ID        #Keep the ID value
  n.days <- as.numeric(difftime( X$end.date, X$start.date, units="days" ))  #Calculate time difference in days, report the result as a number
  day <- seq(X$start.date, X$end.date, by="days")   #Sequence of days
  amount.per.day <- X$amount/n.days      #Amount for that day
  data.frame(ID, day, amount.per.day)    #Last line is the output
})

#Change list back into data.frame
df3 <- ldply(df2, data.frame)   #ldply = (l)ist to (d)ataframe using (ply)r
df3$mon <-  as.numeric(format(df3$day, "%m"))   #Assign a month to all dates

#Summarize by each ID and month: add up the daily amounts
ddply(df3, .(ID, mon), summarise, amount = sum(amount.per.day))

#       ID mon    amount
#    1   1   1 3290.3226
#    2   1   2 4441.6873
#    3   1   3 2902.8122
#    4   1   4 1451.6129
#    5   1   5 1591.3978
#    6   1   6  722.2222
#    7   2   1 2666.6667
#    8   2   2 4800.0000
#    9   2   3 2436.3636
#    10  2   4 1363.6364
#    11  2   5 1045.4545

Comments:

Thanks! This gets me most of the way there! However, it doesn't break things down by each month (i.e., we get 2012-03-01 2012-05-01 rather than March-April and April-May). Those rows would need to be split, with the amounts divided accordingly! Does the same apply to intervals spanning more than one month (e.g., 3-25 to 5-26, which includes parts of March, April and May)?

In that case, follow an approach that computes by day and then accumulates by month.

This is very simple and exactly what I was going to do. Unfortunately, my dataset is quite large (2 million data points), so I need to find an efficient way! (R says this code would take 106 days to run!)

Unfortunately, this takes two days. My dataset is very large (2-3 million observations), so it needs to be more efficient; I'm running this code and it still hasn't finished after more than 20 minutes.

You can try and/or.