R: creating a new variable that takes prior information from earlier records into account (r, dplyr, data.table, tidyr)


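I have data like the following, and I want to create a new variable that takes into account the information from previous periods. For example: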

moviewatched<- c('Comedy', 'Horror', 'Comedy', 'Horror', 'Drama', 'Comedy', 'Drama')
name<- c('john', 'john', 'john', 'john', 'john','kate','kate')
time<- c('1-2018', '1-2018', '1-2018', '2-2018', '2-2018','1-2018' ,'2-2018')


df<- data.frame(moviewatched, name, time)

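A solution using dplyr. We can remove duplicated rows based on moviewatched and name, so that only the first time each person watched each genre is kept, then count the unique moviewatched values for each name and time, and use cumsum to compute the running total. df2 is the final output.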

library(dplyr)

df2 <- df %>%
  distinct(moviewatched, name, .keep_all = TRUE) %>%
  group_by(name, time) %>%
  summarise(movietypewatched = n_distinct(moviewatched)) %>%
  mutate(movietypewatched = cumsum(movietypewatched)) %>%
  ungroup()
df2
# # A tibble: 4 x 3
#   name  time   movietypewatched
#   <fct> <fct>             <int>
# 1 john  1-2018                2
# 2 john  2-2018                3
# 3 kate  1-2018                1
# 4 kate  2-2018                2
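The distinct() step keeps the first row it meets for each name and genre, so the result above relies on the rows already being in chronological order. Below is a minimal sketch of the same idea that parses the period and sorts first. The names shuffled, month, new_types and df2_sorted are illustrative, not from the original answer, and the .groups argument assumes dplyr 1.0 or later.

```r
library(dplyr)

df <- data.frame(
  moviewatched = c("Comedy", "Horror", "Comedy", "Horror", "Drama", "Comedy", "Drama"),
  name = c("john", "john", "john", "john", "john", "kate", "kate"),
  time = c("1-2018", "1-2018", "1-2018", "2-2018", "2-2018", "1-2018", "2-2018"),
  stringsAsFactors = FALSE
)

# Shuffle the rows to show that the result does not depend on input order
shuffled <- df[c(5, 7, 4, 1, 6, 3, 2), ]

df2_sorted <- shuffled %>%
  # Parse "m-YYYY" into a real Date (first day of the month) so sorting is chronological
  mutate(month = as.Date(paste0("01-", time), format = "%d-%m-%Y")) %>%
  arrange(name, month) %>%
  # Keep only the first time each person watched each genre
  distinct(name, moviewatched, .keep_all = TRUE) %>%
  group_by(name, month) %>%
  # Number of genres first seen in this month
  summarise(new_types = n(), .groups = "drop_last") %>%
  # Running total within each name
  mutate(movietypewatched = cumsum(new_types)) %>%
  ungroup()

df2_sorted
```

As in the original answer, a month in which a person watches no new genre produces no row at all, which may or may not be what you want.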

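First convert the time data to a Date class to establish an ordering, for example using lubridate::myd with truncated = 1. From there, arrange the rows to make sure they are in order; then, grouped by name, use purrr::accumulate to build a list of the unique values seen so far in moviewatched. Calling lengths on that list returns the number of movie types seen up to that point. Aggregate by month with max to get the cumulative number of types for each month.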

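library(tidyverse)

df %>%
  mutate(time = lubridate::myd(time, truncated = 1)) %>%
  group_by(name) %>%
  arrange(name, time) %>%
  mutate(n_types = lengths(accumulate(moviewatched, ~unique(c(...))))) %>%
  group_by(name, time) %>%
  summarise(n_types = max(n_types))
#> # A tibble: 4 x 3
#> # Groups:   name [2]
#>   name  time       n_types
#> 1 john  2018-01-01       2
#> 2 john  2018-02-01       3
#> 3 kate  2018-01-01       1
#> 4 kate  2018-02-01       2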
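To see what the accumulate step does on its own, here is a small sketch on a plain character vector. The vector genres and the names seen and n_seen are made up for illustration.

```r
library(purrr)

genres <- c("Comedy", "Horror", "Comedy", "Drama")

# At each position, keep the unique genres seen up to and including that position
seen <- accumulate(genres, ~ unique(c(.x, .y)))

# Number of distinct genres seen so far at each position
n_seen <- lengths(seen)
n_seen
# 1 2 2 3
```

The third value stays at 2 because the second "Comedy" adds nothing new.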

Using data.table:

library(data.table)
df <- unique(df) 
setDT(df)[, movietypewatched := 1:.N, by = c("moviewatched", "name")] 
df <- df[movietypewatched == 1, ]  # keep only the first time each name watched each genre (also correct with more than two periods)
df[, movietypewatched := .N, by = c("name", "time")][, moviewatched := NULL]
df <- unique(df)
df[, movietypewatched := cumsum(movietypewatched), by = name]

   name   time movietypewatched
1: john 1-2018                2
2: john 2-2018                3
3: kate 1-2018                1
4: kate 2-2018                2
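A more compact variant of the same idea is sketched below, using unique() with its by argument to keep the first row for each name and genre. It assumes the rows are already in chronological order, as in the example data. The names dt, first_seen, res and new_types are illustrative.

```r
library(data.table)

dt <- data.table(
  moviewatched = c("Comedy", "Horror", "Comedy", "Horror", "Drama", "Comedy", "Drama"),
  name = c("john", "john", "john", "john", "john", "kate", "kate"),
  time = c("1-2018", "1-2018", "1-2018", "2-2018", "2-2018", "1-2018", "2-2018")
)

# Keep the first row for each (name, genre) pair; assumes rows are in time order
first_seen <- unique(dt, by = c("name", "moviewatched"))

# Count the genres first seen in each period
res <- first_seen[, .(new_types = .N), by = .(name, time)]

# Running total within each name
res[, movietypewatched := cumsum(new_types), by = name]
res
```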

Make a table of the date each genre was first watched; count by month; and take the cumulative sum:
library(data.table)
setDT(df)

# fix bad date
df[, d := as.IDate(paste(time, "01", sep="-"), "%m-%Y-%d")]

# identify month first watched
fw = df[, .(d = min(d)), by=.(name, moviewatched)]

# count new movies per month
nm = fw[, .N, keyby=.(name, d)]

# take cumulative count
nm[, cN := cumsum(N), by=name]

   name          d N cN
1: john 2018-01-01 2  2
2: john 2018-02-01 1  3
3: kate 2018-01-01 1  1
4: kate 2018-02-01 1  2

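You need to convert the dates; otherwise min() will give incorrect results and/or fail.
There are two aggregation steps here, but thanks to the optimizations in data.table the code should be fast (see ?GForce).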
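A small sketch of why the conversion matters: on the raw "m-YYYY" strings, min() compares text rather than dates. The vector times is made up for illustration.

```r
times <- c("2-2018", "11-2018")

# Text comparison: "11-2018" sorts before "2-2018" because "1" < "2"
min(times)
# "11-2018"

# After converting to Date, min() returns the earliest month
dates <- as.Date(paste0("01-", times), format = "%d-%m-%Y")
min(dates)
# "2018-02-01"
```

If the column is a factor, which was the data.frame() default for strings before R 4.0, min() stops with an error instead of returning a wrong value.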
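If you want the unique genre values as well as the genre count, you can get them from an intermediate step.
Please note:

You need to arrange the data frame by name and time in order to accumulate values.
You can use lag() to get the previous value. Since the first entry for each name has no previous value, it gives NA.
When counting unique genres with n_distinct(), you need to remove the NAs.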
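The last two notes can be checked on a plain vector. This is only an illustration; x and the other variable names are made up.

```r
library(dplyr)

x <- c("Comedy", "Horror", "Comedy")

# lag() shifts values down by one; the first element has no predecessor
prev <- lag(x)
prev
# NA "Comedy" "Horror"

# The NA counts as a distinct value unless it is removed
with_na <- n_distinct(c(x, NA))
without_na <- n_distinct(c(x, NA), na.rm = TRUE)
with_na      # 3
without_na   # 2
```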

library(dplyr)
library(purrr)
library(tidyr)

df_final <- df %>% 
  arrange(name, time) %>% 
  group_by(name, time) %>%
  nest(.key = 'genre') %>% 
  group_by(name) %>% 
  mutate(genre_all = map2(genre, lag(genre), rbind) %>% map(unique)) %>% 
  ungroup() %>% 
  mutate(genre_count = map_int(genre_all, ~ lift(n_distinct)(.x, na.rm = TRUE)))

Result:
> df_final
# A tibble: 4 x 5
  name  time   genre            genre_all        genre_count
  <fct> <fct>  <list>           <list>                 <int>
1 john  1-2018 <tibble [3 x 1]> <tibble [3 x 1]>           2
2 john  2-2018 <tibble [2 x 1]> <tibble [3 x 1]>           3
3 kate  1-2018 <tibble [1 x 1]> <tibble [2 x 1]>           1
4 kate  2-2018 <tibble [1 x 1]> <tibble [2 x 1]>           2

Wow. I don't understand how distinct works across the whole grouping. Is it because of summarise? And what is the equivalent in data.table? I couldn't do that, although I know it is the way to go. For example, df[, uniqueN(moviewatched), by=.(time, name)] will not work, because uniqueN is evaluated within each group.

@www Bravo. Got the data.table solution. I also didn't know about duplicated. I'll delete my bad answer. Thanks again.

FYI, the following syntax is supported: duplicated(DT, by=c("col1", "col2")), though in this case you should use unique(DT, by=c("col1", "col2")), I think.

@Frank Good to know about duplicated(DT, by=c("col1", "col2")). Thanks.

How can I get the actual number of new movie types watched? For example, 2,1,1,1, showing only the new movie types he/she watched. Many thanks.

Maybe df %>% group_by(name) %>% arrange(name, time) %>% mutate(new = c(1, diff(lengths(accumulate(moviewatched, ~unique(c(...))))))) %>% group_by(name, time) %>% summarise(types = sum(new))
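For the follow-up question about counting only the genres that are new in each period (2, 1, 1, 1 for the example data), here is a minimal sketch that takes the difference of the running count of distinct genres. The column names seen, new and types and the object new_types are illustrative, and the .groups argument assumes dplyr 1.0 or later.

```r
library(dplyr)
library(purrr)

df <- data.frame(
  moviewatched = c("Comedy", "Horror", "Comedy", "Horror", "Drama", "Comedy", "Drama"),
  name = c("john", "john", "john", "john", "john", "kate", "kate"),
  time = c("1-2018", "1-2018", "1-2018", "2-2018", "2-2018", "1-2018", "2-2018"),
  stringsAsFactors = FALSE
)

new_types <- df %>%
  # Parse "m-YYYY" into a Date so that rows sort chronologically
  mutate(time = as.Date(paste0("01-", time), format = "%d-%m-%Y")) %>%
  arrange(name, time) %>%
  group_by(name) %>%
  mutate(
    # Running count of distinct genres seen so far by this person
    seen = lengths(accumulate(moviewatched, ~ unique(c(.x, .y)))),
    # 1 where a row introduces a new genre, 0 otherwise
    new = c(1L, diff(seen))
  ) %>%
  group_by(name, time) %>%
  summarise(types = sum(new), .groups = "drop")

new_types$types
# 2 1 1 1
```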