R: adding a dummy variable with multiple conditions
I have a data frame named "insider_dataset" that looks like this:
personid cusip6 acqdisp trandate month year
<dbl> <chr> <chr> <date> <dbl> <dbl>
1 13080542 143436 D 2000-01-03 1 2000
2 12260711 143436 A 2002-01-07 1 2002
3 12700206 143436 D 2010-10-03 10 2010
4 7161 382388 A 2011-09-03 9 2011
5 7161 382388 A 2012-09-08 9 2012
6 7161 382388 A 2013-09-03 9 2013
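My goal is to add a dummy variable named "routine_dummy" that equals 1 if the same personid appears with the same acqdisp and the same month in the two years preceding trandate, and 0 otherwise. In this example, the dummy equals 1 only in row 6. Here is what I tried: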
while (i <= nrow(insider_dataset)) {
if (nrow(subset(insider_dataset, personid == insider_dataset$personid[i]
& month == month[i] & year == year[i]-1 | year == year[i]-2 & acqdisp ==
acqdisp[i])) > 1) {
insider_dataset$routine_dummy[i] <- 1
}
else insider_dataset$routine_dummy[i] <- 0
i <- i+1
}
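Loops are not efficient in R, which is why you should opt for vectorized operations (using data.table if you really want some speed). Also, if you want to check that the person appears in both of the previous two years, change any => all.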
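To make the any versus all distinction concrete, here is a minimal base R sketch on a made-up vector of years. The vector `years` and the helper `seen_before` are illustrative names and values, not part of the original answer.

```r
# Hypothetical years in which one person traded in the same month
# with the same acqdisp (illustrative values only).
years <- c(2008, 2010, 2011, 2012)

# For each year, test whether the two preceding years appear in `years`.
# `how` is either any() or all().
seen_before <- function(years, how) {
  vapply(years, function(y) how((y - 1:2) %in% years), logical(1))
}

+seen_before(years, any)  # 0 1 1 1: at least one of the two previous years
+seen_before(years, all)  # 0 0 0 1: both of the two previous years
```

With any, 2010 is flagged because 2008 is present; with all, only 2012 is flagged because both 2010 and 2011 are present.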
tidyverse
library(tidyverse)
dt %>%
  group_by(personid, acqdisp, month) %>%
  mutate(routine_dummy = +sapply(year[row_number()], function(x) any((x - 1:2) %in% year)))
# A tibble: 7 x 7
# Groups:   personid, acqdisp, month [4]
  personid cusip6 acqdisp trandate   month routine_dummy  year
1 13080542 143436 D       2000-01-03     1             0  2000
2 12260711 143436 A       2002-01-07     1             0  2002
3 12700206 143436 D       2010-10-03    10             1  2010
4     7161 382388 A       2011-09-03     9             0  2011
5     7161 382388 A       2012-09-08     9             1  2012
6     7161 382388 A       2013-09-03     9             1  2013
7 12700206 143436 D       2008-10-03    10             0  2008
data.table
library(data.table)
setDT(dt)
dt[,
   routine_dummy := sapply(1:.N, function(i) +any((year[i] - 1:2) %in% year)),
   by = .(personid, acqdisp, month)
   ][]
   personid cusip6 acqdisp   trandate month routine_dummy year
1: 13080542 143436       D 2000-01-03     1             0 2000
2: 12260711 143436       A 2002-01-07     1             0 2002
3: 12700206 143436       D 2010-10-03    10             1 2010
4:     7161 382388       A 2011-09-03     9             0 2011
5:     7161 382388       A 2012-09-08     9             1 2012
6:     7161 382388       A 2013-09-03     9             1 2013
7: 12700206 143436       D 2008-10-03    10             0 2008
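For readers without either package, roughly the same grouping logic can be sketched in base R with ave(). This is only an illustration under assumptions: the data frame below is rebuilt by hand from the six rows in the question plus the seventh row that appears in the output above, and the choice of ave() is mine rather than the answerer's.

```r
# Example data: rows 1-6 from the question, row 7 from the answer's output.
dt <- data.frame(
  personid = c(13080542, 12260711, 12700206, 7161, 7161, 7161, 12700206),
  cusip6   = c("143436", "143436", "143436", "382388", "382388", "382388", "143436"),
  acqdisp  = c("D", "A", "D", "A", "A", "A", "D"),
  trandate = as.Date(c("2000-01-03", "2002-01-07", "2010-10-03", "2011-09-03",
                       "2012-09-08", "2013-09-03", "2008-10-03")),
  month    = c(1, 1, 10, 9, 9, 9, 10),
  year     = c(2000, 2002, 2010, 2011, 2012, 2013, 2008),
  stringsAsFactors = FALSE
)

# Within each (personid, acqdisp, month) group, flag rows whose year has
# at least one of the two preceding years present in the same group.
dt$routine_dummy <- ave(
  dt$year, dt$personid, dt$acqdisp, dt$month,
  FUN = function(y) +vapply(y, function(x) any((x - 1:2) %in% y), logical(1))
)

dt$routine_dummy  # 0 0 1 0 1 1 0
```

Row 3 is flagged here because row 7 puts the same person, acqdisp, and month in 2008, two years before 2010.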
Here is another data.table solution, which I think should be relatively fast:
library(data.table)
setDT(insider_dataset)
insider_dataset[, diff := c(0, diff(year)), by = c('personid', 'acqdisp', 'month')][,
  # shift() is data.table's lag; stats::lag() would not shift a plain vector
  routine_dummy := +(diff == 1 & shift(diff) == 1), by = c('personid', 'acqdisp', 'month')]
insider_dataset
Output
personid cusip6 acqdisp trandate month year diff routine_dummy
1: 13080542 143436 D 2000-01-03 1 2000 0 0
2: 12260711 143436 A 2002-01-07 1 2002 0 0
3: 12700206 143436 D 2010-10-03 10 2010 0 0
4: 7161 382388 A 2011-09-03 9 2011 0 0
5: 7161 382388 A 2012-09-08 9 2012 1 0
6: 7161 382388 A 2013-09-03 9 2013 1 1
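One thing worth checking with the diff-based approach: diff(year) compares each row with the row before it in the group, so it appears to rely on rows being ordered by year within each group, and it flags a row only when both of the two preceding years are present. A minimal base R sketch on made-up years (the helper `consecutive_flag` and the vectors are illustrative, not from the answer):

```r
# Flag positions whose two previous elements are exactly one and two
# years earlier, mirroring "diff == 1 & lagged diff == 1".
consecutive_flag <- function(year) {
  d <- c(0, diff(year))         # gap to the previous row
  prev_d <- c(NA, head(d, -1))  # lagged gap (NA for the first row)
  +(d == 1 & !is.na(prev_d) & prev_d == 1)
}

sorted   <- c(2011, 2012, 2013)
unsorted <- c(2012, 2011, 2013)

consecutive_flag(sorted)          # 0 0 1
consecutive_flag(unsorted)        # 0 0 0: same years, but row order hides the run
consecutive_flag(sort(unsorted))  # 0 0 1
```

If the real data are not already sorted by date, ordering them first (for example with data.table's setorder) would be a sensible precaution.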