R: adding a dummy variable with multiple conditions

I have a data frame called "insider_dataset" that looks like this:

personid cusip6 acqdisp trandate   month  year
     <dbl> <chr>  <chr>   <date>     <dbl> <dbl>
1 13080542 143436 D       2000-01-03     1  2000
2 12260711 143436 A       2002-01-07     1  2002
3 12700206 143436 D       2010-10-03    10  2010
4     7161 382388 A       2011-09-03     9  2011
5     7161 382388 A       2012-09-08     9  2012
6     7161 382388 A       2013-09-03     9  2013
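My goal is to add a dummy variable called "routine_dummy" that equals 1 if the personid appears with the same acqdisp and the same month in the two years prior to trandate, and zero otherwise. In this example, the dummy variable equals 1 only in row 6. What I tried is: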

while (i <= nrow(insider_dataset)) {
  if (nrow(subset(insider_dataset, personid == insider_dataset$personid[i]
                  & month == month[i] & year == year[i]-1 | year == year[i]-2 & acqdisp == 
                  acqdisp[i])) > 1) {
    insider_dataset$routine_dummy[i] <- 1
  }
  else insider_dataset$routine_dummy[i] <- 0
  i <- i+1
}

Loops are not efficient in R, which is why you should prefer vectorized operations (use data.table if you really want speed). Also, if you want to check that the pattern is present in both of the previous two years, change any => all.

tidyverse

library(tidyverse)
dt %>%
  group_by(personid, acqdisp, month) %>%
  mutate(routine_dummy = +sapply(year[row_number()], function(x) any((x - 1:2) %in% year)))
# A tibble: 7 x 7
# Groups:   personid, acqdisp, month [4]
  personid cusip6 acqdisp trandate   month  year routine_dummy
1 13080542 143436 D       2000-01-03     1  2000             0
2 12260711 143436 A       2002-01-07     1  2002             0
3 12700206 143436 D       2010-10-03    10  2010             1
4     7161 382388 A       2011-09-03     9  2011             0
5     7161 382388 A       2012-09-08     9  2012             1
6     7161 382388 A       2013-09-03     9  2013             1
7 12700206 143436 D       2008-10-03    10  2008             0

data.table

library(data.table)
setDT(dt)
dt[,
  routine_dummy := sapply(1:.N, function(i) +any((year[i] - 1:2) %in% year)), by = .(personid, acqdisp, month)
][]
   personid cusip6 acqdisp   trandate month year routine_dummy
1: 13080542 143436       D 2000-01-03     1 2000             0
2: 12260711 143436       A 2002-01-07     1 2002             0
3: 12700206 143436       D 2010-10-03    10 2010             1
4:     7161 382388       A 2011-09-03     9 2011             0
5:     7161 382388       A 2012-09-08     9 2012             1
6:     7161 382388       A 2013-09-03     9 2013             1
7: 12700206 143436       D 2008-10-03    10 2008             0
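
The any versus all distinction mentioned above can be illustrated with a small base R sketch. It is not part of the original answer: the helper name flag_routine is hypothetical, and the toy data is assumed to be the question's six rows plus the extra 2008 row that appears in the answer's output (only the columns needed for the check are kept).

```r
# Toy data (assumption): the question's rows plus the extra 2008 row from the answer's output.
dt <- data.frame(
  personid = c(13080542, 12260711, 12700206, 7161, 7161, 7161, 12700206),
  acqdisp  = c("D", "A", "D", "A", "A", "A", "D"),
  month    = c(1, 1, 10, 9, 9, 9, 10),
  year     = c(2000, 2002, 2010, 2011, 2012, 2013, 2008)
)

# Hypothetical helper: for each row, check whether the previous one and two
# years occur in the same personid/acqdisp/month group.
# `combine` is any() (at least one of the two years) or all() (both years).
flag_routine <- function(data, combine) {
  grp <- interaction(data$personid, data$acqdisp, data$month, drop = TRUE)
  ave(data$year, grp, FUN = function(y) {
    sapply(y, function(x) as.numeric(combine((x - 1:2) %in% y)))
  })
}

dt$dummy_any <- flag_routine(dt, any)
dt$dummy_all <- flag_routine(dt, all)
dt
```

With any, the 2010, 2012 and 2013 rows are flagged; with all, only the 2013 row is, which is the result the question describes for its row 6.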

Here is another data.table solution, which I think will be relatively fast:

library(data.table)

setDT(insider_dataset)

# shift() is data.table's lag; stats::lag() would not shift a plain vector
insider_dataset[, diff := c(0, diff(year)), by = c('personid', 'acqdisp', 'month')][,
  routine_dummy := +(diff == 1 & shift(diff) == 1), by = c('personid', 'acqdisp', 'month')]

insider_dataset
Output:

   personid cusip6 acqdisp   trandate month year diff routine_dummy
1: 13080542 143436       D 2000-01-03     1 2000    0             0
2: 12260711 143436       A 2002-01-07     1 2002    0             0
3: 12700206 143436       D 2010-10-03    10 2010    0             0
4:     7161 382388       A 2011-09-03     9 2011    0             0
5:     7161 382388       A 2012-09-08     9 2012    1             0
6:     7161 382388       A 2013-09-03     9 2013    1             1
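
To see why the year-difference approach flags only the third consecutive year, here is a minimal base R sketch on a single group's years. It assumes the years are sorted ascending within the group; the variable names are illustrative only. The previous-row value is built by hand here; data.table::shift() or dplyr::lag() do the same job, whereas stats::lag() leaves a plain vector's values unchanged.

```r
year <- c(2011, 2012, 2013)       # one group's years, assumed sorted ascending

d    <- c(0, diff(year))          # gap to the previous row
prev <- c(NA, head(d, -1))        # the previous row's gap (NA for the first row)

# 1 only where this gap and the previous gap are both exactly one year
routine_dummy <- as.integer(d == 1 & !is.na(prev) & prev == 1)
routine_dummy

# stats::lag() only adjusts time-series attributes; the values stay in place,
# so it cannot stand in for `prev`
lagged_values <- as.vector(stats::lag(d))
```

Because the check relies on row order, unsorted data would need to be ordered by year within each group first.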