
R: filter one row per year


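I'm trying to filter a list of dates so that it keeps only one date per year, with the one-year window resetting at each included date.

In the table below I only want to keep the rows where `include = 1` (for this example I created the `include` column by hand). If you look closely:

`id=10` is included because it falls more than one year after `id=1`, whereas `id=9` is not yet a year past `id=1`.

`id=22` is included because it falls more than one year after `id=10`, whereas `id=21` is not yet a year past `id=10`.

The table, sorted by `testdate` in ascending order: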

| id |  testdate  | include (the column I want) |
|:--:|:----------:|:-------:|
|  1 | 2008-02-26 |    1*   |
|  2 | 2008-03-07 |    0    |
|  3 | 2008-04-03 |    0    |
|  4 | 2008-04-25 |    0    |
|  5 | 2008-07-23 |    0    |
|  6 | 2008-10-09 |    0    |
|  7 | 2008-10-28 |    0    |
|  8 | 2009-01-14 |    0    |
|  9 | 2009-01-28 |    0    |
| 10 | 2009-05-19 |    1*   |
| 11 | 2009-06-05 |    0    |
| 12 | 2009-06-05 |    0    |
| 13 | 2009-06-26 |    0    |
| 14 | 2009-07-15 |    0    |
| 15 | 2009-07-15 |    0    |
| 16 | 2009-08-18 |    0    |
| 17 | 2009-08-18 |    0    |
| 18 | 2009-09-08 |    0    |
| 19 | 2009-09-25 |    0    |
| 20 | 2010-03-19 |    0    |
| 21 | 2010-04-06 |    0    |
| 22 | 2010-06-30 |    1*   |
| 23 | 2010-10-07 |    0    |
| 24 | 2010-10-21 |    0    |
| 25 | 2010-10-30 |    0    |
| 26 | 2010-12-10 |    0    |
| 27 | 2011-03-04 |    0    |
| 28 | 2011-05-11 |    0    |
| 29 | 2012-03-08 |    1*   |
| 30 | 2012-03-23 |    0    |
| 31 | 2012-09-13 |    0    |
| 32 | 2013-03-21 |    1*   |
| 33 | 2014-10-08 |    1*   |

What I tried with the `dplyr` library:

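df %>%
  # compute the interval between consecutive test dates
  mutate(interval = as.double(difftime(testdate, lag(testdate), units = 'days'))) %>%
  # accumulate the interval in days
  mutate(interval_cum = if_else(is.na(interval), -1, interval + lag(interval))) %>%
  mutate(interval_cum2 = if_else(lag(interval) > 365, 0, interval_cum)) %>%
  # flag the first row and every row whose cumulative interval qualifies
  mutate(include = if_else(row_number(testdate) == 1 | interval > 365 | interval_cum == -1 | interval_cum2 > 365, 1, 0, 0))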

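But this misses ids 10, 22 and 32, because `lag()` only looks one row back and I cannot carry the last included date across multiple rows. Does anyone know an efficient way to do this?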

Raw data as R input:

structure(list(testdate = structure(c(13935, 13945, 13972, 13994, 14083, 14161, 14180, 14258, 14272, 14383, 14400, 14400, 14421, 14440, 14440, 14474, 14474, 14495, 14512, 14687, 14705, 14790, 14889, 14903, 14912, 14953, 15037, 15105, 15407, 15422, 15596, 15785, 16351), class = "Date"), include = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1)), .Names = c("testdate", "include"), row.names = c(NA, -33L), class = c("tbl_df", "tbl", "data.frame"))

After the loop, `start_date` will contain the vector of dates to include:

start_date <- datum$testdate[1]
for (i in seq_along(datum$testdate)) {
  x <- datum$testdate[i]  # index instead of iterating directly, so x keeps its Date class
  check_new <- start_date[length(start_date)] + 365
  if (x > check_new) {
    start_date <- c(start_date, x)
  }
}

start_date

Or use a custom function like this:
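identify_new_year = function(x){
    # x: Date vector sorted in ascending order
    indices = integer(0)
    start = x[1]          # date of the most recently included row
    ind = 1
    indices[ind] = ind    # the first row is always included
    for (i in seq_along(x)[-1]){
        # include row i once it is at least 365 days after the last included date
        if (as.numeric(x[i] - start) >= 365){
            ind = ind + 1
            indices[ind] = i
            start = x[i]
        }
    }
    return(indices)
}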

identify_new_year(df$testdate)
#[1]  1 10 22 29 32 33
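
To fill the `include` column itself rather than a vector of indices, one option is to carry the last included date forward with base R's `Reduce(..., accumulate = TRUE)`. This is only a sketch under stated assumptions: the dates are sorted in ascending order, "one year" means strictly more than 365 days (as in the `for` loop above; the custom function uses `>= 365`), and the names `d`, `anchor` and `include` are illustrative, not taken from the answers.

```r
# Small illustrative subset of the question's dates (sorted ascending)
testdate <- as.Date(c("2008-02-26", "2008-03-07", "2009-01-28",
                      "2009-05-19", "2010-04-06", "2010-06-30"))
d <- as.numeric(testdate)  # days since 1970-01-01

# anchor[i] is the most recently included date as of row i:
# it only moves forward when the current date is more than 365 days past it
anchor <- Reduce(function(last, cur) if (cur - last > 365) cur else last,
                 d, accumulate = TRUE)

# a row is included exactly when it is the first row carrying a new anchor
include <- as.integer(!duplicated(anchor))

data.frame(testdate, include)
```

`Reduce` still walks the rows one at a time internally, so this changes how the code reads, not its complexity.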

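Indeed, as I said, I created this column by hand; I'm looking for a way to set the `include` column.

I think these questions are related:

That's very neat! But is using loops considered acceptable practice in R? I come from MySQL, so I think quite procedurally, and I didn't think R was supposed to be treated that way. But it works!

This is amazing. So is the complexity better?

All of these solutions use loops.

I like this solution, but I don't find it more readable.

To avoid multiple comparisons, another approach is `findInterval`: `d = df$testdate; inds = 1L; while(i)`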
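
The `findInterval` idea from the last comment can be sketched roughly as follows. This is not the commenter's snippet but a minimal reconstruction of the idea, assuming ascending dates and the same "at least 365 days" rule as `identify_new_year`; the function name `next_year_indices` and the `gap` parameter are hypothetical.

```r
# Hypothetical helper: jump straight to the next qualifying row with findInterval()
next_year_indices <- function(d, gap = 365) {
  d_num <- as.numeric(d)
  if (length(d_num) == 0) return(integer(0))
  stopifnot(!is.unsorted(d_num))  # findInterval() needs a non-decreasing vector
  inds <- 1L                      # the first row is always included
  repeat {
    threshold <- d_num[inds[length(inds)]] + gap
    # with left.open = TRUE this counts the elements strictly below threshold,
    # so adding 1 gives the first row that is at least `gap` days later
    i <- findInterval(threshold, d_num, left.open = TRUE) + 1L
    if (i > length(d_num)) break
    inds <- c(inds, i)
  }
  inds
}

# The question's dates, given as days since 1970-01-01
d <- as.Date(c(13935, 13945, 13972, 13994, 14083, 14161, 14180, 14258, 14272,
               14383, 14400, 14400, 14421, 14440, 14440, 14474, 14474, 14495,
               14512, 14687, 14705, 14790, 14889, 14903, 14912, 14953, 15037,
               15105, 15407, 15422, 15596, 15785, 16351), origin = "1970-01-01")
next_year_indices(d)
#[1]  1 10 22 29 32 33
```

The loop runs once per included row instead of once per input row, and each step is a binary search, which is where the saving in comparisons comes from.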