R: Detect a sequence by group and compute a new variable for the subset


I need to detect a sequence by group in a data.frame and compute a new variable.

Suppose I have the following `data.frame`:

df1 <- data.frame(ID = c(1,1,1,1,1,1,1,2,2,2,3,3,3,3),
              seqs = c(1,2,3,4,5,6,7,1,2,3,1,2,3,4),
              count = c(2,1,3,1,1,2,3,1,2,1,3,1,4,1),
              product = c("A", "B", "C", "C", "A,B", "A,B,C", "D", "A", "B", "A", "A", "A,B,C", "D", "D"),
              stock = c("A", "A,B", "A,B,C", "A,B,C", "A,B,C", "A,B,C", "A,B,C,D", "A", "A,B", "A,B", "A", "A,B,C", "A,B,C,D", "A,B,C,D"))

df1

> df1
   ID seqs count product   stock
1   1    1     2       A       A
2   1    2     1       B     A,B
3   1    3     3       C   A,B,C
4   1    4     1       C   A,B,C
5   1    5     1     A,B   A,B,C
6   1    6     2   A,B,C   A,B,C
7   1    7     3       D A,B,C,D
8   2    1     1       A       A
9   2    2     2       B     A,B
10  2    3     1       A     A,B
11  3    1     3       A       A
12  3    2     1   A,B,C   A,B,C
13  3    3     4       D A,B,C,D
14  3    4     1       D A,B,C,D
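
I need to compute a measure only for the IDs that contain the following sequence of three consecutive rows: `count` equal to 1, then `count` greater than 1, then `count` equal to 1 again. In this example, this applies to rows 2, 3, 4 for ID 1, rows 8, 9, 10 for ID 2, and rows 12, 13, 14 for ID 3.

For these IDs and rows, I need to compute a measure called `new`. It takes the value of `product` from the last row of the sequence if that value is in the `stock` of the second row of the sequence and not in the `stock` of the first row of the sequence.
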
The expected output looks like this:

> output
  ID seq1 seq2 seq3 new
1  1    2    3    4   C
2  2    1    2    3    
3  3    2    3    4   D
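
To make the rule for `new` concrete, here is a minimal sketch that applies it by hand to the three sequences in the example. The helper names `split_items` and `new_for_sequence` are hypothetical, and comparing the comma-separated values item by item is an assumption about how multi-item `product` values should be handled.

```r
# Split a comma-separated string such as "A,B,C" into its items.
split_items <- function(x) strsplit(x, ",", fixed = TRUE)[[1]]

# Hypothetical helper: apply the rule to one detected three-row sequence.
#   product_last : `product` in the last row of the sequence
#   stock_first  : `stock` in the first row of the sequence
#   stock_second : `stock` in the second row of the sequence
# Assumption: items are compared one by one, and an empty string is
# returned when no item qualifies.
new_for_sequence <- function(product_last, stock_first, stock_second) {
  items <- split_items(product_last)
  keep <- items %in% split_items(stock_second) &
    !(items %in% split_items(stock_first))
  paste(items[keep], collapse = ",")
}

new_for_sequence("C", "A,B", "A,B,C")      # rows 2-4 of ID 1   -> "C"
new_for_sequence("A", "A", "A,B")          # rows 8-10 of ID 2  -> ""
new_for_sequence("D", "A,B,C", "A,B,C,D")  # rows 12-14 of ID 3 -> "D"
```

These three results match the `new` column of the expected output.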
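
Notes:

  • A detected sequence may add no new product to the stock, in which case `new` is empty (as for ID 2 in the expected output)
  • In the original data, many IDs have no sequence at all
  • Some `ID`s have multiple qualifying sequences. All of them should be recorded
  • `count` is always 1 or greater
  • The original data contains millions of `ID`s, with up to 1500 sequences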

How can I write an efficient piece of code to get this output?

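    Here is a `data.table` option:

    library(data.table)

    char_cols

    Here is another approach using `tidyverse`; however, I think `lag` and `lead` make this solution somewhat time-consuming. I added comments to the code to make it clearer.

    I had already spent enough time on it, though, so I posted it anyway.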


    Can `count` be less than 1?

    @C.Braun No. `count` is always 1 or greater.

    `setDT` converts a `data.frame` to a `data.table`. You can use `setDF` to convert it back.

    Qualifying rows in the example data:
     - rows 2, 3, 4 for `ID==1`
     - rows 8, 9, 10 for `ID==2`
     - rows 12, 13, 14 for `ID==3`
    
    library(tidyverse)
    
    df1 %>% group_by(ID) %>%  
    
     # this finds the row with count > 1 which ...
     #... the counts of the row before and the one of after it equals to 1
     mutate(test = (count > 1 & c(F, lag(count==1)[-1]) & c(lead(count==1)[-n()],F))) %>% 
    
     # this makes a column which has value of True for each chunk...      
     #that meets desired condition to later filter based on it
     mutate(test2 = test | c(F,lag(test)[-1]) | c(lead(test)[-n()], F))  %>% 
    
     filter(test2) %>% ungroup() %>% 
    
     # group each three occurrences in case of having multiple ones within each ID
     group_by(G=trunc(3:(n()+2)/3)) %>% group_by(ID,G) %>% 
    
     # creating new column with string extracting techniques ...
     #... (assuming those columns are characters) 
     mutate(new=
     str_remove_all(
        as.character(regmatches(stock[2], gregexpr(product[3], stock[2]))),
                   stock[1])) %>% 
    
      # selecting desired columns and adding times for long to wide conversion
      select(ID,G,seqs,new) %>% mutate(times = 1:n()) %>% ungroup() %>% 
    
      # long to wide conversion using tidyr (part of tidyverse)
      gather(key, value, -ID, -G, -new, -times) %>%
      unite(col, key, times) %>% spread(col, value) %>% 
    
      # making the desired order of columns
      select(-G,-new,new) %>% as.data.frame()
    
    #   ID seqs_1 seqs_2 seqs_3 new
    # 1  1      2      3      4   C
    # 2  2      1      2      3    
    # 3  3      2      3      4   D
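
The code of the `data.table` answer is incomplete here, so the following is not that answer's code. It is a minimal sketch of how a grouped `shift()` approach might look, assuming the sequence is "`count` equal to 1, then greater than 1, then equal to 1" on consecutive rows within an `ID`. The helper `new_for_sequence` and the intermediate column names (`count_prev`, `stock_first`, and so on) are hypothetical, and the item-by-item comparison of comma-separated values is an assumption.

```r
library(data.table)

dt <- data.table(
  ID      = c(1,1,1,1,1,1,1,2,2,2,3,3,3,3),
  seqs    = c(1,2,3,4,5,6,7,1,2,3,1,2,3,4),
  count   = c(2,1,3,1,1,2,3,1,2,1,3,1,4,1),
  product = c("A", "B", "C", "C", "A,B", "A,B,C", "D",
              "A", "B", "A", "A", "A,B,C", "D", "D"),
  stock   = c("A", "A,B", "A,B,C", "A,B,C", "A,B,C", "A,B,C", "A,B,C,D",
              "A", "A,B", "A,B", "A", "A,B,C", "A,B,C,D", "A,B,C,D")
)

# Hypothetical helper: items of `product_last` that are in `stock_second`
# but not in `stock_first`, pasted back together ("" if none qualify).
new_for_sequence <- function(product_last, stock_first, stock_second) {
  items <- strsplit(product_last, ",", fixed = TRUE)[[1]]
  first <- strsplit(stock_first, ",", fixed = TRUE)[[1]]
  second <- strsplit(stock_second, ",", fixed = TRUE)[[1]]
  paste(items[items %in% second & !(items %in% first)], collapse = ",")
}

# For every row, fetch the neighbouring rows' values within the same ID.
# The first/last row of each ID gets NA for its missing neighbour.
dt[, `:=`(
  seq1         = shift(seqs,    1L, type = "lag"),
  seq3         = shift(seqs,    1L, type = "lead"),
  count_prev   = shift(count,   1L, type = "lag"),
  count_next   = shift(count,   1L, type = "lead"),
  stock_first  = shift(stock,   1L, type = "lag"),
  product_last = shift(product, 1L, type = "lead")
), by = ID]

# Keep the middle row of each 1, >1, 1 pattern; rows where the condition
# evaluates to NA are dropped by data.table's subsetting.
hits <- dt[count > 1 & count_prev == 1 & count_next == 1]

# Apply the rule row by row; vapply also copes with zero matching rows.
hits[, new := vapply(
  seq_len(.N),
  function(i) new_for_sequence(product_last[i], stock_first[i], stock[i]),
  character(1)
)]

output <- hits[, .(ID, seq1, seq2 = seqs, seq3, new)]
output
```

Because each middle row is tested independently, an `ID` with several qualifying sequences (including overlapping ones) yields one output row per sequence. The row-wise `vapply` is the simplest correct form, not the fastest; on millions of IDs it would likely be the first part to optimize.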