R 分组观察中以重复为条件的虚拟变量编辑_R_Dummy Variable

R 分组观察中以重复为条件的虚拟变量编辑

R 分组观察中以重复为条件的虚拟变量编辑,r,dummy-variable,R,Dummy Variable,谢谢你的回复。然而，我仍然没有设法解决我的问题，因为我的数据集包含700000个观测值，下面的所有方法都会导致错误，或者只是继续运行数小时而没有完成（我可以看出Rstudio R会话正在运行，并且消耗了我的大量RAM，但它根本不会去任何地方）正如您所想象的，将数据集拆分为更小的部分不是一个选项，因为这会破坏练习的目的：我需要查看之前的每一个观察结果，以获得所需的结果有什么想法吗？我会暂时不回答这个问题，但如果你们认为我应该发布一个新问题，我会的（我真的不知道关于这些事情的礼仪，所以请随意留下

谢谢你的回复。然而，我仍然没有设法解决我的问题，因为我的数据集包含700000个观测值，下面的所有方法都会导致错误，或者只是继续运行数小时而没有完成（我可以看出Rstudio R会话正在运行，并且消耗了我的大量RAM，但它根本不会去任何地方）

正如您所想象的，将数据集拆分为更小的部分不是一个选项，因为这会破坏练习的目的：我需要查看之前的每一个观察结果，以获得所需的结果

有什么想法吗？我会暂时不回答这个问题，但如果你们认为我应该发布一个新问题，我会的（我真的不知道关于这些事情的礼仪，所以请随意留下建议）

原职正如标题所示，我正在寻找一个虚拟变量，它以分组观察中的重复为条件

考虑以下数据帧：

   id name year
1   c   af 2000
2   c   el 2000
3   c   in 2000
4   c   ud 2000
5   d   ot 2000
6   d   an 2000
7   d   el 2000
8   d   un 2000
9   f   yt 2002
10  f   ip 2002
11  f   ot 2002
12  f   el 2002
13  g   yt 2003
14  g   af 2003
15  g   ol 2003
16  g   in 2003
17  h   in 2003
18  h   eg 2003
19  h   yt 2003
20  h   af 2003
21  j   ot 2004
22  j   el 2004
23  j   ip 2004
24  j   yt 2004

我正在寻找一个函数，该函数允许我按id对数据进行分组，如果一个id包含至少三个名称，则返回值“1”。按前一个id，我的意思是前一个id的年份必须小于当前id的年份

所需的输出应如下所示：

   id name year dummy
1   c   af 2000     0
2   c   el 2000     0
3   c   in 2000     0
4   c   ud 2000     0
5   d   ot 2000     0
6   d   an 2000     0
7   d   el 2000     0
8   d   un 2000     0
9   f   yt 2002     0
10  f   ip 2002     0
11  f   ot 2002     0
12  f   el 2002     0
13  g   yt 2003     0
14  g   af 2003     0
15  g   ol 2003     0
16  g   in 2003     0
17  h   in 2003     0
18  h   eg 2003     0
19  h   yt 2003     0
20  h   af 2003     0
21  j   ot 2004     1
22  j   el 2004     1
23  j   ip 2004     1
24  j   yt 2004     1

id=“j”具有值dummy=“1”，因为id=“f”中至少有三个名称“yt”、“ip”和“ot”。在本例中，还出现了第四个名称“el”，但这并不影响结果

请注意，id=“h”的值为dummy=“0”，尽管id=“g”中也出现了三个名称。这是因为这两起事件都发生在2003年，因此不符合单独年份的条件

数据：

以下是我使用dplyr和tidyr以及一个函数来识别具有3个或更多匹配名称的ID的解决方案：

library(dplyr)
library(tidyr)

test <- function(x){
  out2 <- sapply(1:length(x), function(j){
    out <- sapply(1:j, function(i){
      sum(x[[j]] %in% x[[i]])
    })
    out[j]<-NA
    which(out >= 3) %>% min() %>% {ifelse(is.infinite(.),NA,.)}

  })
  out2
}

DF2 <-  DF %>% group_by(id, year) %>% 
  summarise(names = list(name)) %>% ungroup() %>% 
  mutate(dummy2 = test(names)) %>% 
  mutate(year_mch = year[dummy2], 
         dummy = year_mch < year) %>% 
  unnest() 
DF2

库（dplyr）
图书馆（tidyr）
测试%
摘要（名称=列表（名称））%%>%ungroup（）%%>%
突变（dummy2=测试（名称））%>%
突变（年份=年份[dummy2]，
假人=年\妇幼保健<年%>%
unnest（）
DF2

由于无穷大的值，它会给出一系列警告，但不会影响结果。

基本R中的一种方法：

n <- split(DF$name, DF$id)
m1 <- sapply(n, function(s1) sapply(n, function(s2) sum(s1 %in% s2) ))
diag(m1) <- 0
m1[upper.tri(m1)] <- 0
r1 <- rownames(m1)[!!rowSums(m1 > 2)]

y <- sapply(split(DF$year, DF$id), unique)
m2 <- sapply(y, function(s1) sapply(y, function(s2) +(s1 == s2) ))
diag(m2) <- 0
m2[upper.tri(m2)] <- 0
r2 <- rownames(m2)[!rowSums(m2)]

DF$dummy2 <- as.integer(DF$id %in% intersect(r1,r2))

与Jaap和see24类似，但使用

长度（相交（x，y））

而不是

==

%in%

和

行和

和

：

library(data.table)
setDT(DF)
idDT = unique(DF[, .(id, year)])
setkey(idDT, id)

s = split(DF$name, DF$id)

# identify pairs of ids, where id1 appears before id2 in the table
pairsDT = idDT[, CJ(id1 = id, id2 = id)[id1 < id2]]

# record whether it's strictly before 
pairsDT[, earlier := idDT[id1, x.year] < idDT[id2, x.year]]

# if it's strictly before, compare number of matching elements
pairsDT[earlier == TRUE, matched := 
  mapply(function(x, y) length(intersect(x, y)), s[id1], s[id2]) >= 3
]

dum_ids = pairsDT[matched == TRUE, unique(id2)]

在base R中，可以使用

combn

完成类似的操作。我想，与仅仅将数据存储在图形中（例如，使用IGRAPHE包）并从图形开始工作相比，这仍然是非常低效的。

我会找任何借口将数据问题转换为图形问题，因此为弗兰克提出这一点而干杯。这是一个

igraph

解决方案。本质上，它将数据转换为有向树。所有节点仅与层次结构中较高的节点进行比较。因此，C是树的顶部，不与任何其他节点进行比较，J是终端，与链中它上面的所有节点进行比较。要提取层次结构中较高的所有节点，只需使用（深度优先搜索）

dfs

函数

library(tidyverse)
library(igraph)

#node list containing data specific to the group
nodelist <- DF %>%
  group_by(id, year) %>%
  nest()

#edge list containing connections. A group directly before a node points toward a future group
edgelist <- data.frame(
  from = nodelist$id %>% .[1:(length(.)-1)],
  to = nodelist$id %>% .[2:length(.)]
)

#create the data frame
g <- graph_from_data_frame(edgelist, T, nodelist)

#let's iterate through the nodes
dummy <- map_lgl(V(g)$name, function(vertex){

  #depth first search to pull out all nodes higher up on the tree 
  full_path <- dfs(g, vertex, 'in', unreachable = F) %>%
    .$order %>% 
    .[!is.na(.)] 

  #if there is no node higher up, then we're done
  if(length(full_path) <= 1) return(F)

  #The first node returned is the node we're iterating over
  this_vertex <- full_path[1]
  other_vertices <- full_path[full_path != this_vertex]

  #this is the logic for the dummy variable
  similar_groups <- map_lgl(other_vertices, function(other_vertex){
    (sum(this_vertex$data[[1]]$name %in% other_vertex$data$name) >= 3) & 
      (this_vertex$year[[1]] != other_vertex$year)
  })

  return(T %in% similar_groups)
})

V(g)$dummy2 <- dummy

as_data_frame(g, 'vertices') %>%
  unnest()

所以这个解决方案是纯base R。我曾经读过一篇文章，声称使用

%。这是我第一次尝试。我想我喜欢它
. <- DF[c('id', 'name', 'year')]
. <- merge(., ., by = 'name')
. <- .[.["id.x"] != .["id.y"] & .["year.x"] < .["year.y"],]
. <- .[c('id.x', 'id.y', 'year.x', 'year.y', "name")] 
.$n <- 1
. <- aggregate(n ~ id.x + id.y, data = ., sum) 
. <- .[.['n'] >= 3, 'id.y']

DF$dummy2 <- . == DF$id

 在OP编辑了关于速度和内存问题后，采用Rcpp
方法如何：
#create a integer column out of id for non-equi join later
setDT(DF)[, nid := rleid(id)]

#convert name into an integer code
DF[DF[,.(name=unique(name))][, IntCode := .I], iname := IntCode, on=.(name)]

library(inline)
library(Rcpp)
cppFunction('
NumericVector hasOccur(NumericVector nid, NumericVector year, List iname) {
    List namelist(iname);
    int sz = namelist.size(), i, j, m, n, nPrev, nCurr, count;
    NumericVector res(sz);

    for(i=0; i<sz; i++) {
        for(j=0; j<i; j++) {
            if (nid[j] < nid[i] && year[j] < year[i]) {
                SEXP prevList = namelist[j];
                SEXP currList = namelist[i];

                NumericVector cl(currList);
                NumericVector pl(prevList);
                nPrev = pl.size();
                nCurr = cl.size();

                res[i] = 0;
                count = 0;
                for(m=0; m<nCurr; m++) {
                    for (n=0; n<nPrev; n++) {
                        if (cl[m] == pl[n]) {
                            count++;
                            break;
                        }
                    }
                }

                if (count >= 3) {
                    res[i] = 1;
                    break;
                }
            }
        }
    }

    return(res);
}')

d <- DF[, .(.(nm=iname)), by=.(nid, year)]
DF[d[, dummy := hasOccur(d$nid, d$year, d$V1)], dummy := dummy, on=.(nid, year)]

#create a integer column out of id for non-equi join later
setDT(DF)[, nid := rleid(id)]

          #self non-equi join
check3 <- DF[DF, .(x.id, x.name, x.year, x.nid, i.id, i.name, i.year, i.nid), on=.(nid<nid, year<year, name=name)][,
    #count the number of occurrence in previous id and year
    uniqueN(x.name, na.rm=TRUE), by=.(i.id, i.year, x.id, x.year)][,
        #check if more than 3
        any(V1 >= 3L), by=.(i.id, i.year)]

#update join to add result to original DF
DF[check3, dummy := as.integer(V1), on=c("id"="i.id", "year"="i.year")]

弗兰克，谢谢你带我来。这对MEID=“h”和year=2003、名称“in”和“af”出现在id=“c”和year=2000、名称“yt”出现在id=“f”和year=2002中都很有趣，那么dummy不应该是1吗？不，名称必须出现在同一个id中，否则一个id至少有三个相同名称的条件没有得到满足。有趣的方法！但是，它返回错误消息：>g>V（g）$dummy2作为_data_frame（g，'顶点'）%%>%+unnest（）错误，出现在%class（graph）中的“igraph”%中：未找到对象“g”
。有什么想法吗？你的节点列表必须在第一列有唯一的值。对！过滤掉重复的代码后，我成功地运行了代码，非常感谢！我还添加了一个进度条来监控流程。它似乎需要大约12秒来循环通过“虚拟”功能，而且它似乎正在减速。如果对所有71883次观测都这样做，则Rstudio至少需要10天才能完成该功能。我绝对认为这种方法既美观又直观，但您认为这种方法最快吗？谢谢您的解决方案。它适用于示例，但在应用于我的~700000个观测数据集时不起作用（请参见问题中的编辑）。@LucasE我添加了一个RcppapproachI安装了必要的软件包，并尝试运行代码。一个小时后，Rstudio仍在运行代码，我不得不终止它。此时，Rstudios停止响应，我不得不重新启动Rstudios。不过还是要谢谢你！（我想指出的是，我是在我的工作笔记本电脑上工作的，这可能不足以处理如此大的操作。）@LucasE我可以问一下解决方案现在是否有效吗？还是你只是想结束这个问题？
   name year dummy2 name1 dummy
1     c 2000  FALSE    af     0
2     c 2000  FALSE    el     0
3     c 2000  FALSE    in     0
4     c 2000  FALSE    ud     0
5     d 2000  FALSE    ot     0
6     d 2000  FALSE    an     0
7     d 2000  FALSE    el     0
8     d 2000  FALSE    un     0
9     f 2002  FALSE    yt     0
10    f 2002  FALSE    ip     0
11    f 2002  FALSE    ot     0
12    f 2002  FALSE    el     0
13    g 2003  FALSE    yt     0
14    g 2003  FALSE    af     0
15    g 2003  FALSE    ol     0
16    g 2003  FALSE    in     0
17    h 2003  FALSE    in     0
18    h 2003  FALSE    eg     0
19    h 2003  FALSE    yt     0
20    h 2003  FALSE    af     0
21    j 2004   TRUE    ot     1
22    j 2004   TRUE    el     1
23    j 2004   TRUE    ip     1
24    j 2004   TRUE    yt     1

. <- DF[c('id', 'name', 'year')]
. <- merge(., ., by = 'name')
. <- .[.["id.x"] != .["id.y"] & .["year.x"] < .["year.y"],]
. <- .[c('id.x', 'id.y', 'year.x', 'year.y', "name")] 
.$n <- 1
. <- aggregate(n ~ id.x + id.y, data = ., sum) 
. <- .[.['n'] >= 3, 'id.y']

DF$dummy2 <- . == DF$id

#create a integer column out of id for non-equi join later
setDT(DF)[, nid := rleid(id)]

#convert name into an integer code
DF[DF[,.(name=unique(name))][, IntCode := .I], iname := IntCode, on=.(name)]

library(inline)
library(Rcpp)
cppFunction('
NumericVector hasOccur(NumericVector nid, NumericVector year, List iname) {
    List namelist(iname);
    int sz = namelist.size(), i, j, m, n, nPrev, nCurr, count;
    NumericVector res(sz);

    for(i=0; i<sz; i++) {
        for(j=0; j<i; j++) {
            if (nid[j] < nid[i] && year[j] < year[i]) {
                SEXP prevList = namelist[j];
                SEXP currList = namelist[i];

                NumericVector cl(currList);
                NumericVector pl(prevList);
                nPrev = pl.size();
                nCurr = cl.size();

                res[i] = 0;
                count = 0;
                for(m=0; m<nCurr; m++) {
                    for (n=0; n<nPrev; n++) {
                        if (cl[m] == pl[n]) {
                            count++;
                            break;
                        }
                    }
                }

                if (count >= 3) {
                    res[i] = 1;
                    break;
                }
            }
        }
    }

    return(res);
}')

d <- DF[, .(.(nm=iname)), by=.(nid, year)]
DF[d[, dummy := hasOccur(d$nid, d$year, d$V1)], dummy := dummy, on=.(nid, year)]

#create a integer column out of id for non-equi join later
setDT(DF)[, nid := rleid(id)]

          #self non-equi join
check3 <- DF[DF, .(x.id, x.name, x.year, x.nid, i.id, i.name, i.year, i.nid), on=.(nid<nid, year<year, name=name)][,
    #count the number of occurrence in previous id and year
    uniqueN(x.name, na.rm=TRUE), by=.(i.id, i.year, x.id, x.year)][,
        #check if more than 3
        any(V1 >= 3L), by=.(i.id, i.year)]

#update join to add result to original DF
DF[check3, dummy := as.integer(V1), on=c("id"="i.id", "year"="i.year")]