Find the closest value within a group for each value in R


I have been stuck on this for two days:

I have a data frame with more than 2 million observations, structured like this:

id = c(1,2,3,4,5,6,7,8,9,10,11,12)
group = c(1,1,1,1,2,2,2,2,3,3,3,3)
sex = c('M','F', 'M', 'M', 'M', 'F', 'F', 'F', 'M', 'M', 'M', 'F')
time = c(10, 11, 11.5, 13, 13.2, 15, 9, 7.4, 18, 12, 34.5, 21)
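For each female, I want to find the male with the closest time, within each group.

For example, id=2 is a female in group 1 with time 11, and the closest male in group 1 is id=3.

And so on for every female in every group.

I tried something like this: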

keep <- function(x){
   a <-  df[which.min(abs(df[which(df[,'sex'] == "M"),'time']-x[,'time'])),]
   return(a) 
}

apply(df, 1, keep)

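You can split the data.frame into males and females within each group, then use outer() to compute the absolute time difference for every male-female combination.

Code: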

lapply(split(df, df[, "group"]), function(x){
  # split by sex
  tmp1 <- split(x, x[, "sex"])
  
  # time difference for every combination
  tmp2 <- abs(t(outer(tmp1[["M"]][, "time"], tmp1[["F"]][, "time"], "-")))
  
  # find minimum for each woman (rowwise minimum)
  # and connect those numbers with original ID in input data.frame
  tmp3 <- tmp1[["M"]][apply(tmp2, 1, which.min), ]
  
  # rownames to represent female ID (requires unique female ids)
  rownames(tmp3) <- tmp1[["F"]][, "id"]
  
  # return
  tmp3
})

# $`1`
#   id group sex time
# 2  3     1   M 11.5
#
# $`2`
#   id group sex time
# 6  5     2   M 13.2
# 7  5     2   M 13.2
# 8  5     2   M 13.2
#
# $`3`
#    id group sex time
# 12  9     3   M   18
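
To make the outer() step more concrete, here is a small sketch restricted to group 1, using the times from the question. The variable names f_time, m_time, m_id, d and closest are introduced here for illustration only and are not part of the answer above.

```r
# Times for group 1 in the question's data (illustrative variable names)
f_time <- c(11)            # female id 2
m_time <- c(10, 11.5, 13)  # male ids 1, 3, 4
m_id   <- c(1, 3, 4)

# Rows = females, columns = males: absolute time difference for every pair
d <- abs(outer(f_time, m_time, "-"))

# For each female (row), pick the male whose column holds the smallest difference
closest <- m_id[apply(d, 1, which.min)]
closest
```

Here d is a 1 x 3 matrix holding the differences 1, 0.5 and 2, so closest is 3, which agrees with the result shown for group 1.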

Are you looking for something like this?

library(data.table)

setDT(df)[
  ,
  c(
    .SD[sex == "F"],
    .(closestM_id = id[sex == "M"][max.col(-abs(outer(
      time[sex == "F"],
      time[sex == "M"], "-"
    )))])
  ), group
]

   group id sex time closestM_id
1:     1  2   F 11.0           3
2:     2  6   F 15.0           5
3:     2  7   F  9.0           5
4:     2  8   F  7.4           5
5:     3 12   F 21.0           9

Data:

df <- data.frame(id = c(1,2,3,4,5,6,7,8,9,10,11,12),
                 group = c(1,1,1,1,2,2,2,2,3,3,3,3),
                 sex = c('M','F', 'M', 'M', 'M', 'F', 'F', 'F', 'M', 'M', 'M', 'F'),
                 time = c(10, 11, 11.5, 13, 13.2, 15, 9, 7.4, 18, 12, 34.5, 21))
> dput(df)
structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
    group = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3), sex = c("M",
    "F", "M", "M", "M", "F", "F", "F", "M", "M", "M", "F"), time = c(10,
    11, 11.5, 13, 13.2, 15, 9, 7.4, 18, 12, 34.5, 21)), class = "data.frame", row.names = c(NA,
-12L))

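Restructuring the data helps. Create a separate data frame for each sex, build a third data set containing every unique male-female pairing, then merge and subset to narrow it down to the pairings you need. expand.grid is handy for computing these kinds of combinations, after which dplyr functions can handle the rest of the logic.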

library(dplyr)

# create one data set for females
females <- df %>%
    filter(sex == "F") %>%
    select(f_id = id, f_time = time, f_group = group)

# create one data set for males
males <- df %>%
    filter(sex == "M") %>%
    select(m_id = id, m_time = time, m_group = group)

# All possible pairings of males and females
pairs <- expand.grid(f_id = females %>% pull(f_id),
                     m_id = males %>% pull(m_id),
                     stringsAsFactors = FALSE) 

# Merge in information about each individual
pairs <- pairs %>%
    left_join(females, by = "f_id") %>%
    left_join(males, by = "m_id") %>%
    # eliminate any pairings that are in different groups
    filter(f_group == m_group) 

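# Assumed final step (the listing otherwise stops at all same-group pairs):
# for each female, keep the male with the smallest absolute time difference
pairs %>%
    mutate(time_diff = abs(f_time - m_time)) %>%
    group_by(f_id) %>%
    filter(time_diff == min(time_diff)) %>%
    select(m_id, f_id)
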
Output (the closest pairs):

# A tibble: 5 x 2
# Groups:   f_id [5]
   m_id  f_id
  <dbl> <dbl>
1     3     2
2     5     6
3     5     7
4     5     8
5     9    12

A data.table solution using a rolling join to the nearest time, using the df from Thomas's answer:

setDT(df)
df[sex=="F",][,closestM_id := df[sex=="M",][df[sex=="F",], 
                                            x.id, 
                                            on = .(group, time), roll = "nearest"]]
#    id group sex time closestM_id
# 1:  2     1   F 11.0           3
# 2:  6     2   F 15.0           5
# 3:  7     2   F  9.0           5
# 4:  8     2   F  7.4           5
# 5: 12     3   F 21.0           9

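The numbers of males and females are not the same, so directly subtracting the times within each group with a vectorized a - b calculation will not work.

Hi, thanks for your answer. I tried it with my full data.frame, but got the following error: Error in .rowNamesDF

One possible problem is that the id variable is not a unique identifier for each female in the data.frame.

Thanks for your answer, I am trying to get it to work with my full data.frame. When I try it, NA is returned for every closestM_id. I think this is because my group variable is a "factor" with letters rather than a list of integers. It works perfectly. Thank you very much!

Forget my earlier comment on the roll= option. Thanks, great answer. I like this approach, it is very convenient.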
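
The row-name error raised in the comments can probably be sidestepped by not using row names at all. Below is a minimal base R sketch, assuming the df from the question. The helper name closest_male and the output columns f_id, f_time and closestM_id are chosen here for illustration. It stores the female id in an ordinary column, so duplicated ids should not trigger a row-name error.

```r
df <- data.frame(
  id    = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
  group = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  sex   = c("M", "F", "M", "M", "M", "F", "F", "F", "M", "M", "M", "F"),
  time  = c(10, 11, 11.5, 13, 13.2, 15, 9, 7.4, 18, 12, 34.5, 21)
)

# Hypothetical helper: closest male for every female within one group
closest_male <- function(x) {
  f <- x[x$sex == "F", ]
  m <- x[x$sex == "M", ]
  # Groups without females or without males contribute no rows
  if (nrow(f) == 0 || nrow(m) == 0) return(NULL)
  # Rows = females, columns = males
  d <- abs(outer(f$time, m$time, "-"))
  data.frame(
    group       = f$group,
    f_id        = f$id,
    f_time      = f$time,
    closestM_id = m$id[apply(d, 1, which.min)]
  )
}

# Apply the helper per group and stack the results into one data frame
result <- do.call(rbind, lapply(split(df, df$group), closest_male))
rownames(result) <- NULL
result
```

On the sample data this pairs females 2, 6, 7, 8 and 12 with males 3, 5, 5, 5 and 9, the same pairs as in the answers above. Because outer() builds a females-by-males matrix for each group, memory use depends on the largest group rather than on the total number of rows.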