Find the closest value within a group for each value in R
I have been stuck on this for two days. I have a data frame with more than 2 million observations in this structure:
id = c(1,2,3,4,5,6,7,8,9,10,11,12)
group = c(1,1,1,1,2,2,2,2,3,3,3,3)
sex = c('M','F', 'M', 'M', 'M', 'F', 'F', 'F', 'M', 'M', 'M', 'F')
time = c(10, 11, 11.5, 13, 13.2, 15, 9, 7.4, 18, 12, 34.5, 21)
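For each female, I want to find the male with the closest time, within the same group.
For example, id=2 is a female in group 1 with time 11, and the closest male in group 1 is id=3.
And so on for every female in every group.
I tried something like this: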
keep <- function(x){
a <- df[which.min(abs(df[which(df[,'sex'] == "M"),'time']-x[,'time'])),]
return(a)
}
apply(df, 1, keep)
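You can split the data.frame() into males and females within each group, then use outer() to compute the absolute time difference for every male-female combination.
Code: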
lapply(split(df, df[, "group"]), function(x){
  # split by sex
  tmp1 <- split(x, x[, "sex"])
  # time difference for every combination
  tmp2 <- abs(t(outer(tmp1[["M"]][, "time"], tmp1[["F"]][, "time"], "-")))
  # find minimum for each woman (rowwise minimum)
  # and connect those numbers with original ID in input data.frame
  tmp3 <- tmp1[["M"]][apply(tmp2, 1, which.min), ]
  # rownames to represent female ID
  rownames(tmp3) <- tmp1[["F"]][, "id"]
  # return
  tmp3
})
# $`1`
# id group sex time
# 2 3 1 M 11.5
#
# $`2`
# id group sex time
# 6 5 2 M 13.2
# 7 5 2 M 13.2
# 8 5 2 M 13.2
#
# $`3`
# id group sex time
# 12 9 3 M 18
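To make the outer() step concrete, here is a minimal sketch for group 1 only, using the times from the question. The variable names (f_time, m_time, m_id, d, closest_male) are illustrative and not part of the answer above.

```r
# Group 1 from the question: one female (id 2) and three males (ids 1, 3, 4)
f_time <- c(11)
m_time <- c(10, 11.5, 13)
m_id   <- c(1, 3, 4)

# Rows are females, columns are males: |f_time[i] - m_time[j]|
d <- abs(outer(f_time, m_time, "-"))

# For each female (row), the column with the smallest difference
closest_male <- m_id[apply(d, 1, which.min)]
closest_male
```

Note that which.min returns the first minimum, so if two males are equally close the one that appears first in the data is chosen.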
Are you looking for something like the following?
library(data.table)
setDT(df)[
,
c(
.SD[sex == "F"],
.(closestM_id = id[sex == "M"][max.col(-abs(outer(
time[sex == "F"],
time[sex == "M"], "-"
)))])
), group
]
which gives
group id sex time closestM_id
1: 1 2 F 11.0 3
2: 2 6 F 15.0 5
3: 2 7 F 9.0 5
4: 2 8 F 7.4 5
5: 3 12 F 21.0 9
Data
df <- data.frame(id = c(1,2,3,4,5,6,7,8,9,10,11,12),
group = c(1,1,1,1,2,2,2,2,3,3,3,3),
sex = c('M','F', 'M', 'M', 'M', 'F', 'F', 'F', 'M', 'M', 'M', 'F'),
time = c(10, 11, 11.5, 13, 13.2, 15, 9, 7.4, 18, 12, 34.5, 21))
> dput(df)
structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
group = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3), sex = c("M",
"F", "M", "M", "M", "F", "F", "F", "M", "M", "M", "F"), time = c(10,
11, 11.5, 13, 13.2, 15, 9, 7.4, 18, 12, 34.5, 21)), class = "data.frame", row.names = c(NA,
-12L))
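Restructuring the data helps. Create a separate data frame for each sex, build a third data set containing every unique male-female pairing, then merge and subset to narrow it down to the pairs you want. expand.grid is very handy for computing these kinds of combinations, and dplyr functions can then handle the rest of the logic.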
library(dplyr)
# create one data set for females
females <- df %>%
filter(sex == "F") %>%
select(f_id = id, f_time = time, f_group = group)
# create one data set for males
males <- df %>%
filter(sex == "M") %>%
select(m_id = id, m_time = time, m_group = group)
# All possible pairings of males and females
pairs <- expand.grid(f_id = females %>% pull(f_id),
m_id = males %>% pull(m_id),
stringsAsFactors = FALSE)
# Merge in information about each individual
pairs <- pairs %>%
left_join(females, by = "f_id") %>%
left_join(males, by = "m_id") %>%
# eliminate any pairings that are in different groups
filter(f_group == m_group)
pairs
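The code above stops at the table of same-group pairings; the step that picks the closest male for each female is not shown. One way to finish it might look like the sketch below. It rebuilds the data so that it runs on its own, and the names time_diff and closest are assumptions introduced here, not taken from the answer.

```r
library(dplyr)

df <- data.frame(
  id    = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
  group = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  sex   = c("M", "F", "M", "M", "M", "F", "F", "F", "M", "M", "M", "F"),
  time  = c(10, 11, 11.5, 13, 13.2, 15, 9, 7.4, 18, 12, 34.5, 21)
)

females <- df %>% filter(sex == "F") %>%
  select(f_id = id, f_time = time, f_group = group)
males <- df %>% filter(sex == "M") %>%
  select(m_id = id, m_time = time, m_group = group)

# Same-group pairings, built as in the answer above
pairs <- expand.grid(f_id = females$f_id, m_id = males$m_id) %>%
  left_join(females, by = "f_id") %>%
  left_join(males, by = "m_id") %>%
  filter(f_group == m_group)

# Assumed final step: keep, per female, the pairing with the smallest gap
closest <- pairs %>%
  mutate(time_diff = abs(f_time - m_time)) %>%
  group_by(f_id) %>%
  filter(time_diff == min(time_diff)) %>%
  select(m_id, f_id)

closest
```

Filtering on the minimum keeps ties, so a female with two equally close males would appear twice. Note that expand.grid builds every male-female combination before the group filter is applied, which may be too large for millions of rows.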
Output, the closest pairs:
# A tibble: 5 x 2
# Groups: f_id [5]
m_id f_id
<dbl> <dbl>
1 3 2
2 5 6
3 5 7
4 5 8
5 9 12
A data.table solution using a rolling join to the nearest time.
Using the df from Thomas's answer:
setDT(df)
df[sex=="F",][,closestM_id := df[sex=="M",][df[sex=="F",],
x.id,
on = .(group, time), roll = "nearest"]]
# id group sex time closestM_id
# 1: 2 1 F 11.0 3
# 2: 6 2 F 15.0 5
# 3: 7 2 F 9.0 5
# 4: 8 2 F 7.4 5
# 5: 12 3 F 21.0 9
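As written, the result of the chained call is not stored anywhere. A minimal variant that keeps it in a named table is sketched below; the names females and males are illustrative, and the data is rebuilt so the block runs on its own.

```r
library(data.table)

df <- data.table(
  id    = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
  group = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  sex   = c("M", "F", "M", "M", "M", "F", "F", "F", "M", "M", "M", "F"),
  time  = c(10, 11, 11.5, 13, 13.2, 15, 9, 7.4, 18, 12, 34.5, 21)
)

females <- df[sex == "F"]
males   <- df[sex == "M"]

# Exact match on group, rolling match on time (the last column in `on`);
# x.id is the id of the matched row in `males`
females[, closestM_id := males[females, x.id,
                               on = .(group, time), roll = "nearest"]]
females[]
```

Because the join avoids building all male-female combinations, this approach is likely to scale better to millions of rows than the outer() or expand.grid versions.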
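The number of males and females is not the same, so subtracting the times within each group directly with a vectorized a - b will not work.
Hi, thanks for your answer. I tried it with my full data.frame but got an error from .rowNamesDF.
One possible problem is that the id variable is not a unique identifier for each female in the data.frame().
Thanks for your answer, I am trying to get it to work with my full data.frame. When I try it, NA is returned for every closestM_id. I think this is because my group variable is a factor with letters rather than integers; with integers it works perfectly.
Thanks a lot! Never mind my previous comment about the roll= option.
Thanks, great answer. I like this approach, it is very handy.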