How to subset evenly spaced samples from a data frame *without repeats* in R?

I am trying to create an evenly spaced (in time or depth) subset of a larger dataset in R. My original data are not evenly spaced.

Here are the functions that need improving:

# calculate step size and subsets df accordingly
spacedSS <- function(df, n, var){
    stp <- (max(var)-min(var))/(n - 1)       #calculate step size
    stps <- min(var)+0:(n-1)*stp             #calculate step values
    res <- lookupDepth(df, stps, var)
    return(as.data.frame(res))
}

# finds values in var closest to stps, returns subsetted df
lookupDepth <- function(df, stps, var){ 
    indxs <- rep(0, times=length(stps)) # create empty index vector
    for(i in seq_along(stps)) {         # for every subsample row
                                        # find the one closest to the step value
                                        # TODO: only if it isn't already in the df
        indxs[i] <- which.min((var - stps[i])^2)
    }
    sampls <- df[indxs, ]               #subset by these new indexes
    return(as.data.frame(sampls))
}

So the problem I am trying to solve is that the subsetting function currently does not look at the indices it has already used:

# the problem I'm trying to solve:
length(unique(ss.age$id)) != length(unique(ss.depth$id))
TRUE
# it picked the same samples sometimes because they were the closest ones!
ss.age$id
[1]  1 45 53 55 55 56 57 57 61 78

As you can see, the problem is that while subsetting it does not take into account the samples it has already picked. Any idea how to fix this?

So in the end I asked a friend to help me, and we built a rather complicated method.

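Basically, we wrote a function that checks whether there are any duplicate index values and, if there are, simply fixes them. A mutation function then randomly changes index values. The loss of each new subset is evaluated against the original dataset, and a random mutation is kept if it is better than the previous selection. The selection criterion is fairly loose at first but becomes stricter over time, which results in a pretty cool optimized subset of the data.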
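A minimal sketch of that idea could look like the following. It is not the code referred to in this answer: the function names (nearest_unique, spacing_loss, anneal_subset), the squared-distance loss, and the temperature settings (temp0, cooling) are all assumptions made for illustration. It also assumes the variable has no missing values and that n is no larger than the number of rows.

```r
# Hypothetical helpers; none of these names come from the original code.

# Greedy repair: each target step takes the closest row that is still unused,
# so no index can be picked twice.
nearest_unique <- function(x, targets) {
    stopifnot(length(targets) <= length(x))
    used <- rep(FALSE, length(x))
    idx <- integer(length(targets))
    for (i in seq_along(targets)) {
        d <- abs(x - targets[i])
        d[used] <- Inf                  # rows already picked are off limits
        idx[i] <- which.min(d)
        used[idx[i]] <- TRUE
    }
    idx
}

# Loss (assumed): squared distance between the sorted picks and the ideal steps.
spacing_loss <- function(x, idx, targets) {
    sum((sort(x[idx]) - targets)^2)
}

# Annealing-style refinement: replace one pick with a random unused row.
# Better candidates are always kept; worse ones are kept with a probability
# that shrinks as the temperature cools, so acceptance gets stricter over time.
anneal_subset <- function(x, n, iters = 2000, temp0 = 1, cooling = 0.995) {
    targets <- seq(min(x), max(x), length.out = n)
    idx <- nearest_unique(x, targets)
    loss <- spacing_loss(x, idx, targets)
    best_idx <- idx
    best_loss <- loss
    temp <- temp0
    for (k in seq_len(iters)) {
        unused <- setdiff(seq_along(x), idx)
        if (length(unused) == 0) break  # every row is already in the subset
        cand <- idx
        cand[sample.int(n, 1)] <- unused[sample.int(length(unused), 1)]
        cand_loss <- spacing_loss(x, cand, targets)
        if (cand_loss < loss || runif(1) < exp((loss - cand_loss) / temp)) {
            idx <- cand
            loss <- cand_loss
            if (loss < best_loss) {
                best_idx <- idx
                best_loss <- loss
            }
        }
        temp <- temp * cooling
    }
    sort(best_idx)                      # row indices of the best subset seen
}

# Usage on made-up data (the real `dat` is not shown in the question):
set.seed(1)
dat <- data.frame(id = 1:80, depth = sort(runif(80, 30, 640)))
ss <- dat[anneal_subset(dat$depth, n = 10), ]
anyDuplicated(ss$id)                    # 0 means no sample was picked twice
```

The starting temperature is tied to the scale of the loss, so temp0 would probably need tuning for real depth or age values. Calling nearest_unique on its own already removes the repeats; the annealing loop only tries to improve the spacing from there.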

If you are interested in the code we used, please comment below and we will put it up somewhere.

# plot it using my depthplotter function
source("https://raw.githubusercontent.com/japhir/DepthPlotter/master/DepthPlotter.R")
DepthPlotter(dat[, c("depth", "age")], xlab = "Age (Ma)")
segments(30, ss.depth$depth, ss.depth$age, col = "blue")
segments(ss.age$age, 640, y1 = ss.age$depth, col = "red")
