R: Using binary search to find the closest value in a vector


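As a silly toy example, suppose

x = 4.5
w = c(1,2,4,6,7)

I wonder whether there is a simple R function that finds the index of the closest match to x in w. So if foo is that function, foo(w, x) would return 3. The function match is the right idea, but it seems to apply only to exact matches.

Solutions such as which.min(abs(w - x)) or which(abs(w - x) == min(abs(w - x))) are all O(n) rather than O(log n) (I am assuming w is already sorted).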
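The two linear-scan expressions in the question are not quite interchangeable when two elements are equally close. The following is a minimal sketch of that difference; the query value 5 is a hypothetical choice made only to produce a tie and does not come from the question.

```r
# Linear-scan baselines from the question; both cost O(n).
x <- 5                      # hypothetical query chosen to create a tie
w <- c(1, 2, 4, 6, 7)

d <- abs(w - x)             # distance from every element to x

first_nearest <- which.min(d)        # index of the first minimum only
all_nearest   <- which(d == min(d))  # indices of every tied minimum

print(first_nearest)
print(all_nearest)
```

Any O(log n) replacement therefore has to pick a tie-breaking rule; matching which.min() means preferring the left-hand neighbour.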

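If your vector is lengthy, try a two-step approach:

x = 4.5
w = c(1,2,4,6,7)

sdev = sapply(w, function(v, x) abs(v - x), x = x)  # distance of each element to x
closestLoc = which.min(sdev)                        # index of the smallest distance

This is intended for maddeningly long vectors (millions of rows). Warning: for data that is not very, very, very large, this will actually be slower.

This example is just to give you a basic idea of leveraging parallel processing when you have huge data. Note that I do not recommend using it for simple, fast functions such as abs().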

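You can use data.table to do a binary search:

library(data.table)
dt = data.table(w, val = w) # you'll see why val is needed in a sec
setattr(dt, "sorted", "w")  # let data.table know that w is sorted

Note that if the column w is not already sorted, you have to use setkey(dt, w) instead of setattr(.).

# binary search and "roll" to the nearest neighbour
dt[J(x), roll = "nearest"]

In the final expression, the val column will contain the value you are looking for.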

# or to get the index as Josh points out
# (and then you don't need the val column):
dt[J(x), .I, roll = "nearest", by = .EACHI]
#     w .I
#1: 4.5  3

# or to get the index alone
dt[J(x), roll = "nearest", which = TRUE]
#[1] 3

findInterval() will do this with price-is-right matching (closest without going over); for example, findInterval(4.5, c(1,2,4,5,6)) returns 3.
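"Closest without going over" means the result is the index of the largest element that does not exceed the query, which is not always the nearest element. One way to turn that into a nearest-index lookup is to compare the returned element with its right-hand neighbour. The sketch below is only an illustration: nearest_index is a hypothetical helper name, not a function from any of the answers, and it assumes w is non-empty, sorted in increasing order, and free of NAs.

```r
# Hypothetical helper (not from the original answers): nearest index via
# findInterval(), assuming w is non-empty, sorted increasingly, without NAs.
nearest_index <- function(x, w) {
  n  <- length(w)
  lo <- findInterval(x, w)      # last index with w[lo] <= x; 0 if x < w[1]
  lo <- pmin(pmax(lo, 1L), n)   # clamp into 1..n
  hi <- pmin(lo + 1L, n)        # right-hand neighbour, clamped at n
  # pick the closer of the two neighbours; ties go to the left one
  ifelse(abs(w[hi] - x) < abs(x - w[lo]), hi, lo)
}

w <- c(1, 2, 4, 6, 7)
print(findInterval(4.5, w))                 # floor-style match
print(nearest_index(c(0, 4.5, 5.9, 9), w))  # vectorised over several queries
```

Ties are resolved toward the left neighbour here so that the result agrees with which.min(abs(w - x)) on strictly increasing input; a different rule may suit other uses.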

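To do this on character vectors, Martin Morgan suggested using the function bsearch7.

You can always implement a custom binary search algorithm to find the closest value. Alternatively, you can leverage the standard implementation of libc bsearch(). You can use other binary search implementations as well, but that does not change the fact that you have to implement the comparison function carefully in order to find the closest element in the array. The issue with a standard binary search implementation is that it is meant for exact comparison. That means your improvised comparison function has to do some extra work to decide whether an element of the array is close enough to count as a match. To achieve this, the comparison function needs knowledge of other elements in the array, especially the following aspects:

  • the position of the current element (the one being compared with the key)
  • the distance from the key and how it compares with that of the neighbours (the previous or next element)

To provide this extra knowledge to the comparison function, the key has to be packaged with additional information (not just the key value). Once the comparison function is aware of these aspects, it can decide whether the element itself is the closest. When it knows that it is the closest, it returns "match".

The following C code finds the closest value: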

#include <stdio.h>
#include <stdlib.h>

struct key {
        int key_val;
        int *array_head;
        int array_size;
};

int compar(const void *k, const void *e) {
        struct key *key = (struct key*)k;
        int *elem = (int*)e;
        int *arr_first = key->array_head;
        int *arr_last = key->array_head + key->array_size -1;
        int kv = key->key_val;
        int dist_left;
        int dist_right;

        if (kv == *elem) {
                /* easy case: if both same, got to be closest */
                return 0;
        } else if (key->array_size == 1) {
                /* easy case: only element got to be closest */
                return 0;
        } else if (elem == arr_first) {
                /* element is the first in array */
                if (kv < *elem) {
                        /* if keyval is less the first element then
                         * first elem is closest.
                         */
                        return 0;
                } else {
                        /* check distance between first and 2nd elem.
                         * if distance with first elem is smaller, it is closest.
                         */
                        dist_left = kv - *elem;
                        dist_right = *(elem+1) - kv;
                        return (dist_left <= dist_right) ? 0:1;
                }
        } else if (elem == arr_last) {
                /* element is the last in array */
                if (kv > *elem) {
                        /* if keyval is larger than the last element then
                         * last elem is closest.
                         */
                        return 0;
                } else {
                        /* check distance between last and last-but-one.
                         * if distance with last elem is smaller, it is closest.
                         */
                        dist_left = kv - *(elem-1);
                        dist_right = *elem - kv;
                        return (dist_right <= dist_left) ? 0:-1;
                }
        }

        /* condition for remaining cases (other cases are handled already):
         * - elem is neither first or last in the array
         * - array has atleast three elements.
         */

        if (kv < *elem) {
                /* keyval is smaller than elem */

                if (kv <= *(elem -1)) {
                        /* keyval is smaller than previous (of "elem") too.
                         * hence, elem cannot be closest.
                         */
                        return -1;
                } else {
                        /* check distance between elem and elem-prev.
                         * if distance with elem is smaller, it is closest.
                         */
                        dist_left = kv - *(elem -1);
                        dist_right = *elem - kv;
                        return (dist_right <= dist_left) ? 0:-1;
                }
        }

        /* remaining case: (keyval > *elem) */

        if (kv >= *(elem+1)) {
                /* keyval is larger than next (of "elem") too.
                 * hence, elem cannot be closest.
                 */
                return 1;
        }

        /* check distance between elem and elem-next.
         * if distance with elem is smaller, it is closest.
         */
        dist_right = *(elem+1) - kv;
        dist_left = kv - *elem;
        return (dist_left <= dist_right) ? 0:1;
}


int main(int argc, char **argv) {
        int arr[] = {10, 20, 30, 40, 50, 60, 70};
        int *found;
        struct key k;

        if (argc < 2) {
                return 1;
        }

        k.key_val = atoi(argv[1]);
        k.array_head = arr;
        k.array_size = sizeof(arr)/sizeof(int);

        found = (int*)bsearch(&k, arr, sizeof(arr)/sizeof(int), sizeof(int),
                compar);

        if(found) {
                printf("found closest: %d\n", *found);
        } else {
                printf("closest not found. absurd! \n");
        }

        return 0;
}
See match.closest() in the MALDIquant package:

> library(MALDIquant)
> match.closest(x, w)
[1] 3

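Based on @neal fultz's answer, here is a simple function that uses findInterval():

get_nearest_index

I had a similar idea, but since the OP wants the index into the vector, I would do it like this: dt = data.table(w, key = "w"); dt[J(x), .I, roll = "nearest"][[2]]

@Arun: so doing attributes(dt)

Hmm, it does look that way. The unsorted check is not called unless the attribute is set. I wonder why. I think there might be a speedup here. I will check.

In J(x), what is J, in dt[J(x), roll = "nearest"]?

@ConnerM. It is a shortcut in data.table. An alternative to J is now available as well.

Just saw this. data.table is the way to go!

findInterval {base}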
R>findInterval(4.5, c(1,2,4,5,6))
[1] 3
bsearch7 <-
     function(val, tab, L=1L, H=length(tab))
{
     b <- cbind(L=rep(L, length(val)), H=rep(H, length(val)))
     i0 <- seq_along(val)
     repeat {
         updt <- M <- b[i0,"L"] + (b[i0,"H"] - b[i0,"L"]) %/% 2L
         tabM <- tab[M]
         val0 <- val[i0]
         i <- tabM < val0
         updt[i] <- M[i] + 1L
         i <- tabM > val0
         updt[i] <- M[i] - 1L
         b[i0 + i * length(val)] <- updt
         i0 <- which(b[i0, "H"] >= b[i0, "L"])
         if (!length(i0)) break;
     }
     b[,"L"] - 1L
} 
NearestValueSearch = function(x, w){
  ## A simple binary search algo
  ## Assume the w vector is sorted so we can use binary search
  left = 1
  right = length(w)
  while(right - left > 1){
    middle = floor((left + right) / 2)
    if(x < w[middle]){
      right = middle
    }
    else{
      left = middle
    }
  }
  if(abs(x - w[right]) < abs(x - w[left])){
    return(right)
  }
  else{
    return(left)
  }
}


x = 4.5
w = c(1,2,4,6,7)
NearestValueSearch(x, w) # return 3