R speed optimization: calculating a weighted column in data.table using a distance matrix


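I am trying to apply weights to a numeric vector in a data.table. The weights are derived from the Euclidean distance between each point and every other point. The closer two points are, the higher the weight assigned to them. If the distance between two points exceeds a threshold, the weight is 0, and the weight assigned to the distance between a point and itself is 10000.

The following code illustrates this: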

library(data.table)
library(dplyr)
library(tictoc)

set.seed(42)
df <- data.table(
    LAT = rnorm(500, 42),
    LONG = rnorm(500, -72),
    points = rnorm(500)
    )
df2 <- copy(df) # for new solution
d <- as.matrix(dist(df[, .(LAT, LONG)])) # compute distance matrix

# function to calculate the weights
func <- function(j, cols, threshold) {
    N <- which(d[j, ] <= threshold) # find points whose distances are below threshold
    K <- (1 / (d[j, N] ^ 2)) # calculate weights, which are inversely proportional to distance, lower distance, higher the weight
    K[which(d[j, N] == 0)] <- 10000 # weight to itself is 10000
    return((K%*% as.matrix(df[N, ..cols])) / sum(K)) # compute weighted point for 1 row
}

tic('Old way')
# compute the weighted point calculation for every row
result <- tapply(1:nrow(df), 1:nrow(df), function(i) func(i, 'points', 0.5))
df[, 'weighted_points' := result] # assign the results back to data.table
toc()

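From a data analyst's point of view, I think you can improve the code by approximating the mean distance and the nearby points.

I once worked with NCDC stations and needed to find the stations close to each other, which was time-consuming because there were so many of them. My idea was this: once I had the coordinates of each point, I simply ranked them and set a threshold for the number of stations I wanted to compute actual weights for.

For example, after ranking, take the 50 nearest points (by rank) and give each of them a weight; every other point gets a weight of 0.

Hope this helps.

Side note: the notation in both math and data.table is x[i, j]. Reversing i and j unnecessarily is bound to cause confusion. If speed is the main concern, you may want to use a specialized package (e.g. …), because the combinatorial problem of comparing N^2 lat-long pairs blows up.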
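The nearest-50 suggestion above can be sketched roughly as follows. This is only one reading of that answer, not the answerer's actual code: the helper name knn_weighted and the variables lat, long, pts and k are made up for illustration, the distance threshold is replaced by the rank cutoff, and the full distance matrix is still computed for simplicity, so the N^2 cost of dist() remains.

```r
# Toy data with the same shape as in the question (base R only)
set.seed(42)
n <- 500
lat <- rnorm(n, 42)
long <- rnorm(n, -72)
pts <- rnorm(n)

d <- as.matrix(dist(cbind(lat, long))) # full n x n distance matrix

k <- 50 # hypothetical cutoff: number of nearest points that receive a weight

# Weighted value for row j using only its k nearest points (itself included)
knn_weighted <- function(j, d, values, k) {
    nearest <- order(d[j, ])[seq_len(k)]       # indices of the k smallest distances
    dist_j <- d[j, nearest]
    w <- ifelse(dist_j == 0, 10000, 1 / dist_j^2) # self-weight 10000, else inverse square
    sum(w * values[nearest]) / sum(w)          # all other points implicitly weigh 0
}

weighted_knn <- vapply(seq_len(n), function(j) knn_weighted(j, d, pts, k), numeric(1))
```

Whether a rank cutoff is an acceptable substitute for the distance threshold depends on how evenly the points are spread; in sparse regions the 50th neighbour may be far away.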
d <- as.matrix(dist(df[, .(LAT, LONG)]))
df2[, 'weighted_points' := points]
dt <- as.data.table(d)
cols <- names(dt)

tic('New way')
# compute the weights
dt[, (cols) := lapply(.SD, function(x) case_when(
    x == 0 ~ 10000, 
    x <= 0.5 ~ 1 / (x^2), 
    TRUE ~ 0)), .SDcols = cols]

# compute the weighted point for each row
for (i in 1L:nrow(dt)) {
    set(df2, i, 'weighted_points', value = sum(df2[['points']] * dt[[i]]) / sum(dt[[i]])) 
}
toc()

round(sum(df$weighted_points - df2$weighted_points), 0)
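Both versions above still compute one weighted value per row. A possible fully vectorised variant builds the weight matrix once and replaces the loop with a single matrix product. The following is a minimal sketch in base R under the same weighting rules (threshold 0.5, weight 10000 at distance 0); the names lat, long, pts, W and weighted_vec are illustrative, and it has not been benchmarked against the code above.

```r
# Toy data with the same shape as in the question (base R only)
set.seed(42)
n <- 500
lat <- rnorm(n, 42)
long <- rnorm(n, -72)
pts <- rnorm(n)

d <- as.matrix(dist(cbind(lat, long))) # symmetric n x n distance matrix
threshold <- 0.5

# Build the whole weight matrix at once
W <- matrix(0, nrow = n, ncol = n)     # weight 0 beyond the threshold
near <- d > 0 & d <= threshold         # pairs within the threshold, excluding distance 0
W[near] <- 1 / d[near]^2               # inverse-square weights
W[d == 0] <- 10000                     # weight of a point to itself

# Row j of W %*% pts is the weighted sum for point j; divide by the row's total weight
weighted_vec <- as.vector(W %*% pts) / rowSums(W)
```

With the data.table from the question, the same expression could be assigned in one step, for example df[, weighted_points := as.vector(W %*% points) / rowSums(W)], assuming W has been built from d as in the sketch.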