R: Fast calculation of Euclidean distance

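I want to calculate the Euclidean distance between the rows of a data frame with 30,000 observations. A simple way to do this is the dist function (e.g. dist(data)). However, because my data frame is large, this takes too much time.

Some of the rows contain missing values. I do not need the distance between two rows that both contain missing values, nor the distance between two rows that are both complete. I only need the distances between a row with missing values and a complete row.

In a for loop, I tried to exclude the combinations that I do not need. Unfortunately, my solution takes even more time: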

# Some example data
data <- data.frame(
  x1 = c(1, 22, NA, NA, 15, 7, 10, 8, NA, 5),
  x2 = c(11, 2, 7, 15, 1, 17, 11, 18, 5, 5),
  x3 = c(21, 5, 6, NA, 10, 22, 12, 2, 12, 3),
  x4 = c(13, NA, NA, 20, 12, 5, 1, 8, 7, 14)
)


# Measure speed of dist() function
start_time_dist <- Sys.time()

# Calculate euclidean distance with dist() function for complete dataset
dist_results <- dist(data)

end_time_dist <- Sys.time()
time_taken_dist <- end_time_dist - start_time_dist


# Measure speed of my own loop
start_time_own <- Sys.time()

# Calculate euclidean distance with my own loop only for specific cases

# # # 
# The following code should be faster!
# # # 

data_cc <- data[complete.cases(data), ]
data_miss <- data[complete.cases(data) == FALSE, ]

distance_list <- list()

for(i in 1:nrow(data_miss)) {

  distances <- numeric()
  for(j in 1:nrow(data_cc)) {
    distances <- c(distances, dist(rbind(data_miss[i, ], data_cc[j, ]), method = "euclidean"))
  }

  distance_list[[i]] <- distances
}

end_time_own <- Sys.time()
time_taken_own <- end_time_own - start_time_own


# Compare speed of both calculations
time_taken_dist # 0.002001047 secs
time_taken_own # 0.01562881 secs

I suggest using parallel computing. Put all of your code into one function and run it in parallel.

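By default, R does all of its calculations in a single thread, so you have to add parallel workers yourself. Starting a cluster in R takes time, but if you have a large data frame, the main job can run about (number of your processors - 1) times faster.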

These links may also help: (links missing)

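A good option is to split your job into smaller packets and calculate each packet in its own thread. Create the threads only once, because creating them is time-consuming in R.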

library(parallel)
library(foreach)
library(doParallel)
# Load any other packages that your own function needs as well.

# Number of workers: keep one processor free for the system
# (the max() guard avoids zero workers on a single-core machine)
no_cores <- max(1, detectCores() - 1, na.rm = TRUE)
start.t.total <- Sys.time()
print(start.t.total)

# Start the parallel workers once; their print() output goes to the debug file
cl <- makeCluster(no_cores, outfile = "mycalculation_debug.txt")
registerDoParallel(cl)

# The results of all packets end up in out.df
out.df <- foreach(p = 1:no_cores,
                  .combine = rbind,   # results from the workers are stacked into one table
                  .packages = c(),    # name every package that your function uses here
                  .inorder = TRUE) %dopar% {   # don't forget the %dopar% operator
  tryCatch({
    #
    # Put your own function here and compute packet p in parallel.
    # The value of the last expression is returned for this packet.
    #
    print(paste(date(), "packet", p, "done after", format(Sys.time() - start.t.total)))
    data.frame(packet = p)   # placeholder result, replace with your own
  }, error = function(e) {
    paste0("The variable '", p, "' caused the error: '", conditionMessage(e), "'")
  })
}
stopCluster(cl)
gc()   # free the memory left over from the stopped workers
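
To make the "split the job into packets" idea concrete for the data in the question, here is a minimal sketch. It swaps foreach for parLapply from the base parallel package so that no extra packages are required. The helper dist_block and the fixed worker count of 2 are assumptions made for illustration, not part of the original answer; in practice the worker count would come from detectCores() - 1.

```r
library(parallel)

# Example data from the question
data <- data.frame(
  x1 = c(1, 22, NA, NA, 15, 7, 10, 8, NA, 5),
  x2 = c(11, 2, 7, 15, 1, 17, 11, 18, 5, 5),
  x3 = c(21, 5, 6, NA, 10, 22, 12, 2, 12, 3),
  x4 = c(13, NA, NA, 20, 12, 5, 1, 8, 7, 14)
)

is_cc     <- complete.cases(data)
data_cc   <- as.matrix(data[is_cc, ])    # complete rows
data_miss <- as.matrix(data[!is_cc, ])   # rows with missing values

# Hypothetical helper: distances from one packet of incomplete rows
# to every complete row (one result row per incomplete row)
dist_block <- function(rows, data_miss, data_cc) {
  do.call(rbind, lapply(rows, function(i) {
    apply(data_cc, 1, function(y) {
      as.numeric(stats::dist(rbind(data_miss[i, ], y)))
    })
  }))
}

n_workers <- 2                     # assumption for this sketch
cl <- makeCluster(n_workers)       # create the workers only once

# One packet of row indices per worker
chunks <- splitIndices(nrow(data_miss), n_workers)
res <- parLapply(cl, chunks, dist_block,
                 data_miss = data_miss, data_cc = data_cc)
stopCluster(cl)

# Rows = incomplete observations, columns = complete observations
distances <- do.call(rbind, res)
```

With only ten rows the cluster start-up time dominates, so a speed-up should only be expected on data of roughly the size described in the question.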

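dist is implemented in C, so of course it is faster than an R for loop. You should implement your loop in Rcpp.

Thanks for the hint! I will try to figure out how this works.

Thank you very much for your answer, it helped me a lot! I did not even know that this is possible in R, and I will try to implement your solution!

I think the amap package could be very helpful here if you do not want to write your own function; take a look at this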
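
As the comments point out, the loop is slow mainly because it runs in R rather than in compiled code. Besides Rcpp or amap, one way to avoid the R-level loop is to express all required distances as matrix products. The sketch below is an illustration, not something proposed in the thread: the helper name dist_miss_to_complete is hypothetical, and it assumes the same missing-value convention that dist() documents (columns with NA are skipped and the sum is scaled up by the proportion of columns used).

```r
# Example data from the question
data <- data.frame(
  x1 = c(1, 22, NA, NA, 15, 7, 10, 8, NA, 5),
  x2 = c(11, 2, 7, 15, 1, 17, 11, 18, 5, 5),
  x3 = c(21, 5, 6, NA, 10, 22, 12, 2, 12, 3),
  x4 = c(13, NA, NA, 20, 12, 5, 1, 8, 7, 14)
)

# Hypothetical helper: distances between every incomplete row and
# every complete row, computed without an explicit loop
dist_miss_to_complete <- function(data) {
  X     <- as.matrix(data)
  is_cc <- complete.cases(X)
  C     <- X[is_cc, , drop = FALSE]    # complete rows
  M     <- X[!is_cc, , drop = FALSE]   # rows with at least one NA

  obs <- !is.na(M)                     # TRUE where a value is observed
  W   <- obs + 0                       # numeric 0/1 copy of the mask
  M0  <- M
  M0[!obs] <- 0                        # zero out the missing entries

  # Sum over observed columns k of (m_ik - c_jk)^2, expanded as
  # sum(m^2) - 2 * sum(m * c) + sum(c^2), restricted to observed columns
  sq <- rowSums(M0^2) - 2 * tcrossprod(M0, C) + tcrossprod(W, C^2)
  sq <- pmax(sq, 0)                    # guard against tiny negative rounding errors

  # Scale by (number of columns / number of columns used), as dist() does
  n_obs <- rowSums(obs)
  d <- sqrt(sq * (ncol(X) / n_obs))
  d[n_obs == 0, ] <- NA                # rows with no observed value at all
  d
}

d <- dist_miss_to_complete(data)

# Reference values from dist() for the same pairs of rows
cc  <- complete.cases(data)
ref <- as.matrix(dist(data))[!cc, cc]
```

Comparing d with ref, for example with all.equal(unname(d), unname(ref)), is a quick way to check the sketch against dist(). The result has one row per incomplete observation and one column per complete observation, so for 30,000 rows it can still be large and may need to be computed in blocks.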