Automating (looping) Euclidean distance measurements in R


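AIM: I would like to automate (loop) the code below so that I do not have to run it manually for each sample. I have a bad habit of writing long-winded code in base R and need to start using loops, which I find hard to implement.

DATA: I have two data frames: one of sample data (samples) and one of reference data (ref). Both contain the same variables (x, y, z).

CODE DESCRIPTION: For each sample (samples$sample_name), I want to calculate its Euclidean distance to every case in the reference data. The results are then used to re-order the reference data to show which points are "closest" to the sample point in Euclidean (3D) space.

My current code lets me simply substitute the sample name (i.e. "s1"), re-run the code, and make a final change to the file name of the .csv file. The output is the reference data ordered by closeness to the sample (in Euclidean space).

I would like to automate this process (into a loop?) so that I can run it over both data frames using the list of sample names (samples$sample_name), and hopefully export to .csv files automatically as well.

Any help would be greatly appreciated.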

# Reference data
country<-c("Austria","Austria","Italy","Italy","Turkey","Romania","France")
x<-c(18.881,18.881,18.929,19.139,19.008,19.083,18.883)
y<-c(15.627,15.627,15.654,15.772,15.699,15.741,15.629)
z<-c(38.597,38.597,38.842,39.409,39.048,39.224,38.740)
pb_age<-c(-106,-106,-87,-6,-55,-26,-104)
ref<-data.frame(country,x,y,z,pb_age) # Reference data

# Sample data (for euclidean measurements against Reference data)
sample_name<-c("s1","s2","s3")
x2<-c(18.694,18.729,18.731)
y2<-c(15.682,15.683,15.677)
z2<-c(38.883,38.989,38.891)
pb_age2<-c(120,97,82)
samples<-data.frame(sample_name,x2,y2,z2,pb_age2) # Sample data
colnames(samples)<-c("sample_name","x","y","z","pb_age") # To match Reference data headings

# Euclidean distance measurements
library(fields) # Need package for Euclidean distances

# THIS IS WHAT I WANT TO AUTOMATE/LOOP (BELOW)...
# Currently, I have to update the 'id' for each sample to get a result (for each sample)

id<-"s1"  # Sample ID - this is simply changed so the following code can be re-run for each sample

# The code
x1<-samples[which(samples$sample_name==id),c("x","y","z")]
x2<-ref[,c("x","y","z")]

result_distance<-rdist(x1,x2) # Computing the Euclidean distance
result_distance<-as.vector(result_distance) # Saving the results as a vector

euclid_ref<-data.frame(result_distance,ref) # Creating a new data.frame adding the Euclidean distances to the original Reference data
colnames(euclid_ref)[1]<-"euclid_distance" # Updating the column name for the result

# Saving and exporting the results
results<-euclid_ref[order(euclid_ref$euclid_distance),] # Re-ordering the data.frame by the Euclidean distances, smallest to largest
write.csv(results, file="s1.csv")   # Ideally, I want the file name to be the same as the SAMPLE id, i.e. s1, s2, s3...
A loop is simple enough, but a more R-like solution is to take advantage of vectorization and the apply family of functions:

result_distances <- data.frame(t(rdist(samples[, 2:4], ref[, 2:4])), ref)
colnames(result_distances)[1:3] <- rep("euclid_distance", 3)
# str(result_distances)
# 'data.frame': 7 obs. of  8 variables:
#  $ euclid_distance: num  0.346 0.346 0.24 0.695 0.355 ...
#  $ euclid_distance: num  0.424 0.424 0.25 0.594 0.286 ...
#  $ euclid_distance: num  0.334 0.334 0.205 0.666 0.319 ...
#  $ country        : chr  "Austria" "Austria" "Italy" "Italy" ...
#  $ x              : num  18.9 18.9 18.9 19.1 19 ...
#  $ y              : num  15.6 15.6 15.7 15.8 15.7 ...
#  $ z              : num  38.6 38.6 38.8 39.4 39 ...
#  $ pb_age         : num  -106 -106 -87 -6 -55 -26 -104

Exporting one ordered table per sample then creates 3 files: "s1.csv", "s2.csv" and "s3.csv".
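The export step itself is not shown here. A minimal sketch of one way it could look follows, under a few assumptions: the distances are computed by hand in base R instead of with fields::rdist so that the block runs without extra packages, the sample columns are already named x, y, z, and the names `out_dir` and `written` are made up for illustration (files go to a temporary folder, not the working directory).

```r
# Reference and sample data copied from the question
ref <- data.frame(
  country = c("Austria", "Austria", "Italy", "Italy", "Turkey", "Romania", "France"),
  x = c(18.881, 18.881, 18.929, 19.139, 19.008, 19.083, 18.883),
  y = c(15.627, 15.627, 15.654, 15.772, 15.699, 15.741, 15.629),
  z = c(38.597, 38.597, 38.842, 39.409, 39.048, 39.224, 38.740),
  pb_age = c(-106, -106, -87, -6, -55, -26, -104)
)
samples <- data.frame(
  sample_name = c("s1", "s2", "s3"),
  x = c(18.694, 18.729, 18.731),
  y = c(15.682, 15.683, 15.677),
  z = c(38.883, 38.989, 38.891),
  pb_age = c(120, 97, 82)
)

coords  <- c("x", "y", "z")
ref_mat <- as.matrix(ref[, coords])

# Hypothetical output folder; tempdir() avoids cluttering the working directory
out_dir <- tempdir()

# One pass per sample: compute distances, order the reference data, write a CSV
written <- vapply(seq_len(nrow(samples)), function(i) {
  s <- unlist(samples[i, coords])
  # Subtract sample i from every reference row, then take the row-wise norm
  d <- sqrt(rowSums(sweep(ref_mat, 2, s)^2))
  res <- data.frame(euclid_distance = d, ref)
  res <- res[order(res$euclid_distance), ]
  path <- file.path(out_dir, paste0(samples$sample_name[i], ".csv"))
  write.csv(res, file = path)
  path
}, character(1))
```

`written` holds the three file paths, so it is easy to check afterwards which files were produced. Swapping `out_dir` for a real folder is the only change needed to keep the files.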

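Here is a loop that uses your original input data and the key parts of your code to compute the Euclidean distances of all samples to the reference data locations. It is a bit more verbose than a vectorized apply solution, but may be easier to read because it is less terse and nested. The final output is a single data frame.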

# prepare an empty list object to store the results
output <- vector("list", length = nrow(samples))

  # this is the start of the loop
  for(i in seq_len(nrow(samples))){
   # we can read this as 'for row i of the samples dataframe, do this...'

    # get coords for sample i
    sample_coords <- samples[i ,c("x","y","z")]
    
    # get coords for all reference locations
    # this line would be fine above the loop
    # since it gives the same result for each 
    # iteration. I place it here to echo your
    # original workflow
    ref_coords <- ref[,c("x","y","z")]
    
    # compute Euclidean distance and coerce to vector, 
    e_dist_vec <- as.vector(rdist(sample_coords, ref_coords))
    
    # store in data frame
    e_dist_ref_df <- data.frame(e_dist_vec,  ref) 
    
    # update colname
    colnames(e_dist_ref_df)[1] <- "euclid_distance"
    
    # order df by euclid_distance values
    results <- e_dist_ref_df[order(e_dist_ref_df$euclid_distance),]
    
    #  store results for sample i in the list
    output[[i]] <- results
    
  } # this is the end of the loop


# assign sample names to list items
names(output) <- samples$sample_name
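
Because the list is named by sample, the names can double as file names, which is what the question asked for. The sketch below assumes a small stand-in for `output` (two samples, two rows each, trimmed for brevity) and a hypothetical `out_dir`; it is one possible way to do the export, not part of the original answer.

```r
# Stand-in for the named `output` list built by the loop (trimmed for brevity)
output <- list(
  s1 = data.frame(euclid_distance = c(0.2401874, 0.2428559),
                  country = c("Italy", "France")),
  s2 = data.frame(euclid_distance = c(0.2499000, 0.2856186),
                  country = c("Italy", "Turkey"))
)

# Hypothetical output folder; replace with a real path to keep the files
out_dir <- tempdir()

# Loop over the list names and reuse each name as the file name
for (nm in names(output)) {
  write.csv(output[[nm]],
            file = file.path(out_dir, paste0(nm, ".csv")),
            row.names = FALSE)
}
```
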
Often it is convenient to have this in a single data frame for further analysis; here is one way to do that:

# bind list dfs into one big data frame, not sure what the one-line equivalent in base R is
output_df <- dplyr::bind_rows(output, .id = "sample_id")
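
For anyone who prefers to avoid the dplyr dependency, a base R version is possible, though it takes slightly more than one line. This is a sketch only: it assumes a small stand-in for the named `output` list, and it resets the row names at the end because rbind otherwise builds them from the list names.

```r
# Stand-in for the named `output` list built by the loop (trimmed for brevity)
output <- list(
  s1 = data.frame(euclid_distance = c(0.2401874, 0.2428559),
                  country = c("Italy", "France")),
  s2 = data.frame(euclid_distance = c(0.2499000, 0.2856186),
                  country = c("Italy", "Turkey"))
)

# Prepend the list name to each data frame as a sample_id column,
# then stack the pieces with rbind
output_df <- do.call(
  rbind,
  Map(function(df, id) cbind(sample_id = id, df), output, names(output))
)

# Replace the "s1.1", "s1.2", ... row names with plain 1..n
rownames(output_df) <- NULL
```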

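Thank you. This is a neat way of combining the results into a data frame for quick re-ordering and analysis in R, and it is compact in terms of code too!

Thanks for this. It is great and easy to follow, and it has actually helped me a lot in understanding loops and the earlier solution. Like you, I find printing the output in R really useful for a quick look at the nearest neighbours (I do this a lot), so this is very helpful! I would happily accept both answers as useful (both have their merits), but I can only tick one.

@Rchaeologist No worries, I hope it helps you write other loops in future work!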
> output
$s1
  euclid_distance country      x      y      z pb_age
3       0.2401874   Italy 18.929 15.654 38.842    -87
7       0.2428559  France 18.883 15.629 38.740   -104
1       0.3461069 Austria 18.881 15.627 38.597   -106
2       0.3461069 Austria 18.881 15.627 38.597   -106
5       0.3551197  Turkey 19.008 15.699 39.048    -55
6       0.5206563 Romania 19.083 15.741 39.224    -26
4       0.6948388   Italy 19.139 15.772 39.409     -6

$s2
  euclid_distance country      x      y      z pb_age
3       0.2499000   Italy 18.929 15.654 38.842    -87
5       0.2856186  Turkey 19.008 15.699 39.048    -55
7       0.2977129  France 18.883 15.629 38.740   -104
1       0.4241509 Austria 18.881 15.627 38.597   -106
2       0.4241509 Austria 18.881 15.627 38.597   -106
6       0.4288415 Romania 19.083 15.741 39.224    -26
4       0.5936506   Italy 19.139 15.772 39.409     -6

$s3
  euclid_distance country      x      y      z pb_age
3       0.2052657   Italy 18.929 15.654 38.842    -87
7       0.2195655  France 18.883 15.629 38.740   -104
5       0.3191583  Turkey 19.008 15.699 39.048    -55
1       0.3338203 Austria 18.881 15.627 38.597   -106
2       0.3338203 Austria 18.881 15.627 38.597   -106
6       0.4887627 Romania 19.083 15.741 39.224    -26
4       0.6661929   Italy 19.139 15.772 39.409     -6

> output_df
   sample_id euclid_distance country      x      y      z pb_age
1         s1       0.2401874   Italy 18.929 15.654 38.842    -87
2         s1       0.2428559  France 18.883 15.629 38.740   -104
3         s1       0.3461069 Austria 18.881 15.627 38.597   -106
4         s1       0.3461069 Austria 18.881 15.627 38.597   -106
5         s1       0.3551197  Turkey 19.008 15.699 39.048    -55
6         s1       0.5206563 Romania 19.083 15.741 39.224    -26
7         s1       0.6948388   Italy 19.139 15.772 39.409     -6
8         s2       0.2499000   Italy 18.929 15.654 38.842    -87
9         s2       0.2856186  Turkey 19.008 15.699 39.048    -55
10        s2       0.2977129  France 18.883 15.629 38.740   -104
11        s2       0.4241509 Austria 18.881 15.627 38.597   -106
12        s2       0.4241509 Austria 18.881 15.627 38.597   -106
13        s2       0.4288415 Romania 19.083 15.741 39.224    -26
14        s2       0.5936506   Italy 19.139 15.772 39.409     -6
15        s3       0.2052657   Italy 18.929 15.654 38.842    -87
16        s3       0.2195655  France 18.883 15.629 38.740   -104
17        s3       0.3191583  Turkey 19.008 15.699 39.048    -55
18        s3       0.3338203 Austria 18.881 15.627 38.597   -106
19        s3       0.3338203 Austria 18.881 15.627 38.597   -106
20        s3       0.4887627 Romania 19.083 15.741 39.224    -26
21        s3       0.6661929   Italy 19.139 15.772 39.409     -6