Discrepancy between R and Java + WEKA in computing nearest neighbours


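I am debugging a library against another implementation, both of which compute k-nearest neighbours. I will frame the question around an example that I am struggling to understand.

First I will explain and demonstrate the problem with a toy example, and then show the output that leads to the question.

Task

The demo reads a CSV file containing 10 two-dimensional datapoints. The task is to find the distance from every datapoint to the first datapoint, and to list all the points, together with their distance from the first datapoint, in non-decreasing order of distance.

This is essentially one component of a KNN-based algorithm. When I ran the Java version (part of the library) and then wrote the same thing in R, I found a discrepancy. To demonstrate it, consider the code below.

Code 1: Java + WEKA

The code below uses Java and WEKA. I used LinearNNSearch to compute the nearest neighbours. I used it because it is what the particular library that I am debugging and/or comparing against the R code uses.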

import weka.core.converters.CSVLoader;
import weka.core.Instances;
import weka.core.EuclideanDistance;
import weka.core.neighboursearch.LinearNNSearch;
import java.io.File;

class testnn
{
  public static void main (String args[]) throws Exception
  {
    // Load csv
    CSVLoader loader = new CSVLoader ();
    loader.setSource (new File (args[0]));

    Instances df = loader.getDataSet ();

    // Set the LinearNNSearch object
    EuclideanDistance dist_obj = new EuclideanDistance ();

    LinearNNSearch lnn = new LinearNNSearch ();
    lnn.setDistanceFunction(dist_obj);
    lnn.setInstances(df);
    lnn.setMeasurePerformance(false);

    // Compute the K-nearest neighbours of the first datapoint (index 0).
    Instances knn_pts = lnn.kNearestNeighbours (df.instance (0), df.numInstances ());

    // Get the distances.
    double [] dist_arr = lnn.getDistances ();

    // Print
    System.out.println ("Points sorted in increasing order from ");
    System.out.println (df.instance (0));
    System.out.println ("V1,\t" + "V2,\t" + "dist");
    for (int j = 0; j < knn_pts.numInstances (); j++)
    {
      System.out.println (knn_pts.instance (j) + "," + dist_arr[j]);
    }
  }
}

Code 2: R

To compute the distances I used dist. An alternative distance function gives the same answer.
# Read file
df <- read.csv ("dat.csv", header = TRUE);

# All to all distances, and select distances of points from first datapoint (index 1)
dist_mat <- as.matrix (dist (df, diag=TRUE, upper=TRUE, method="euclidean"));
first_pt_to_all <- dist_mat[,1];

# Sort the distances and also record the ordering
sorted_order <- sort (first_pt_to_all, index.return = TRUE, decreasing = FALSE);

# Prepare dataset with the datapoints ordered in the non-decreasing order of the distance from the first datapoint
# ([-1] drops the first entry, which is the first datapoint itself at distance 0)
df_sorted <- cbind (df[sorted_order$ix[-1],], dist = sorted_order$x[-1]);

# Print
print ("Points sorted in increasing order from ");
print (df[1,]);

print (df_sorted);

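As noted in the comments, the R distances are correct. The problem is the WEKA defaults. You used:
EuclideanDistance dist_obj = new EuclideanDistance ();

EuclideanDistance in WEKA has default parameters. One of them is DontNormalize=FALSE, i.e. by default WEKA normalizes the data before computing distances. I am not much help with Java, so I will do this in R. If you scale the data so that each variable has a minimum of zero and a maximum of one, you get the distances that WEKA reports: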
NData = Data
NData[,1] = (NData[,1]-min(NData[,1]))/(max(NData[,1])-min(NData[,1]))
NData[,2] = (NData[,2]-min(NData[,2]))/(max(NData[,2])-min(NData[,2]))
dist(NData)

These distances match the ones you showed for WEKA. To get the same results as R, look at the parameters of EuclideanDistance in WEKA.
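For readers who would rather see the effect in Java, here is a minimal sketch in plain Java (no WEKA) of the same min-max rescaling followed by a Euclidean distance. The first two points are the ones quoted in the comments; the other two are made up purely so that each column has a range, and the class and method names are hypothetical. It is meant to illustrate why the two sets of distances differ, not to reproduce WEKA's internals exactly.

```java
import java.util.Arrays;

class NormalizedDistanceDemo {

    // Plain Euclidean distance between two points of equal dimension.
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Rescale every column to [0, 1] using that column's min and max.
    // Assumes each column has at least two distinct values (max > min).
    static double[][] minMaxScale(double[][] rows) {
        int cols = rows[0].length;
        double[] min = new double[cols];
        double[] max = new double[cols];
        Arrays.fill(min, Double.POSITIVE_INFINITY);
        Arrays.fill(max, Double.NEGATIVE_INFINITY);
        for (double[] row : rows) {
            for (int c = 0; c < cols; c++) {
                min[c] = Math.min(min[c], row[c]);
                max[c] = Math.max(max[c], row[c]);
            }
        }
        double[][] scaled = new double[rows.length][cols];
        for (int r = 0; r < rows.length; r++) {
            for (int c = 0; c < cols; c++) {
                scaled[r][c] = (rows[r][c] - min[c]) / (max[c] - min[c]);
            }
        }
        return scaled;
    }

    public static void main(String[] args) {
        double[][] pts = {
            {0.560954, 0.313231}, // test point quoted in the comments
            {0.866816, 0.476897}, // first neighbour quoted in the comments
            {0.100000, 0.900000}, // hypothetical point, only here to give the columns a range
            {0.300000, 0.050000}  // hypothetical point, only here to give the columns a range
        };
        double[][] scaled = minMaxScale(pts);

        // The raw distance is what R's dist() computes on unscaled data.
        System.out.println("raw distance:    " + euclidean(pts[0], pts[1]));
        // The scaled distance differs because each column was stretched to [0, 1].
        System.out.println("scaled distance: " + euclidean(scaled[0], scaled[1]));
    }
}
```

The two printed values differ even though the points are the same, which is the kind of gap seen between the R output and the WEKA output in the question.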
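It is easy to see that the R distances are correct. For example, using just the test point and the first point in the list: p1 = c(0.560954, 0.313231); p2 = c(0.866816, 0.476897); sqrt(sum((p1-p2)*(p1-p2))); [1] 0.3468979

@G5W No doubt the R distances are correct. The question remains, though: what is wrong with WEKA? Or is it being used incorrectly?

I just checked, and getDontNormalize() returns false. Let me look into this some more.

Yes, so I need to find a way to stop it from normalizing. Thanks for the lead. It is confusing because getDontNormalize() returns false.

Yes, it is a bit of a double negative. DontNormalize=FALSE is the same as Normalize=TRUE (but the parameter is not called Normalize).

Now I have set setDontNormalize to TRUE. Last time I forgot to register it after setting the flag. A trivial thing that took a lot of time. Thanks for tracking it down.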