Similar string comparison in Java

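I want to compare several strings to each other and find the ones that are most similar. Is there a library, method, or best practice that would tell me which strings are more similar to other strings? For example:

  • "The quick fox jumped" -> "The fox jumped"
  • "The quick fox jumped" -> "The fox"
This comparison should report that the first pair is more similar than the second.

I guess I need some method such as:

double similarityIndex(String s1, String s2)
Is there such a thing somewhere?


EDIT: Why am I doing this? I am writing a script that compares the output of an MS Project file to the output of a legacy system that handles tasks. Because the legacy system has a very limited field width, descriptions are abbreviated when the values are added. I want a semi-automated way to find which MS Project entries are similar to the entries in the legacy system, so that I can get the generated keys. It has drawbacks, since the matches still have to be checked manually, but it would save a lot of work.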

You can use the Levenshtein distance to calculate the difference between two strings.

In theory, you can compare edit distances.

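Yes, there are many well-documented algorithms, such as:

  • Cosine similarity
  • Jaccard similarity
  • Dice's coefficient
  • Matching similarity
  • Overlap similarity
  • etc.
A good summary is "Sam's String Metrics" (the original link is dead, so it pointed to the Internet Archive).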
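
As a rough illustration of two of the set-based measures listed above, the sketch below computes Jaccard similarity and Dice's coefficient over character bigrams. Using bigrams as the set elements is an assumption made here (word tokens are another common choice), and the class and method names are hypothetical, not taken from any library.

```java
import java.util.HashSet;
import java.util.Set;

/** Set-based similarity on character bigrams (illustrative sketch, hypothetical names). */
final class BigramSimilarity {

    /** Collects the distinct two-character substrings of s. */
    static Set<String> bigrams(String s) {
        Set<String> result = new HashSet<>();
        for (int i = 0; i + 2 <= s.length(); i++) {
            result.add(s.substring(i, i + 2));
        }
        return result;
    }

    /** Number of bigrams that occur in both sets. */
    private static int intersectionSize(Set<String> a, Set<String> b) {
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }

    /** Jaccard similarity: |A ∩ B| / |A ∪ B|. */
    static double jaccard(String s1, String s2) {
        Set<String> a = bigrams(s1), b = bigrams(s2);
        // Strings shorter than two characters have no bigrams; fall back to equality.
        if (a.isEmpty() || b.isEmpty()) return s1.equals(s2) ? 1.0 : 0.0;
        int common = intersectionSize(a, b);
        return common / (double) (a.size() + b.size() - common);
    }

    /** Dice's coefficient: 2 |A ∩ B| / (|A| + |B|). */
    static double dice(String s1, String s2) {
        Set<String> a = bigrams(s1), b = bigrams(s2);
        if (a.isEmpty() || b.isEmpty()) return s1.equals(s2) ? 1.0 : 0.0;
        return 2.0 * intersectionSize(a, b) / (a.size() + b.size());
    }

    public static void main(String[] args) {
        System.out.println(jaccard("night", "nacht")); // shared bigram: "ht" only
        System.out.println(dice("night", "nacht"));
    }
}
```

Unlike edit distance, these measures ignore the order in which the bigrams appear, so they tend to be more forgiving of reordered words and less sensitive to where a difference occurs.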

Also check these projects:


This is usually done with an edit distance measure. Searching for "edit distance java" turns up a number of libraries.

If your strings turn into whole documents, this sounds like a plagiarism finder to me. Searching with that term may turn up something good.

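"Programming Collective Intelligence" has a chapter on determining whether two documents are similar. The code is written in Python, but it is clean and easy to port.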

I translated the Levenshtein distance code into JavaScript:

String.prototype.LevenshteinDistance = function (s2) {
    var array = new Array(this.length + 1);
    for (var i = 0; i < this.length + 1; i++)
        array[i] = new Array(s2.length + 1);

    for (var i = 0; i < this.length + 1; i++)
        array[i][0] = i;
    for (var j = 0; j < s2.length + 1; j++)
        array[0][j] = j;

    for (var i = 1; i < this.length + 1; i++) {
        for (var j = 1; j < s2.length + 1; j++) {
            if (this[i - 1] == s2[j - 1]) array[i][j] = array[i - 1][j - 1];
            else {
                array[i][j] = Math.min(array[i][j - 1] + 1, array[i - 1][j] + 1);
                array[i][j] = Math.min(array[i][j], array[i - 1][j - 1] + 1);
            }
        }
    }
    return array[this.length][s2.length];
};
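A common way to express the similarity between two strings on a 0%-100% scale is to measure how much (in %) of the longer string you would have to change to turn it into the shorter one: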

/**
 * Calculates the similarity (a number within 0 and 1) between two strings.
 */
public static double similarity(String s1, String s2) {
  String longer = s1, shorter = s2;
  if (s1.length() < s2.length()) { // longer should always have greater length
    longer = s2; shorter = s1;
  }
  int longerLength = longer.length();
  if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
  return (longerLength - editDistance(longer, shorter)) / (double) longerLength;
}
// you can use StringUtils.getLevenshteinDistance() as the editDistance() function
// full copy-paste working code is below
Output:

1.000 is the similarity between "" and ""
0.100 is the similarity between "1234567890" and "1"
0.300 is the similarity between "1234567890" and "123"
0.700 is the similarity between "1234567890" and "1234567"
1.000 is the similarity between "1234567890" and "1234567890"
0.800 is the similarity between "1234567890" and "1234567980"
0.857 is the similarity between "47/2010" and "472010"
0.714 is the similarity between "47/2010" and "472011"
0.000 is the similarity between "47/2010" and "AB.CDEF"
0.125 is the similarity between "47/2010" and "4B.CDEFG"
0.000 is the similarity between "47/2010" and "AB.CDEFG"
0.700 is the similarity between "The quick fox jumped" and "The fox jumped"
0.350 is the similarity between "The quick fox jumped" and "The fox"
0.571 is the similarity between "kitten" and "sitting"
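
To connect the similarity score back to the original question (picking the closest entry from a list of candidates), here is a minimal sketch. The `BestMatch` class and its `mostSimilar` method are hypothetical names introduced for illustration, and `editDistance` is a plain two-row Levenshtein implementation included only to keep the example self-contained.

```java
import java.util.Arrays;
import java.util.List;

/** Picks the candidate most similar to a query string (illustrative sketch). */
final class BestMatch {

    /** Levenshtein distance using two rolling rows of the DP table. */
    static int editDistance(String s1, String s2) {
        int[] prev = new int[s2.length() + 1];
        int[] curr = new int[s2.length() + 1];
        for (int j = 0; j <= s2.length(); j++) prev[j] = j;
        for (int i = 1; i <= s1.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= s2.length(); j++) {
                int substitution = prev[j - 1]
                        + (s1.charAt(i - 1) == s2.charAt(j - 1) ? 0 : 1);
                int insertOrDelete = Math.min(prev[j], curr[j - 1]) + 1;
                curr[j] = Math.min(substitution, insertOrDelete);
            }
            int[] tmp = prev; prev = curr; curr = tmp; // reuse the rows
        }
        return prev[s2.length()];
    }

    /** Same normalization as above: 1 - distance / length of the longer string. */
    static double similarity(String s1, String s2) {
        int longerLength = Math.max(s1.length(), s2.length());
        if (longerLength == 0) return 1.0; // both strings are empty
        return (longerLength - editDistance(s1, s2)) / (double) longerLength;
    }

    /** Returns the highest-scoring candidate, or null if the list is empty. */
    static String mostSimilar(String query, List<String> candidates) {
        String best = null;
        double bestScore = -1.0;
        for (String candidate : candidates) {
            double score = similarity(query, candidate);
            if (score > bestScore) {
                bestScore = score;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList("sitting", "mitten", "dog");
        System.out.println(mostSimilar("kitten", candidates));
    }
}
```

For the semi-automated matching described in the question, it would probably make sense to also return the score and flag matches below some threshold for manual review; the threshold value would have to be tuned on real data.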

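Thanks to the first answerer. I think computeEditDistance(s1, s2) was being calculated twice, and since that call is expensive I decided to improve the performance of the code. So: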

public class LevenshteinDistance {

public static int computeEditDistance(String s1, String s2) {
    s1 = s1.toLowerCase();
    s2 = s2.toLowerCase();

    int[] costs = new int[s2.length() + 1];
    for (int i = 0; i <= s1.length(); i++) {
        int lastValue = i;
        for (int j = 0; j <= s2.length(); j++) {
            if (i == 0) {
                costs[j] = j;
            } else {
                if (j > 0) {
                    int newValue = costs[j - 1];
                    if (s1.charAt(i - 1) != s2.charAt(j - 1)) {
                        newValue = Math.min(Math.min(newValue, lastValue),
                                costs[j]) + 1;
                    }
                    costs[j - 1] = lastValue;
                    lastValue = newValue;
                }
            }
        }
        if (i > 0) {
            costs[s2.length()] = lastValue;
        }
    }
    return costs[s2.length()];
}

public static void printDistance(String s1, String s2) {
    double similarityOfStrings = 0.0;
    int editDistance = 0;
    if (s1.length() < s2.length()) { // s1 should always be bigger
        String swap = s1;
        s1 = s2;
        s2 = swap;
    }
    int bigLen = s1.length();
    editDistance = computeEditDistance(s1, s2);
    if (bigLen == 0) {
        similarityOfStrings = 1.0; /* both strings are zero length */
    } else {
        similarityOfStrings = (bigLen - editDistance) / (double) bigLen;
    }
    //////////////////////////
    //System.out.println(s1 + "-->" + s2 + ": " +
      //      editDistance + " (" + similarityOfStrings + ")");
    System.out.println(editDistance + " (" + similarityOfStrings + ")");
}

public static void main(String[] args) {
    printDistance("", "");
    printDistance("1234567890", "1");
    printDistance("1234567890", "12");
    printDistance("1234567890", "123");
    printDistance("1234567890", "1234");
    printDistance("1234567890", "12345");
    printDistance("1234567890", "123456");
    printDistance("1234567890", "1234567");
    printDistance("1234567890", "12345678");
    printDistance("1234567890", "123456789");
    printDistance("1234567890", "1234567890");
    printDistance("1234567890", "1234567980");

    printDistance("47/2010", "472010");
    printDistance("47/2010", "472011");

    printDistance("47/2010", "AB.CDEF");
    printDistance("47/2010", "4B.CDEFG");
    printDistance("47/2010", "AB.CDEFG");

    printDistance("The quick fox jumped", "The fox jumped");
    printDistance("The quick fox jumped", "The fox");
    printDistance("The quick fox jumped",
            "The quick fox jumped off the balcany");
    printDistance("kitten", "sitting");
    printDistance("rosettacode", "raisethysword");
    printDistance(new StringBuilder("rosettacode").reverse().toString(),
            new StringBuilder("raisethysword").reverse().toString());
    for (int i = 1; i < args.length; i += 2) {
        printDistance(args[i - 1], args[i]);
    }


 }
}
There are indeed a lot of string similarity measures:

  • Levenshtein edit distance
  • Damerau-Levenshtein distance
  • Jaro-Winkler similarity
  • Longest Common Subsequence edit distance
  • Q-Gram (Ukkonen)
  • n-Gram distance (Kondrak)
  • Jaccard index
  • Sørensen-Dice coefficient
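
To show how one of these measures differs from plain Levenshtein, the sketch below implements the restricted form of the Damerau-Levenshtein distance (often called optimal string alignment), which counts a swap of two adjacent characters as a single edit. This is an illustrative implementation with a hypothetical class name, not code from any particular library, and it is the restricted variant rather than the full Damerau-Levenshtein algorithm.

```java
/** Optimal string alignment distance: Levenshtein plus adjacent transpositions (sketch). */
final class OsaDistance {

    static int distance(String s1, String s2) {
        int[][] d = new int[s1.length() + 1][s2.length() + 1];
        for (int i = 0; i <= s1.length(); i++) d[i][0] = i; // delete all of s1
        for (int j = 0; j <= s2.length(); j++) d[0][j] = j; // insert all of s2

        for (int i = 1; i <= s1.length(); i++) {
            for (int j = 1; j <= s2.length(); j++) {
                int cost = s1.charAt(i - 1) == s2.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(
                        Math.min(d[i - 1][j] + 1,   // deletion
                                 d[i][j - 1] + 1),  // insertion
                        d[i - 1][j - 1] + cost);    // match or substitution
                // Adjacent transposition: "ab" <-> "ba" costs one edit.
                if (i > 1 && j > 1
                        && s1.charAt(i - 1) == s2.charAt(j - 2)
                        && s1.charAt(i - 2) == s2.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
                }
            }
        }
        return d[s1.length()][s2.length()];
    }

    public static void main(String[] args) {
        // The swapped "89" / "98" pair counts as a single edit here.
        System.out.println(distance("1234567890", "1234567980"));
    }
}
```

With plain Levenshtein the pair "1234567890" / "1234567980" has distance 2 (two substitutions), whereas this variant reports 1, which may be a better fit when typing errors such as swapped characters are expected.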