Implementing a simple Trie for efficient Levenshtein Distance calculation - Java
UPDATE 3
Done. Below is the final code that passes all of my tests. Once again, it is modeled after Murilo Vasconcelo's modified version of Steve Hanov's algorithm. Thanks to everyone who helped!
/**
* Computes the minimum Levenshtein Distance between the given word (represented as an array of Characters) and the
* words stored in theTrie. This algorithm is modeled after Steve Hanov's blog article "Fast and Easy Levenshtein
* distance using a Trie" and Murilo Vasconcelo's revised version in C++.
*
* http://stevehanov.ca/blog/index.php?id=114
* http://murilo.wordpress.com/2011/02/01/fast-and-easy-levenshtein-distance-using-a-trie-in-c/
*
* @param ArrayList<Character> word - the characters of an input word as an array representation
* @return int - the minimum Levenshtein Distance
*/
private int computeMinimumLevenshteinDistance(ArrayList<Character> word) {
    theTrie.minLevDist = Integer.MAX_VALUE;
    int iWordLength = word.size();
    // First row of the distance matrix: cost of building each prefix of the word from nothing.
    int[] currentRow = new int[iWordLength + 1];
    for (int i = 0; i <= iWordLength; i++) {
        currentRow[i] = i;
    }
    for (int i = 0; i < iWordLength; i++) {
        traverseTrie(theTrie.root, word.get(i), word, currentRow);
    }
    return theTrie.minLevDist;
}
/**
* Recursive helper function. Traverses theTrie in search of the minimum Levenshtein Distance.
*
* @param TrieNode node - the current TrieNode
* @param char letter - the current character of the current word we're working with
* @param ArrayList<Character> word - an array representation of the current word
* @param int[] previousRow - a row in the Levenshtein Distance matrix
*/
private void traverseTrie(TrieNode node, char letter, ArrayList<Character> word, int[] previousRow) {
    int size = previousRow.length;
    int[] currentRow = new int[size];
    currentRow[0] = previousRow[0] + 1;
    int minimumElement = currentRow[0];
    int insertCost, deleteCost, replaceCost;
    // Fill in one row of the distance matrix for the current trie letter.
    for (int i = 1; i < size; i++) {
        insertCost = currentRow[i - 1] + 1;
        deleteCost = previousRow[i] + 1;
        if (word.get(i - 1) == letter) {
            replaceCost = previousRow[i - 1];
        } else {
            replaceCost = previousRow[i - 1] + 1;
        }
        currentRow[i] = minimum(insertCost, deleteCost, replaceCost);
        if (currentRow[i] < minimumElement) {
            minimumElement = currentRow[i];
        }
    }
    // The last cell is the distance to the word that ends at this node, if any.
    if (currentRow[size - 1] < theTrie.minLevDist && node.isWord) {
        theTrie.minLevDist = currentRow[size - 1];
    }
    // Descend only while some cell in this row can still beat the best distance so far.
    if (minimumElement < theTrie.minLevDist) {
        for (Character c : node.children.keySet()) {
            traverseTrie(node.children.get(c), c, word, currentRow);
        }
    }
}
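For readers who want something they can compile on its own, here is a minimal sketch of the same row-by-row search. The class name `MinDistanceTrie` and its methods are illustrative assumptions, not part of the code above. Unlike the listing above, which calls the helper on the root once per input letter, this sketch starts the recursion from each child of the root, as Hanov's Python version does.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch: a tiny trie with a Hanov-style minimum-distance search. */
class MinDistanceTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();
    private int best; // smallest distance found during the current search

    void insert(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    /** Returns the smallest Levenshtein distance to any stored word, or Integer.MAX_VALUE if the trie is empty. */
    int minDistance(String query) {
        best = Integer.MAX_VALUE;
        int[] firstRow = new int[query.length() + 1];
        for (int i = 0; i < firstRow.length; i++) {
            firstRow[i] = i;
        }
        if (root.isWord) {
            best = query.length(); // the empty string is stored
        }
        // Start from each child of the root, one matrix row per trie letter.
        for (Map.Entry<Character, Node> e : root.children.entrySet()) {
            search(e.getValue(), e.getKey(), query, firstRow);
        }
        return best;
    }

    private void search(Node node, char letter, String query, int[] previousRow) {
        int size = previousRow.length;
        int[] currentRow = new int[size];
        currentRow[0] = previousRow[0] + 1;
        int rowMin = currentRow[0];
        for (int i = 1; i < size; i++) {
            int insertCost = currentRow[i - 1] + 1;
            int deleteCost = previousRow[i] + 1;
            int replaceCost = previousRow[i - 1] + (query.charAt(i - 1) == letter ? 0 : 1);
            currentRow[i] = Math.min(insertCost, Math.min(deleteCost, replaceCost));
            rowMin = Math.min(rowMin, currentRow[i]);
        }
        if (node.isWord && currentRow[size - 1] < best) {
            best = currentRow[size - 1];
        }
        // Prune: no descendant can beat the best distance if every cell is already too large.
        if (rowMin < best) {
            for (Map.Entry<Character, Node> e : node.children.entrySet()) {
                search(e.getValue(), e.getKey(), query, currentRow);
            }
        }
    }
}
```

With "arb" and "area" stored, `minDistance("arc")` evaluates to 1, because "arb" is one substitution away.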
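I've implemented the algorithm described in the article "Fast and Easy Levenshtein distance using a Trie" in C++, and it is really fast. If you want (and understand C++ better than Python), I can paste the code somewhere.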
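EDIT:
I posted it on my site. Here is an example (EDIT: moved to). These may also be helpful:
EDIT: The links above appear to have moved to GitHub:
It looks like the experimental Lucene code is based on the package
Usage appears to be something like the following: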
LevenshteinAutomata builder = new LevenshteinAutomata(s);
Automaton automata = builder.toAutomaton(n);
boolean result1 = BasicOperations.run(automata, "foo");
boolean result2 = BasicOperations.run(automata, "bar");
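From what I can tell, you don't need to improve the efficiency of the Levenshtein distance itself. You need to store your strings in a structure that spares you from running the distance computation so many times, i.e. by pruning the search space.
Since Levenshtein distance is a metric, you can use any metric space index that takes advantage of the triangle inequality. You mentioned BK-trees, but there are others, such as vantage point trees, fixed-queries trees, bisector trees, and spatial approximation trees. Here are their descriptions: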
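Burkhard-Keller Tree
Nodes are inserted into the tree as follows: for the root node, pick an arbitrary element from the space; add unique edge-labeled children such that the value of each edge is the distance from the pivot to that element; apply recursively, selecting the child as the pivot when an edge already exists.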
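Fixed-Queries Tree
As with BKTs, except: elements are stored at the leaves; each leaf has multiple elements; for each level of the tree, the same pivot is used.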
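Bisector Tree
Each node contains two pivot elements with their covering radius (the maximum distance between the centre element and any of its subtree elements); filter the elements into two sets, those closest to the first pivot and those closest to the second, and recursively build two subtrees from these sets.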
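Spatial Approximation Tree
Initially all elements are in a bag; choose an arbitrary element to be the pivot; build a collection of nearest neighbours within range of the pivot; put each remaining element into the bag of the element nearest to it from the collection just built; recursively form a subtree from each element of this collection.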
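Vantage Point Tree
Choose a pivot from the set arbitrarily; calculate the median distance between this pivot and each element of the remaining set; filter elements from the set into left and right recursive subtrees, such that those with distances less than or equal to the median form the left subtree and those with greater distances form the right.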
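Of the structures described above, the Burkhard-Keller tree is probably the easiest to sketch. The following is a minimal illustration, assuming a plain dynamic-programming Levenshtein function as the metric; the class `BkTree` and its method names are hypothetical and do not come from any library. The search relies on the triangle inequality: if the query is at distance d from a node, only children on edges labeled between d - maxDistance and d + maxDistance can contain matches.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch of a Burkhard-Keller tree over strings. */
class BkTree {
    private static final class Node {
        final String word;
        final Map<Integer, Node> children = new HashMap<>(); // edge label = distance to this node
        Node(String word) { this.word = word; }
    }

    private Node root;

    /** Plain two-row dynamic-programming Levenshtein distance. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = cur;
        }
        return prev[b.length()];
    }

    void add(String word) {
        if (root == null) {
            root = new Node(word);
            return;
        }
        Node node = root;
        while (true) {
            int d = levenshtein(word, node.word);
            if (d == 0) {
                return; // already stored
            }
            Node child = node.children.get(d);
            if (child == null) {
                node.children.put(d, new Node(word));
                return;
            }
            node = child; // an edge with this label exists: the child becomes the pivot
        }
    }

    List<String> search(String query, int maxDistance) {
        List<String> results = new ArrayList<>();
        if (root != null) {
            collect(root, query, maxDistance, results);
        }
        return results;
    }

    private void collect(Node node, String query, int maxDistance, List<String> results) {
        int d = levenshtein(query, node.word);
        if (d <= maxDistance) {
            results.add(node.word);
        }
        // Triangle inequality: only edges in [d - maxDistance, d + maxDistance] can hold matches.
        for (int edge = Math.max(1, d - maxDistance); edge <= d + maxDistance; edge++) {
            Node child = node.children.get(edge);
            if (child != null) {
                collect(child, query, maxDistance, results);
            }
        }
    }
}
```

The order of the returned list depends on the insertion order of the words, so callers that need a stable order should sort the result.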
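My intuition tells me that each TrieNode should store the String it represents, along with references to letters of the alphabet, though not necessarily all of them. Is my intuition correct?
No, a trie doesn't represent a String; it represents a set of strings (and all of their prefixes). A TrieNode maps an input character to another TrieNode. So it should hold something like an array of characters and a corresponding array of TrieNode references. (Maybe not that exact representation, depending on how efficient your particular use of it needs to be.)
Well, here's how I did it a long time ago.
I stored the dictionary as a trie, which is simply a finite-state machine restricted to the form of a tree.
You can enhance it by not making that restriction.
For example, common suffixes can simply be a shared subtree.
You could even have loops, to capture things like "nation", "national", "nationalize", "nationalization", and so on.
Keep the trie as absolutely simple as possible. Don't go stuffing strings in it.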
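As a rough illustration of a node that only maps characters to child nodes and stores no strings, here is a minimal sketch. The names `SimpleTrie`, `contains`, and `hasPrefix` are assumptions made for this example, and a `HashMap` stands in for the parallel arrays mentioned above.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch: a trie whose nodes hold only child links and an end-of-word flag. */
class SimpleTrie {
    private static final class TrieNode {
        final Map<Character, TrieNode> children = new HashMap<>();
        boolean isWord; // true if the path from the root to this node spells a stored word
    }

    private final TrieNode root = new TrieNode();

    void insert(String word) {
        TrieNode node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new TrieNode());
        }
        node.isWord = true;
    }

    /** True only if the exact word was inserted. */
    boolean contains(String word) {
        TrieNode node = find(word);
        return node != null && node.isWord;
    }

    /** True if at least one stored word starts with the given prefix. */
    boolean hasPrefix(String prefix) {
        return find(prefix) != null;
    }

    // Follows one edge per character; returns null as soon as an edge is missing.
    private TrieNode find(String s) {
        TrieNode node = root;
        for (char c : s.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return null;
            }
        }
        return node;
    }
}
```

The word itself is never stored in a node; it is implied by the path from the root, which is why the set of all prefixes comes for free.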
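Remember, you don't do this to find the distance between two given strings. You use it to find the strings in the dictionary that are closest to one given string. The time it takes depends on how much Levenshtein distance you can tolerate. For distance zero, it is simply O(n), where n is the word length. For arbitrary distance, it is O(N), where N is the number of words in the dictionary.
It seems to me that you want to loop over all branches of the trie. That is not difficult with a recursive function. I use a trie in my k-nearest neighbor algorithm as well, with the same kind of function. I don't know Java, but here is some pseudocode: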
function walk (testitem trie)
   make an empty array results
   function compare (testitem children distance)
     if testitem = None
        place the distance and children into results
     else compare(testitem from second position,
                  the sub-children of the first child in children,
                  if the first item of testitem is equal to that
                  of the node of the first child of children
                  add one to the distance (! non-destructive)
                  else just the distance)
        when there are any children left
             compare (testitem, the children without the first item,
                      distance)
   compare(testitem, children of root-node in trie, distance set to 0)
   return the results
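Hope this helps. The function walk takes a testitem (for example an indexable string or an array of characters) and a trie. A trie can be an object with two slots: one specifying the node of the trie, the other the children of that node. The children are tries as well. In Python it would be something like: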
class Trie(object):
    def __init__(self, node=None, children=None):
        self.node = node
        # Avoid a shared mutable default: each Trie gets its own list.
        self.children = children if children is not None else []
Or in Lisp:
(defstruct trie (node nil) (children nil))
Now a trie looks something like this:
(trie #node None
      #children ((trie #node f
                       #children ((trie #node o
                                        #children ((trie #node o
                                                         #children None)))
                                  (trie #node u
                                        #children ((trie #node n
                                                         #children None)))))))
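Now the internal function (which you can also write separately) takes the testitem, the children of the root node of the trie, and a distance set to 0.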
Trie dict = new Trie();
dict.insert("arb");
dict.insert("area");
ArrayList<Character> word = new ArrayList<Character>();
word.add('a');
word.add('r');
word.add('c');
if (word.get(i - 1) == letter) {
    replaceCost = previousRow[i - 1];
} else {
    replaceCost = previousRow[i - 1] + 1;
}
for (int i = 0; i < iWordLength; i++) {
    traverseTrie(theTrie.root, word.get(i), word, currentRow);
}
traverseTrie(theTrie.root, ' ', word, currentRow);
Given:
    dict is a dictionary represented as a DFA (e.g. trie or DAWG)
    dictState is a state in dict
    dictStartState is the start state in dict
    dictAcceptState is a dictState arrived at after following the transitions defined by a word in dict
    editDistance is an edit distance
    laWord is a word
    la is a Levenshtein Automaton defined for laWord and editDistance
    laState is a state in la
    laStartState is the start state in la
    laAcceptState is a laState arrived at after following the transitions defined by a word that is within editDistance of laWord
    charSequence is a sequence of chars
    traversalDataStack is a stack of (dictState, laState, charSequence) tuples
    resultSet is the set of matching words, initially empty

Define dictState as dictStartState
Define laState as laStartState
Push (dictState, laState, "") on to traversalDataStack
While traversalDataStack is not empty
    Define currentTraversalDataTuple as the product of a pop of traversalDataStack
    Define currentDictState as the dictState in currentTraversalDataTuple
    Define currentLAState as the laState in currentTraversalDataTuple
    Define currentCharSequence as the charSequence in currentTraversalDataTuple
    For each char in alphabet
        Check if currentDictState has outgoing transition labeled by char
        Check if currentLAState has outgoing transition labeled by char
        If both currentDictState and currentLAState have outgoing transitions labeled by char
            Define newDictState as the state arrived at after following the outgoing transition of currentDictState labeled by char
            Define newLAState as the state arrived at after following the outgoing transition of currentLAState labeled by char
            Define newCharSequence as concatenation of currentCharSequence and char
            Push (newDictState, newLAState, newCharSequence) on to traversalDataStack
            If newDictState is a dictAcceptState, and if newLAState is a laAcceptState
                Add newCharSequence to resultSet
            endIf
        endIf
    endFor
endWhile
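The traversal above can be approximated in Java without building an explicit Levenshtein automaton. In this sketch, which is a substitution rather than the construction the pseudocode assumes, the automaton state is simulated by a row of the edit-distance matrix: a state is "alive" while some cell of the row is within the allowed distance, and "accepting" when the last cell is. The class `FuzzyTrieSearch`, the `Frame` tuple, and the method names are hypothetical. Acceptance is checked when a tuple is popped rather than when it is pushed, which also covers the start state.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative sketch: stack-based joint traversal of a trie and a simulated Levenshtein automaton. */
class FuzzyTrieSearch {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    /** One stack entry: dictionary state, automaton state (a matrix row), and the characters read so far. */
    private static final class Frame {
        final Node node;
        final int[] row;
        final String prefix;
        Frame(Node node, int[] row, String prefix) {
            this.node = node;
            this.row = row;
            this.prefix = prefix;
        }
    }

    private final Node root = new Node();

    void insert(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    /** Returns every stored word whose Levenshtein distance to query is at most maxDistance. */
    Set<String> withinDistance(String query, int maxDistance) {
        Set<String> results = new TreeSet<>();
        int[] startRow = new int[query.length() + 1];
        for (int i = 0; i < startRow.length; i++) {
            startRow[i] = i;
        }
        Deque<Frame> stack = new ArrayDeque<>();
        stack.push(new Frame(root, startRow, ""));
        while (!stack.isEmpty()) {
            Frame f = stack.pop();
            // Both states accepting: a dictionary word within the allowed distance.
            if (f.node.isWord && f.row[query.length()] <= maxDistance) {
                results.add(f.prefix);
            }
            for (Map.Entry<Character, Node> e : f.node.children.entrySet()) {
                int[] next = step(f.row, query, e.getKey());
                // Follow the transition only if the simulated automaton state is still alive.
                if (min(next) <= maxDistance) {
                    stack.push(new Frame(e.getValue(), next, f.prefix + e.getKey()));
                }
            }
        }
        return results;
    }

    // Computes the next matrix row after reading character c.
    private static int[] step(int[] row, String query, char c) {
        int[] next = new int[row.length];
        next[0] = row[0] + 1;
        for (int i = 1; i < row.length; i++) {
            int cost = query.charAt(i - 1) == c ? 0 : 1;
            next[i] = Math.min(Math.min(next[i - 1] + 1, row[i] + 1), row[i - 1] + cost);
        }
        return next;
    }

    private static int min(int[] row) {
        int m = row[0];
        for (int v : row) {
            m = Math.min(m, v);
        }
        return m;
    }
}
```

Iterating over the children of the current trie node, instead of over the whole alphabet, skips characters for which the dictionary has no transition anyway.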