Python: getting the maximum alignment between two lists of strings

Here are two numbered lists of strings:

list1=[[0, u'html'], [1, u'head'], [2, u'title'], [3, u'link'], [4, u'link'], [5, u'link'], [6, u'meta'], [7, u'meta'], [8, u'meta'], [9, u'meta'], [10, u'meta'], [11, u'meta'], [12, u'script'], [13, u'body'], [14, u'ul'], [15, u'li'], [16, u'a'], [17, u'li'], [18, u'a'], [19, u'li'], [20, u'a'], [21, u'li'], [22, u'a'], [23, u'div'], [24, u'a'], [25, u'div'], [26, u'div'], [27, u'form'], [28, u'fieldset'], [29, u'legend'], [30, u'input'], [31, u'input'], [32, u'label'], [33, u'input'], [34, u'input'], [35, u'ul'], [36, u'li'], [37, u'a'], [38, u'li'], [39, u'a'], [40, u'div'], [41, u'span'], [42, u'h1'], [43, u'a'], [44, u'ul'], [45, u'li'], [46, u'a'], [47, u'li'], [48, u'a'], [49, u'li'], [50, u'a'], [51, u'li'], [52, u'a'], [53, u'li'], [54, u'a'], [55, u'li'], [56, u'a'], [57, u'li'], [58, u'a'], [59, u'div'], [60, u'h1'], [61, u'p'], [62, u'strong'], [63, u'strong'], [64, u'strong'], [65, u'strong'], [66, u'strong'], [67, u'p'], [68, u'a'], [69, u'a'], [70, u'p'], [71, u'a'], [72, u'a'], [73, u'a'], [74, u'a'], [75, u'a'], [76, u'a'], [77, u'abbr'], [78, u'p'], [79, u'abbr'], [80, u'a'], [81, u'abbr'], [82, u'div'], [83, u'h1'], [84, u'h2'], [85, u'ul'], [86, u'li'], [87, u'a'], [88, u'li'], [89, u'a'], [90, u'li'], [91, u'a'], [92, u'li'], [93, u'a'], [94, u'li'], [95, u'a'], [96, u'li'], [97, u'a'], [98, u'li'], [99, u'a'], [100, u'h2'], [101, u'ul'], [102, u'li'], [103, u'a'], [104, u'li'], [105, u'a'], [106, u'li'], [107, u'a'], [108, u'li'], [109, u'a'], [110, u'li'], [111, u'a'], [112, u'li'], [113, u'a'], [114, u'li'], [115, u'a'], [116, u'h2'], [117, u'ul'], [118, u'li'], [119, u'a'], [120, u'li'], [121, u'a'], [122, u'li'], [123, u'a'], [124, u'div'], [125, u'p'], [126, u'a'], [127, u'p'], [128, u'span'], [129, u'span'], [130, u'a'], [131, u'a'], [132, u'script']]

list2=[[0, u'html'], [1, u'head'], [2, u'title'], [3, u'link'], [4, u'link'], [5, u'link'], [6, u'link'], [7, u'meta'], [8, u'meta'], [9, u'meta'], [10, u'meta'], [11, u'meta'], [12, u'meta'], [13, u'script'], [14, u'body'], [15, u'ul'], [16, u'li'], [17, u'a'], [18, u'li'], [19, u'a'], [20, u'li'], [21, u'a'], [22, u'li'], [23, u'a'], [24, u'div'], [25, u'a'], [26, u'div'], [27, u'div'], [28, u'form'], [29, u'fieldset'], [30, u'legend'], [31, u'input'], [32, u'input'], [33, u'label'], [34, u'input'], [35, u'input'], [36, u'ul'], [37, u'li'], [38, u'a'], [39, u'li'], [40, u'a'], [41, u'div'], [42, u'span'], [43, u'h1'], [44, u'a'], [45, u'ul'], [46, u'li'], [47, u'a'], [48, u'li'], [49, u'a'], [50, u'li'], [51, u'a'], [52, u'li'], [53, u'a'], [54, u'li'], [55, u'a'], [56, u'li'], [57, u'a'], [58, u'li'], [59, u'a'], [60, u'li'], [61, u'a'], [62, u'div'], [63, u'h1'], [64, u'p'], [65, u'strong'], [66, u'strong'], [67, u'strong'], [68, u'strong'], [69, u'strong'], [70, u'p'], [71, u'a'], [72, u'a'], [73, u'p'], [74, u'a'], [75, u'a'], [76, u'a'], [77, u'a'], [78, u'a'], [79, u'span'], [80, u'a'], [81, u'abbr'], [82, u'p'], [83, u'span'], [84, u'abbr'], [85, u'span'], [86, u'a'], [87, u'div'], [88, u'h1'], [89, u'h2'], [90, u'p'], [91, u'a'], [92, u'h2'], [93, u'ul'], [94, u'li'], [95, u'a'], [96, u'li'], [97, u'a'], [98, u'li'], [99, u'a'], [100, u'li'], [101, u'a'], [102, u'li'], [103, u'a'], [104, u'li'], [105, u'a'], [106, u'li'], [107, u'a'], [108, u'h2'], [109, u'ul'], [110, u'li'], [111, u'a'], [112, u'li'], [113, u'a'], [114, u'li'], [115, u'a'], [116, u'li'], [117, u'a'], [118, u'li'], [119, u'a'], [120, u'li'], [121, u'a'], [122, u'li'], [123, u'a'], [124, u'h2'], [125, u'span'], [126, u'ul'], [127, u'li'], [128, u'a'], [129, u'span'], [130, u'li'], [131, u'a'], [132, u'span'], [133, u'li'], [134, u'a'], [135, u'span'], [136, u'div'], [137, u'p'], [138, u'a'], [139, u'p'], [140, u'span'], [141, u'span'], [142, u'a'], [143, u'a'], [144, u'script']]

So, to evaluate how well they are aligned, I use the following code:

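counter = 0
offset = 0
aligned_counter = 0
for i1, string1 in list1:
    i2 = i1 + offset  # maybe we can vary this if the two strings do not match
    # Only compare when i2 is a valid index into list2
    if 0 <= i2 < len(list2) and list2[i2][1] == string1:
        aligned_counter += 1
    counter += 1

# Fraction of list1 items that match list2 at the same (shifted) position
alignment_score = float(aligned_counter) / counter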

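So is there a way to align the two lists so that most of the list items line up (for example: [0,0], [1,1] ... [27,28] ... [132,144])? When two strings are not equal, is there a better approach than varying the offset, considering that some links in the first list may be missing from the second list?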

If I understand correctly, this is the classic sequence alignment problem. There are straightforward dynamic programming algorithms that give an "optimal" solution in some sense.
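
A minimal sketch of such a dynamic programming alignment, in the style of the Needleman-Wunsch global alignment algorithm. The function name `global_align` and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration, not something taken from the answer above; different scores can produce different alignments.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align sequences a and b.

    Returns (score, pairs), where pairs lists the (i, j) index pairs
    at which a[i] == b[j] in one optimal alignment.
    """
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + step,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,       # skip an item of a
                              score[i][j - 1] + gap)       # skip an item of b

    # Walk back from the bottom-right corner to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + step:
            if a[i - 1] == b[j - 1]:
                pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return score[n][m], pairs
```

Called as `global_align([t[1] for t in list1], [t[1] for t in list2])`, this fills a table of roughly 133 x 145 cells, which is small enough for pure Python. When several alignments share the best score, the traceback order above decides which one is returned.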

There are many possible formal definitions of "most list items are aligned", so you are heading into a world of pain ;-)

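Here is one way that uses only the standard library. First, I note that the integers in your lists seem to be useless, right? That is, these assertions succeed: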

# list() keeps the comparison valid on Python 3, where range() is not a list
assert [t[0] for t in list1] == list(range(len(list1)))
assert [t[0] for t in list2] == list(range(len(list2)))

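The first element of each [integer, string] pair is just the index of that element in the list. I don't know why they are there, but they only get in the way. So the code here ignores them: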

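import difflib

# Compare only the tag names; the leading integers are just list indices.
s = difflib.SequenceMatcher(None,
                            [t[1] for t in list1],
                            [t[1] for t in list2])
for b in s.get_matching_blocks():
    # Each block means a[b.a:b.a+b.size] == b[b.b:b.b+b.size]
    print("exact match of length %d starting at indices %d and %d" % (b.size, b.a, b.b))
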
This displays:

exact match of length 3 starting at indices 0 and 0
exact match of length 56 starting at indices 3 and 4
exact match of length 17 starting at indices 59 and 62
exact match of length 3 starting at indices 76 and 80
exact match of length 1 starting at indices 79 and 84
exact match of length 1 starting at indices 80 and 86
exact match of length 2 starting at indices 82 and 87
exact match of length 33 starting at indices 84 and 92
exact match of length 3 starting at indices 117 and 126
exact match of length 2 starting at indices 120 and 130
exact match of length 2 starting at indices 122 and 133
exact match of length 9 starting at indices 124 and 136
exact match of length 0 starting at indices 133 and 145

As the docs explain, the last "matching block" returned is always a dummy block of size 0.

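.get_matching_blocks() implements one kind of "local alignment" (see the general links others gave you). It may or may not be what you want. But at least it is already coded for you ;-)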
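
To get index pairs in the form the question asks for ([0,0], [1,1] ... [27,28] ...), the matching blocks can be expanded item by item. This is a minimal sketch; the helper name `aligned_pairs` is made up for illustration, and it relies only on `difflib.SequenceMatcher`, `get_matching_blocks()` and `ratio()` from the standard library.

```python
import difflib

def aligned_pairs(a, b):
    """Return (pairs, ratio) for sequences a and b.

    pairs lists every (index_in_a, index_in_b) covered by a matching block;
    ratio is SequenceMatcher's similarity measure, 2*M/T, where M is the
    number of matched items and T is len(a) + len(b).
    """
    matcher = difflib.SequenceMatcher(None, a, b)
    pairs = []
    for block in matcher.get_matching_blocks():
        # The final dummy block has size 0, so it contributes nothing.
        for k in range(block.size):
            pairs.append((block.a + k, block.b + k))
    return pairs, matcher.ratio()
```

With `pairs, ratio = aligned_pairs([t[1] for t in list1], [t[1] for t in list2])`, the value `float(len(pairs)) / len(list1)` plays the role of the question's alignment_score. The block sizes printed above sum to 132, so 132 of the 133 items in list1 end up matched.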

Yes, this is sequence alignment. So is there a standard Pythonic way to do this, or do I need to develop one myself?

@hmghaly You can see the pseudocode on the wiki page, but there are probably Python implementations on the web already. As far as I can see, there is one for the Wikipedia article.

Thanks for the suggestion, but apparently it has some bugs:
File "C:\Python26\lib\site-packages\alignment\profile.py", line 35
    return {e: float(w) / t for e, w in self.__weights.iteritems()}
                            ^
SyntaxError: invalid syntax
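
The SyntaxError in the last comment is consistent with the interpreter shown in the traceback path: dict comprehensions were added in Python 2.7, so a Python 2.6 interpreter rejects that line while parsing it. A sketch of an equivalent expression that older interpreters accept follows; `weights` and `total` are hypothetical stand-ins, not the package's actual attributes.

```python
weights = {"a": 2, "b": 6}  # hypothetical stand-in for the package's weight table
total = sum(weights.values())

# The failing line uses a dict comprehension (Python 2.7 and later):
#     {e: float(w) / total for e, w in weights.items()}
# Passing a generator of (key, value) pairs to dict() builds the same
# mapping and also parses on Python 2.6.
normalized = dict((e, float(w) / total) for e, w in weights.items())
```

Running the package on Python 2.7 or later would avoid editing its source at all.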