Checking for a fuzzy/approximate substring existing in a longer string, in Python?

Tags: python, python-2.7, fuzzy-search

Using algorithms like Levenshtein (Levenshtein or difflib), it is easy to find approximate matches:

>>> import difflib
>>> difflib.SequenceMatcher(None,"amazing","amaging").ratio()
0.8571428571428571
Fuzzy matches can then be detected by choosing a threshold as needed.

Current requirement: find a fuzzy substring, based on a threshold, in a larger string.

For example, with large_string = "thelargemanhatanproject is a great project in themanhattincity" and query_string = "manhattan", the expected results are "manhatan" and "manhattin".

A brute-force solution is to generate all substrings of length N-1 to N+1 (or other matching lengths), where N is the length of the query string, then run Levenshtein on each one and check it against the threshold.
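A minimal sketch of this brute-force idea, using difflib's ratio as the similarity measure (the function name and the 0.9 threshold are illustrative, not from the original post):

```python
import difflib

def brute_force_fuzzy_substrings(large_string, query_string, threshold):
    """Score every substring of length N-1..N+1 against the query and
    keep the ones whose similarity ratio clears the threshold."""
    n = len(query_string)
    results = []
    seen = set()
    for length in range(n - 1, n + 2):
        for start in range(len(large_string) - length + 1):
            candidate = large_string[start:start + length]
            ratio = difflib.SequenceMatcher(None, candidate, query_string).ratio()
            if ratio >= threshold and candidate not in seen:
                seen.add(candidate)
                results.append((candidate, start, ratio))
    return results

matches = brute_force_fuzzy_substrings("thelargemanhatanproject", "manhattan", 0.9)
print(matches)
```

This scales as roughly O(len(large_string) * N) ratio computations, each itself non-trivial, which is exactly why the question asks for something better.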

Is there a better solution in Python, preferably a module included in Python 2.7, or an externally available module?

-------------- UPDATE AND SOLUTION -------------------

The Python regex module works well, although it is slightly slower than the built-in re module for the fuzzy-substring case, an expected result given the extra operations. The desired output is good, and control over the amount of fuzziness can easily be defined.

>>> import regex
>>> input = "Monalisa was painted by Leonrdo da Vinchi"
>>> regex.search(r'\b(leonardo){e<3}\s+(da)\s+(vinci){e<2}\b',input,flags=regex.IGNORECASE)
<regex.Match object; span=(23, 41), match=' Leonrdo da Vinchi', fuzzy_counts=(0, 2, 1)>
How about using difflib.SequenceMatcher.get_matching_blocks?

>>> import difflib
>>> large_string = "thelargemanhatanproject"
>>> query_string = "manhattan"
>>> s = difflib.SequenceMatcher(None, large_string, query_string)
>>> sum(n for i,j,n in s.get_matching_blocks()) / float(len(query_string))
0.8888888888888888

>>> query_string = "banana"
>>> s = difflib.SequenceMatcher(None, large_string, query_string)
>>> sum(n for i,j,n in s.get_matching_blocks()) / float(len(query_string))
0.6666666666666666
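Since the question also asks for the matched substring itself (a point raised in the comments), the pieces of large_string covered by the matching blocks can be joined back together; a small sketch:

```python
import difflib

large_string = "thelargemanhatanproject"
query_string = "manhattan"

s = difflib.SequenceMatcher(None, large_string, query_string)
# Join the spans of large_string that participate in matching blocks
# to recover the fuzzy-matched substring.
fuzzy_match = ''.join(large_string[i:i + n]
                      for i, j, n in s.get_matching_blocks() if n)
print(fuzzy_match)
```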

Update

import difflib

def matches(large_string, query_string, threshold):
    words = large_string.split()
    for word in words:
        s = difflib.SequenceMatcher(None, word, query_string)
        match = ''.join(word[i:i+n] for i, j, n in s.get_matching_blocks() if n)
        if len(match) / float(len(query_string)) >= threshold:
            yield match

large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"
print list(matches(large_string, query_string, 0.8))

The code above prints:
['manhatan', 'manhattn']

Recently I wrote an alignment library for Python:

Using it, you can perform global and local alignments with arbitrary scoring strategies on arbitrary sequence pairs. Actually, in your case you want semi-local alignments, since you do not care about substrings of query_string. I simulated the semi-local algorithm with local alignment and some heuristics in the code below, but it would be easy to extend the library for a proper implementation.

Below is the sample code from the README, modified for your case:

from alignment.sequence import Sequence, GAP_ELEMENT
from alignment.vocabulary import Vocabulary
from alignment.sequencealigner import SimpleScoring, LocalSequenceAligner

large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"

# Create sequences to be aligned.
a = Sequence(large_string)
b = Sequence(query_string)

# Create a vocabulary and encode the sequences.
v = Vocabulary()
aEncoded = v.encodeSequence(a)
bEncoded = v.encodeSequence(b)

# Create a scoring and align the sequences using local aligner.
scoring = SimpleScoring(1, -1)
aligner = LocalSequenceAligner(scoring, -1, minScore=5)
score, encodeds = aligner.align(aEncoded, bEncoded, backtrace=True)

# Iterate over optimal alignments and print them.
for encoded in encodeds:
    alignment = v.decodeSequenceAlignment(encoded)

    # Simulate a semi-local alignment.
    if len(filter(lambda e: e != GAP_ELEMENT, alignment.second)) != len(b):
        continue
    if alignment.first[0] == GAP_ELEMENT or alignment.first[-1] == GAP_ELEMENT:
        continue
    if alignment.second[0] == GAP_ELEMENT or alignment.second[-1] == GAP_ELEMENT:
        continue

    print alignment
    print 'Alignment score:', alignment.score
    print 'Percent identity:', alignment.percentIdentity()
    print
The output with minScore=5 looks like this:

m a n h a - t a n
m a n h a t t a n
Alignment score: 7
Percent identity: 88.8888888889

m a n h a t t - i
m a n h a t t a n
Alignment score: 5
Percent identity: 77.7777777778

m a n h a t t i n
m a n h a t t a n
Alignment score: 7
Percent identity: 88.8888888889
If you remove the minScore argument, you will get only the best-scoring matches:

m a n h a - t a n
m a n h a t t a n
Alignment score: 7
Percent identity: 88.8888888889

m a n h a t t i n
m a n h a t t a n
Alignment score: 7
Percent identity: 88.8888888889

Note that all algorithms in the library have O(n*m) time complexity, where n and m are the lengths of the sequences.

The new regex library that is slated to replace re includes fuzzy matching:

The fuzzy matching syntax looks fairly expressive, and this will give you a match with one or fewer insertions/additions/deletions:

import regex
regex.match('(amazing){e<=1}', 'amaging')
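For instance, a small sketch of the same idea using the regex module's BESTMATCH flag, which returns the best fuzzy match rather than the first acceptable one (this assumes the third-party regex package is installed):

```python
import regex

# {e<=1} permits at most one edit of any kind; BESTMATCH returns the
# match with the fewest errors instead of the first one found.
m = regex.search(r'(?:amazing){e<=1}', 'it was amaging indeed', regex.BESTMATCH)
# fuzzy_counts is a (substitutions, insertions, deletions) tuple.
print(m.group(), m.fuzzy_counts)
```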
How about fuzzysearch together with fuzzywuzzy? I use fuzzywuzzy to fuzzy-match based on a threshold, and fuzzysearch to fuzzily extract words from the match.

process.extractBests takes a query, a list of words, and a cutoff score, and returns a list of tuples of match and score above the cutoff score.

find_near_matches takes the result of process.extractBests and returns the start and end indices of the words. I use the indices to build the words, and use the built word to find the index in the large string. The max_l_dist of find_near_matches is the Levenshtein distance, which has to be adjusted to suit the needs.

from fuzzysearch import find_near_matches
from fuzzywuzzy import process

large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"

def fuzzy_extract(qs, ls, threshold):
    '''fuzzy matches 'qs' in 'ls' and returns list of 
    tuples of (word,index)
    '''
    for word, _ in process.extractBests(qs, (ls,), score_cutoff=threshold):
        print('word {}'.format(word))
        for match in find_near_matches(qs, word, max_l_dist=1):
            match = word[match.start:match.end]
            print('match {}'.format(match))
            index = ls.find(match)
            yield (match, index)
To test:

query_string = "manhattan"
print('query: {}\nstring: {}'.format(query_string, large_string))
for match,index in fuzzy_extract(query_string, large_string, 70):
    print('match: {}\nindex: {}'.format(match, index))

query_string = "citi"
print('query: {}\nstring: {}'.format(query_string, large_string))
for match,index in fuzzy_extract(query_string, large_string, 30):
    print('match: {}\nindex: {}'.format(match, index))

query_string = "greet"
print('query: {}\nstring: {}'.format(query_string, large_string))
for match,index in fuzzy_extract(query_string, large_string, 30):
    print('match: {}\nindex: {}'.format(match, index))
Output:

query: manhattan  
string: thelargemanhatanproject is a great project in themanhattincity  
match: manhatan  
index: 8  
match: manhattin  
index: 49  

query: citi  
string: thelargemanhatanproject is a great project in themanhattincity  
match: city  
index: 58  

query: greet  
string: thelargemanhatanproject is a great project in themanhattincity  
match: great  
index: 29 

The approaches above are good, but I needed to find a small needle in a lot of hay, and ended up approaching it like this:

from difflib import SequenceMatcher as SM
from nltk.util import ngrams
import codecs

needle = "this is the string we want to find"
hay    = "text text lots of text and more and more this string is the one we wanted to find and here is some more and even more still"

needle_length  = len(needle.split())
max_sim_val    = 0
max_sim_string = u""

for ngram in ngrams(hay.split(), needle_length + int(.2*needle_length)):
    hay_ngram = u" ".join(ngram)
    similarity = SM(None, hay_ngram, needle).ratio() 
    if similarity > max_sim_val:
        max_sim_val = similarity
        max_sim_string = hay_ngram

print max_sim_val, max_sim_string
Yields:

0.72972972973 this string is the one we wanted to find
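If pulling in nltk just for ngrams is undesirable, the same sliding word-window can be written with the stdlib alone; a sketch (best_window and the slack factor are illustrative names, not from the answer above):

```python
from difflib import SequenceMatcher

def best_window(needle, hay, slack=0.2):
    # Slide a word window slightly longer than the needle across the hay,
    # keeping the window with the highest similarity ratio.
    words = hay.split()
    size = len(needle.split()) + int(slack * len(needle.split()))
    best_ratio, best_string = 0.0, ""
    for start in range(len(words) - size + 1):
        window = " ".join(words[start:start + size])
        ratio = SequenceMatcher(None, window, needle).ratio()
        if ratio > best_ratio:
            best_ratio, best_string = ratio, window
    return best_ratio, best_string

needle = "this is the string we want to find"
hay = ("text text lots of text and more and more this string is the one "
       "we wanted to find and here is some more and even more still")
best_ratio, best_string = best_window(needle, hay)
print(best_ratio, best_string)
```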

How do I retrieve the fuzzy-matched substring from the blocks? e.g. 'manhatan'. @DhruvPathak,
a = "thelargemanhatanproject"; b = "manhattan"; s = difflib.SequenceMatcher(None, a, b); ''.join(a[i:i+n] for i, j, n in s.get_matching_blocks() if n)
It does not extract 'manhatan' from the large string; it results in the query string 'manhattan' (double t). @DhruvPathak? The code in my comment yields 'manhatan' (single t). Could your code also be extended to give multiple substrings, as in the example edit in my question?
The regex solution does work for the given example. What issue do you have with it? FWIW, fuzzy matching may get dropped for the version intended to be added to the standard library... If it actually goes in, then yes. I could not get this to work with the OP's 'manhattan' example; can you show code that makes it work? Sadly, regex.match('(test){e… @AwaisHussain, have you tried regex.search('(test){e…
index = ls.find(match) will only return the first occurrence. Very nice, but it currently does not handle that, so you need to add: if len(ls)… (below).
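One comment points out that ls.find(match) only returns the first occurrence; a stdlib sketch that yields every start index instead (find_all is an illustrative helper, not part of fuzzysearch or fuzzywuzzy):

```python
def find_all(haystack, sub):
    # Yield every start index of sub in haystack, not just the first.
    start = haystack.find(sub)
    while start != -1:
        yield start
        start = haystack.find(sub, start + 1)

positions = list(find_all("themanhattincity themanhattincity", "manhattin"))
print(positions)
```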