My previous approach used Python's difflib module and the principle of replacing each tag with a single character. But I realized that your problem can be solved with re alone, without substituting arbitrary characters.

Import

Data

I use tuples so that the data can be displayed in a more readable form during execution.
Note that I slightly modified the data to avoid some problems: a single blank between framework and for the period, a capital P in Programme 7 in both strings, and so on.
Note also that I added a sequence of # characters (###) to phrase and xmltext, in front of the date 2014-2015, to show that my code still works in that case. The other answers do not handle this situation.

phrase

xmltext

Execution

The result is:
*********************************************************
********* Searching for 'foobar' in samples *************
*********************************************************
fo##o##ba###r## aaaaaBLfoob##arAH
(0, 13) fo##o##ba###r
(23, 31) foob##ar
#fo##o##ba###r## aaaaaBLfoob##arAH
(1, 14) fo##o##ba###r
(24, 32) foob##ar
BLAHHfo##o##ba###r BLfoob##arAH
(5, 18) fo##o##ba###r
(23, 31) foob##ar
BLAH#fo##o##ba###rBLUHYfoob##arAH
(5, 18) fo##o##ba###r
(23, 31) foob##ar
BLA# fo##o##ba###rBLyyyfoob##ar
(5, 18) fo##o##ba###r
(23, 31) foob##ar
BLA# fo##o##ba###rBLy##foob##ar
(5, 18) fo##o##ba###r
(23, 31) foob##ar
kjhfqshqsk
-::: Not found :::-
..........................................
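The principle behind these spans can be sketched independently of my code. The snippet below is only an illustration written for Python 3 (the listings in this answer are Python 2), and it swaps in a different bookkeeping: one original index per kept character instead of the triples my code uses. The names find_spans and kept are hypothetical, and the needle is assumed not to contain any marker.

```python
import re

def find_spans(needle, haystack, marker_pattern):
    """Yield (start, end) spans of needle in haystack, ignoring markers.

    Illustrative sketch: kept[i] is the index in haystack of the i-th
    character of the marker-free text.
    """
    if not needle:
        return
    kept = []
    pos = 0
    for m in re.finditer(marker_pattern, haystack):
        kept.extend(range(pos, m.start()))   # characters before this marker
        pos = m.end()
    kept.extend(range(pos, len(haystack)))   # tail after the last marker
    cleaned = ''.join(haystack[i] for i in kept)
    for m in re.finditer(re.escape(needle), cleaned):
        # map the first and last matched characters back to the original text
        yield kept[m.start()], kept[m.end() - 1] + 1

for span in find_spans('foobar', 'fo##o##ba###r## aaaaaBLfoob##arAH', '#+'):
    print(span)
```

The index list costs one integer per kept character, so for very long texts the triples approach is lighter; the per-character map is just easier to read.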
Using the following code, I examined your question:
import urllib
sock = urllib.urlopen('http://stackoverflow.com/'
'questions/17381982/'
'python-regex-catastrophic-backtracking-where')
r =sock.read()
sock.close()
i = r.find('unpredictable, such as the following')
j = r.find('in order to match the following phrase')
k = r.find('I came up with this regex ')
print 'i == %d j== %d' % (i,j)
print repr(r[i:j])
print
print 'j == %d k== %d' % (j,k)
print repr(r[j:k])
The result is:
i == 10408 j== 10714
'unpredictable, such as the following:</p>\n\n<blockquote>\n Relationship to the #################strategic framework ################## for the period 2014-2015####################: Programme 7, Economic and Social Affairs, subprogramme 3, expected\n \n <p>accomplishment (c)#######</p>\n</blockquote>\n\n<p>so '
j == 10714 k== 10955
'in order to match the following phrase:</p>\n\n<blockquote>\n <p>Relationship to the strategic framework for the period 2014-2015:\n programme 7, Economic and Social Affairs, subprogramme 3, expected\n accomplishment (c)</p>\n</blockquote>\n\n<p>'
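Note the additional \n in front of programme 7, the additional \n in front of accomplishment, the difference between Programme 7 and programme 7, and the presence of two blanks between framework and for the period in the string framework ################## for the period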
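To make these differences concrete, here is a small Python 3 sketch. It is only an illustration: the helper normalize and the two shortened strings are hypothetical, cut down from the output above. Removing the # characters alone does not make the two texts equal; they only coincide once runs of whitespace are collapsed and case is ignored.

```python
import re

def normalize(text):
    """Drop '#' markers, collapse whitespace runs to one blank, lowercase."""
    no_markers = text.replace('#', '')
    return re.sub(r'\s+', ' ', no_markers).strip().lower()

# shortened excerpts of the two blockquotes (illustrative data)
marked = 'strategic framework ################## for the period 2014-2015####: Programme 7'
phrase = 'strategic framework for the period 2014-2015:\n programme 7'

print(marked.replace('#', '') == phrase)       # differs: blanks, \n and case
print(normalize(marked) == normalize(phrase))  # equal after normalization
```

Such a normalization tells you whether the phrase is present, but it does not give the span in the original text, which is what the question asks for.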
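This may explain the difficulties you had with your example.

The following code shows that FMc's code doesn't work.
The line
from name_of_file import olding_the_new,finding
refers to the code in my own answer to this question.
* Give the name name_of_file to the file containing the script of my code (it is in my other answer in this thread) and it will run.
* Or, if you don't want to copy-paste my code, just comment out this import line; the following code will still run, because I put in a try-except instruction that reacts correctly to the absence of olding_the_new and finding.
I used two ways to verify the results of FMc's code:
-1/ comparing the span returned by his code with the index of 'f' and the index of 'r', since we search for the phrase 'foobar' and I made sure that there is no f or r other than those of foobar
-2/ comparing with the first span returned by my code, hence the need for the above import from name_of_file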
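The first check can be sketched as follows in Python 3. This is a simplified illustration and not the test script itself; the names naive_foobar_span and check are hypothetical. It is only valid under the assumption stated above: the first 'f' and the first 'r' of the sample belong to the first foobar.

```python
def naive_foobar_span(text):
    """Span of the first 'foobar', assuming the first 'f' and first 'r'
    of text belong to it (true for the samples used here)."""
    return (text.index('f'), text.index('r') + 1)

def check(text, candidate_span):
    """Compare a span computed by some algorithm with the naive span."""
    return tuple(candidate_span) == naive_foobar_span(text)

sample = 'BLAH ##BLAH fo####o##ba###r## BL##AH'
print(naive_foobar_span(sample))
print(check(sample, (12, 27)), check(sample, (12, 26)))
```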
Nota bene
If disp = None is changed to disp = True, the execution displays intermediate results that help to understand the algorithm.
Result
---------------------------------------------
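FMc's code is very subtle; it took me a long time to understand its principle before I was able to correct it.
I leave it to anyone interested to work out the algorithm. I only state the corrections required to make FMc's code work correctly:
First correction: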
if s + w - 1 < start:
# must be changed to
if s + w - 1 <= start or (s==start):
-------------------
I am amazed that two people upvoted FMc's answer although it gives wrong code. It means that they voted for the answer without testing the given code.
----------------------------------------
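EDIT
Why must the condition
if s + w - 1 < start:
be changed to this one:
if s + w - 1 <= start or (s==start):

Comments:

Why not remove all the # and normalize consecutive whitespace to a single blank? That way you could do a simple substring match instead of using a regex.

This is a bad design. A very bad design. I don't see why replacing tags with blocks of '#' is fine but getting rid of them entirely is not.

What exactly are you doing with your regex? Are you scanning a larger document to find this particular passage (you need to know where it is), or are you testing multiple separate documents to find one that matches?

@mishik That's what you get when you save MS Word as XML.

Stop parsing XML with regex.

No, I want to keep the pound signs; removing them would compromise the integrity of the text I'm searching, which I need for further processing.

your_string.replace('#','') does not overwrite your string; it creates a new string, so you can keep the old one for later and use a simpler regex on the newly returned string.

Interesting, but how do you get the span from that?

You can find the location of the match in the input string by iterating with enumerate and simply saving the indices of the first and last matching characters.

@Blckknght, before that you need to know the first character of the match. If you need to find "abc" in the line, you shouldn't simply take the first "a".

Ah, that's a good point. If there is a partial match (even of just the first pattern character), you'll be in trouble. I guess you could reset the counter variable, but it still gets a bit messy.

It was a pleasant job, because I like Python. It is indeed a simple code. But you don't seem particularly satisfied now that there is a solution that gives the result you want.

The thing is that I found there was already a workable answer and I accepted it, so I really appreciate that you put so much effort into coming up with another answer :) Thanks

"What I need is basically the span of this phrase in the actual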
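The enumerate idea from the comments, together with the partial-match objection, can be sketched like this in Python 3. It is only an illustration under stated assumptions: the name scan_span is hypothetical, the marker is a single character, and the needle contains no marker. Instead of resetting a counter, every position holding the first needle character is tried as a fresh start, which is what takes care of partial matches.

```python
def scan_span(needle, text, marker='#'):
    """Return (start, end) of the first occurrence of needle in text,
    skipping marker characters, or None if there is none."""
    if not needle:
        return None
    for start, ch in enumerate(text):
        if ch != needle[0]:
            continue
        k = 0                          # number of needle characters matched
        for end in range(start, len(text)):
            c = text[end]
            if c == marker:
                continue               # markers are ignored inside a match
            if c != needle[k]:
                break                  # partial match: try the next start
            k += 1
            if k == len(needle):
                return (start, end + 1)
    return None

print(scan_span('abc', 'a#ab##c'))
```

In 'a#ab##c' the first 'a' only starts a partial match, so the scan moves on and finds the occurrence that begins at the second 'a'. The worst case is quadratic, which is one reason to prefer cleaning the text once and mapping positions back.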
def compute_span(span_start, search_width, widths, is_marker):
    span_end = span_start + search_width - 1
    to_consume = span_start + search_width
    start_is_fixed = False

    for w in widths:
        if is_marker:
            # Shift start and end rightward.
            span_start += (0 if start_is_fixed else w)
            span_end += w
        else:
            # Reduce amount of non-marker text we need to consume.
            # As that amount gets smaller, we'll first fix the
            # location of the span_start, and then stop.
            to_consume -= w
            if to_consume < search_width:
                start_is_fixed = True
                if to_consume <= 0: break
        # Toggle the flag.
        is_marker = not is_marker
    return [span_start, span_end]
def main():
    tests = [
        #                0123456789012345678901234567890123456789
        ( [None, None], '' ),
        ( [   0,    5], 'foobar' ),
        ( [   0,    5], 'foobar###' ),
        ( [   3,    8], '###foobar' ),
        ( [   2,    7], '##foobar###' ),
        ( [  25,   34], 'BLAH ##BLAH fo####o##ba##foo###b#ar' ),
        ( [  12,   26], 'BLAH ##BLAH fo####o##ba###r## BL##AH' ),
        ( [None, None], 'jkh##jh#f' ),
        ( [   1,   12], '#f#oo##ba###r##' ),
        ( [   4,   15], 'a##xf#oo##ba###r##' ),
        ( [   4,   15], 'ax##f#oo##ba###r##' ),
        ( [   7,   18], 'ab###xyf#oo##ba###r##' ),
        ( [   7,   18], 'abx###yf#oo##ba###r##' ),
        ( [   7,   18], 'abxy###f#oo##ba###r##' ),
        ( [   8,   19], 'iji#hkh#f#oo##ba###r##' ),
        ( [   8,   19], 'mn##pps#f#oo##ba###r##' ),
        ( [  12,   23], 'mn##pab###xyf#oo##ba###r##' ),
        ( [  12,   23], 'lmn#pab###xyf#oo##ba###r##' ),
        ( [   0,   12], 'fo##o##ba###r## aaaaaBLfoob##arAH' ),
        ( [   0,   12], 'fo#o##ba####r## aaaaaBLfoob##ar#AH' ),
        ( [   0,   12], 'f##oo##ba###r## aaaaaBLfoob##ar' ),
        ( [   0,   12], 'f#oo##ba####r## aaaaBL#foob##arAH' ),
        ( [   0,   12], 'f#oo##ba####r## aaaaBL#foob##ar#AH' ),
        ( [   0,   12], 'foo##ba#####r## aaaaBL#foob##ar' ),
        ( [   1,   12], '#f#oo##ba###r## aaaBL##foob##arAH' ),
        ( [   1,   12], '#foo##ba####r## aaaBL##foob##ar#AH' ),
        ( [   2,   12], '#af#oo##ba##r## aaaBL##foob##ar' ),
        ( [   3,   13], '##afoo##ba###r## aaaaaBLfoob##arAH' ),
        ( [   5,   17], 'BLAHHfo##o##ba###r aaBLfoob##ar#AH' ),
        ( [   5,   17], 'BLAH#fo##o##ba###r aaBLfoob##ar' ),
        ( [   5,   17], 'BLA#Hfo##o##ba###r###BLfoob##ar' ),
        ( [   5,   17], 'BLA#Hfo##o##ba###r#BL##foob##ar' ),
    ]
    for exp, t in tests:
        span = do_it('foobar', t, verbose = True)
        if exp != span:
            print '\n0123456789012345678901234567890123456789'
            print t
            print n
            print dict(got = span, exp = exp)

main()
import re

tu_phrase = ('Relationship to the ',
             'strategic framework ',
             'for the period ###2014-2015',
             ': Programme 7, Economic and Social Affairs, ',
             'subprogramme 3, expected accomplishment (c)')
phrase = ''.join(tu_phrase)

tu_xmltext = ('EEEEEEE',
              '<w:rPr>',
              'AAAAAAA',
              '</w:rPr><w:t>',
              'Relationship to the ',
              '</w:t></w:r><w:r>',
              '<w:rPr><w:i/>',
              '<w:sz w:val="17"/><w:sz-cs w:val="17"/>'
              'strategic framework ',
              '</w:t></w:r><w:r wsp:rsidRPr="00EC3076">',
              '<w:sz w:val="17"/><w:sz-cs w:val="17"/>',
              '</w:rPr><w:t>',
              'for the period ###2014-2015',
              '</w:t></w:r><w:r wsp:rsidRPr="00EC3076"><w:rPr>',
              '<w:sz w:val="17"/><w:sz-cs w:val="17"/>',
              '</w:rPr><w:t>',
              ': Programme 7, Economic and Social Affairs, ',
              'subprogramme 3, expected accomplishment (c)',
              '</w:t>',
              '321354641331')
xmltext = ''.join(tu_xmltext)

def olding_the_new(stuvw,pat_for_sub):
    triples = []
    pmod = 0 # pmod = position in modified stuvw,
             # that is to say in re.sub(pat_for_sub,'',stuvw)
    for mat in re.finditer('{0}|([\s\S]+?)(?={0}|\Z)'.format(pat_for_sub),
                           stuvw):
        if mat.group(1):
            triples.append((pmod,mat.end()-mat.start(),mat.start()))
            pmod += mat.end()-mat.start()
    return triples

def finding(LITTLE,BIG,pat_for_sub,
            olding_the_new=olding_the_new):
    triples = olding_the_new(BIG,'(?:%s)+' % pat_for_sub)
    modBIG = re.sub(pat_for_sub,'',BIG)
    modLITTLE = re.escape(LITTLE)
    for mat in re.finditer(modLITTLE,modBIG):
        st,nd = mat.span() # in modBIG
        sori = -1 # start original, id est in BIG
        for tr in triples:
            if st < tr[0]+tr[1] and sori<0:
                sori = tr[2] + st - tr[0]
            if nd<=tr[0]+tr[1]:
                yield(sori, tr[2] + nd - tr[0])
                break
if __name__ == '__main__':

    print ('---------- phrase ----------\n%s\n'
           '\n------- phrase written in a readable form --------\n'
           '%s\n\n\n'
           '---------- xmltext ----------\n%s\n'
           '\n------- xmltext written in a readable form --------\n'
           '%s\n\n\n'
           %
           (phrase , '\n'.join(tu_phrase),
            xmltext , '\n'.join(tu_xmltext)) )

    print ('*********************************************************\n'
           '********** Searching for phrase in xmltext **************\n'
           '*********************************************************')

    # finding() is a generator and a generator object is always true:
    # build a list so that the "not found" branch can be reached
    spans = list(finding(phrase,xmltext,'</?w:[^>]*>'))
    if spans:
        for s,e in spans:
            print ("\nspan in string 'xmltext' : (%d , %d)\n\n"
                   'xmltext[%d:%d] :\n%s'
                   % (s,e,s,e,xmltext[s:e]))
    else:
        print ("-::: The first string isn't in second string :::-")
*********************************************************
********** Searching for phrase in xmltext **************
*********************************************************
span in string 'xmltext' : (34 , 448)
xmltext[34:448] :
Relationship to the </w:t></w:r><w:r><w:rPr><w:i/><w:sz w:val="17"/><w:sz-cs w:val="17"/>strategic framework </w:t></w:r><w:r wsp:rsidRPr="00EC3076"><w:sz w:val="17"/><w:sz-cs w:val="17"/></w:rPr><w:t>for the period ###2014-2015</w:t></w:r><w:r wsp:rsidRPr="00EC3076"><w:rPr><w:sz w:val="17"/><w:sz-cs w:val="17"/></w:rPr><w:t>: Programme 7, Economic and Social Affairs, subprogramme 3, expected accomplishment (c)
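To see what kind of bookkeeping produces such a span, here is a small Python 3 sketch of the triples idea. It is a simplified re-implementation for illustration, not the olding_the_new function itself; the names segments and cleaned_pos are hypothetical. Each triple holds the position of a run of kept text in the cleaned string, the length of the run, and its position in the original string.

```python
import re

def segments(text, marker_pattern):
    """Return (pos_in_cleaned, length, pos_in_original) for every run of
    text located between two matches of marker_pattern."""
    triples = []
    cleaned_pos = 0   # position in the text with markers removed
    last = 0          # end of the previous marker in the original text
    for m in re.finditer(marker_pattern, text):
        if m.start() > last:
            triples.append((cleaned_pos, m.start() - last, last))
            cleaned_pos += m.start() - last
        last = m.end()
    if last < len(text):
        triples.append((cleaned_pos, len(text) - last, last))
    return triples

for triple in segments('fo##o##ba###r##', '#+'):
    print(triple)
```

A position p found in the cleaned text falls in the triple (c, n, o) for which c <= p < c + n, and its position in the original text is then o + p - c.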
print ('\n*********************************************************\n'
       "********* Searching for 'foobar' in samples *************\n"
       '*********************************************************')

for xample in ('fo##o##ba###r## aaaaaBLfoob##arAH',
               '#fo##o##ba###r## aaaaaBLfoob##arAH',
               'BLAHHfo##o##ba###r BLfoob##arAH',
               'BLAH#fo##o##ba###rBLUHYfoob##arAH',
               'BLA# fo##o##ba###rBLyyyfoob##ar',
               'BLA# fo##o##ba###rBLy##foob##ar',
               'kjhfqshqsk'):
    spans = list(finding('foobar',xample,'#'))
    if spans:
        print ('\n%s\n%s'
               %
               (xample,
                '\n'.join('%s %s'
                          % (sp,xample[sp[0]:sp[1]])
                          for sp in spans))
               )
    else:
        print ("\n%s\n-::: Not found :::-" % xample)
import re
from name_of_file import olding_the_new,finding

def main():
    # Two versions of the text: the original,
    # and one without any of the "#" markers.
    for text_orig in ('BLAH ##BLAH fo####o##ba###r## BL##AH',
                      'jkh##jh#f',
                      '#f#oo##ba###r##',
                      'a##xf#oo##ba###r##',
                      'ax##f#oo##ba###r##',
                      'ab###xyf#oo##ba###r##',
                      'abx###yf#oo##ba###r##',
                      'abxy###f#oo##ba###r##',
                      'iji#hkh#f#oo##ba###r##',
                      'mn##pps#f#oo##ba###r##',
                      'mn##pab###xyf#oo##ba###r##',
                      'lmn#pab###xyf#oo##ba###r##',
                      'fo##o##ba###r## aaaaaBLfoob##arAH',
                      'fo#o##ba####r## aaaaaBLfoob##ar#AH',
                      'f##oo##ba###r## aaaaaBLfoob##ar',
                      'f#oo##ba####r## aaaaBL#foob##arAH',
                      'f#oo##ba####r## aaaaBL#foob##ar#AH',
                      'foo##ba#####r## aaaaBL#foob##ar',
                      '#f#oo##ba###r## aaaBL##foob##arAH',
                      '#foo##ba####r## aaaBL##foob##ar#AH',
                      '#af#oo##ba##r## aaaBL##foob##ar',
                      '##afoo##ba###r## aaaaaBLfoob##arAH',
                      'BLAHHfo##o##ba###r aaBLfoob##ar#AH',
                      'BLAH#fo##o##ba###r aaBLfoob##ar',
                      'BLA#Hfo##o##ba###r###BLfoob##ar',
                      'BLA#Hfo##o##ba###r#BL##foob##ar',
                      ):

        text_clean = text_orig.replace('#', '')
        # Collect data on the positions and widths
        # of the markers in the original text.
        rgx = re.compile(r'#+')
        markers = [(m.start(), len(m.group()))
                   for m in rgx.finditer(text_orig)]

        # Find the location of the search phrase in the cleaned text.
        # At that point you'll have all the data you need to compute
        # the span of the phrase in the original text.
        search = 'foobar'
        try:
            i = text_clean.index(search)
            print ('text_clean == %s\n'
                   "text_clean.index('%s')==%d len('%s') == %d\n"
                   'text_orig == %s\n'
                   'markers == %s'
                   % (text_clean,
                      search,i,search,len(search),
                      text_orig,
                      markers))
            S,E = compute_span(i, len(search), markers)
            print "span = (%d,%d) %s %s %s"\
                  % (S,E,
                     text_orig.index('f')==S,
                     text_orig.index('r')+1==E,
                     list(finding(search,text_orig,'#+')))
        except ValueError:
            print ('text_clean == %s\n'
                   "text_clean.index('%s') ***Not found***\n"
                   'text_orig == %s\n'
                   'markers == %s'
                   % (text_clean,
                      search,
                      text_orig,
                      markers))
        print '--------------------------------'
def compute_span(start, width, markers):
    # start and width are in expurgated text
    # markers are in original text
    disp = None # if disp==True => displaying of intermediary results
    span_start = start
    if disp:
        print ('\nAt beginning in compute_span():\n'
               ' span_start==start==%d width==%d'
               % (start,width))
    for s, w in markers: # s and w are in original text
        if disp:
            print ('\ns,w==%d,%d'
                   ' s+w-1(%d)<start(%d) %s'
                   ' s(%d)==start(%d) %s'
                   % (s,w,s+w-1,start,s+w-1<start,s,start,s==start))
        if s + w - 1 < start:
            #mwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmmwmwmwmwmwm
            # the following if-else section is justified to be used
            # only after correction of the above line to this one:
            # if s+w-1 <= start or s==start:
            #mwmwmwmwmwmwmwmwmwmwmwmwmwmwmwm
            if s + w - 1 <= start and disp:
                print ' 1a) s + w - 1 (%d) <= start (%d) marker at left'\
                      % (s+w-1, start)
            elif disp:
                print ' 1b) s(%d) == start(%d)' % (s,start)
            #mwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmmwmwmwmwmwm
            # Situation: marker fully to left of our text.
            # Adjust our start points rightward.
            start += w
            span_start += w
            if disp:
                print ' span_start == %d start, width == %d, %d' % (span_start, start, width)
        elif start + width - 1 < s:
            if disp:
                print (' 2) start + width - 1 (%d) < s (%d) marker at right\n'
                       ' break' % (start+width-1, s))
            # Situation: marker fully to the right of our text.
            break
        else:
            # Situation: marker interrupts our text.
            # Advance the start point for the remaining text
            # rightward, and reduce the remaining width.
            if disp:
                print " 3) In 'else': s - start == %d marker interrupts" % (s - start)
            start += w
            width = width - (s - start)
            if disp:
                print ' span_start == %d start, width == %d, %d' % (span_start, start, width)
    return (span_start, start + width)

main()
>>>
text_clean == BLAH BLAH foobar BLAH
text_clean.index('foobar')==10 len('foobar') == 6
text_orig == BLAH ##BLAH fo####o##ba###r## BL##AH
markers == [(5, 2), (14, 4), (19, 2), (23, 3), (27, 2), (32, 2)]
span = (12,26) True False [(12, 27)]
--------------------------------
text_clean == jkhjhf
text_clean.index('foobar') ***Not found***
text_orig == jkh##jh#f
markers == [(3, 2), (7, 1)]
--------------------------------
text_clean == foobar
text_clean.index('foobar')==0 len('foobar') == 6
text_orig == #f#oo##ba###r##
markers == [(0, 1), (2, 1), (5, 2), (9, 3), (13, 2)]
span = (0,11) False False [(1, 13)]
--------------------------------
text_clean == axfoobar
text_clean.index('foobar')==2 len('foobar') == 6
text_orig == a##xf#oo##ba###r##
markers == [(1, 2), (5, 1), (8, 2), (12, 3), (16, 2)]
span = (2,16) False True [(4, 16)]
--------------------------------
text_clean == axfoobar
text_clean.index('foobar')==2 len('foobar') == 6
text_orig == ax##f#oo##ba###r##
markers == [(2, 2), (5, 1), (8, 2), (12, 3), (16, 2)]
span = (2,15) False False [(4, 16)]
--------------------------------
text_clean == abxyfoobar
text_clean.index('foobar')==4 len('foobar') == 6
text_orig == ab###xyf#oo##ba###r##
markers == [(2, 3), (8, 1), (11, 2), (15, 3), (19, 2)]
span = (4,19) False True [(7, 19)]
--------------------------------
text_clean == abxyfoobar
text_clean.index('foobar')==4 len('foobar') == 6
text_orig == abx###yf#oo##ba###r##
markers == [(3, 3), (8, 1), (11, 2), (15, 3), (19, 2)]
span = (4,18) False False [(7, 19)]
--------------------------------
text_clean == abxyfoobar
text_clean.index('foobar')==4 len('foobar') == 6
text_orig == abxy###f#oo##ba###r##
markers == [(4, 3), (8, 1), (11, 2), (15, 3), (19, 2)]
span = (4,19) False True [(7, 19)]
--------------------------------
text_clean == ijihkhfoobar
text_clean.index('foobar')==6 len('foobar') == 6
text_orig == iji#hkh#f#oo##ba###r##
markers == [(3, 1), (7, 1), (9, 1), (12, 2), (16, 3), (20, 2)]
span = (7,18) False False [(8, 20)]
--------------------------------
text_clean == mnppsfoobar
text_clean.index('foobar')==5 len('foobar') == 6
text_orig == mn##pps#f#oo##ba###r##
markers == [(2, 2), (7, 1), (9, 1), (12, 2), (16, 3), (20, 2)]
span = (7,18) False False [(8, 20)]
--------------------------------
text_clean == mnpabxyfoobar
text_clean.index('foobar')==7 len('foobar') == 6
text_orig == mn##pab###xyf#oo##ba###r##
markers == [(2, 2), (7, 3), (13, 1), (16, 2), (20, 3), (24, 2)]
span = (9,24) False True [(12, 24)]
--------------------------------
text_clean == lmnpabxyfoobar
text_clean.index('foobar')==8 len('foobar') == 6
text_orig == lmn#pab###xyf#oo##ba###r##
markers == [(3, 1), (7, 3), (13, 1), (16, 2), (20, 3), (24, 2)]
span = (9,24) False True [(12, 24)]
--------------------------------
text_clean == foobar aaaaaBLfoobarAH
text_clean.index('foobar')==0 len('foobar') == 6
text_orig == fo##o##ba###r## aaaaaBLfoob##arAH
markers == [(2, 2), (5, 2), (9, 3), (13, 2), (27, 2)]
span = (0,9) True False [(0, 13), (23, 31)]
--------------------------------
text_clean == foobar aaaaaBLfoobarAH
text_clean.index('foobar')==0 len('foobar') == 6
text_orig == fo#o##ba####r## aaaaaBLfoob##ar#AH
markers == [(2, 1), (4, 2), (8, 4), (13, 2), (27, 2), (31, 1)]
span = (0,7) True False [(0, 13), (23, 31)]
--------------------------------
text_clean == foobar aaaaaBLfoobar
text_clean.index('foobar')==0 len('foobar') == 6
text_orig == f##oo##ba###r## aaaaaBLfoob##ar
markers == [(1, 2), (5, 2), (9, 3), (13, 2), (27, 2)]
span = (0,11) True False [(0, 13), (23, 31)]
--------------------------------
text_clean == foobar aaaaBLfoobarAH
text_clean.index('foobar')==0 len('foobar') == 6
text_orig == f#oo##ba####r## aaaaBL#foob##arAH
markers == [(1, 1), (4, 2), (8, 4), (13, 2), (22, 1), (27, 2)]
span = (0,8) True False [(0, 13), (23, 31)]
--------------------------------
text_clean == foobar aaaaBLfoobarAH
text_clean.index('foobar')==0 len('foobar') == 6
text_orig == f#oo##ba####r## aaaaBL#foob##ar#AH
markers == [(1, 1), (4, 2), (8, 4), (13, 2), (22, 1), (27, 2), (31, 1)]
span = (0,8) True False [(0, 13), (23, 31)]
--------------------------------
text_clean == foobar aaaaBLfoobar
text_clean.index('foobar')==0 len('foobar') == 6
text_orig == foo##ba#####r## aaaaBL#foob##ar
markers == [(3, 2), (7, 5), (13, 2), (22, 1), (27, 2)]
span = (0,7) True False [(0, 13), (23, 31)]
--------------------------------
text_clean == foobar aaaBLfoobarAH
text_clean.index('foobar')==0 len('foobar') == 6
text_orig == #f#oo##ba###r## aaaBL##foob##arAH
markers == [(0, 1), (2, 1), (5, 2), (9, 3), (13, 2), (21, 2), (27, 2)]
span = (0,11) False False [(1, 13), (23, 31)]
--------------------------------
text_clean == foobar aaaBLfoobarAH
text_clean.index('foobar')==0 len('foobar') == 6
text_orig == #foo##ba####r## aaaBL##foob##ar#AH
markers == [(0, 1), (4, 2), (8, 4), (13, 2), (21, 2), (27, 2), (31, 1)]
span = (0,12) False False [(1, 13), (23, 31)]
--------------------------------
text_clean == afoobar aaaBLfoobar
text_clean.index('foobar')==1 len('foobar') == 6
text_orig == #af#oo##ba##r## aaaBL##foob##ar
markers == [(0, 1), (3, 1), (6, 2), (10, 2), (13, 2), (21, 2), (27, 2)]
span = (2,10) True False [(2, 13), (23, 31)]
--------------------------------
text_clean == afoobar aaaaaBLfoobarAH
text_clean.index('foobar')==1 len('foobar') == 6
text_orig == ##afoo##ba###r## aaaaaBLfoob##arAH
markers == [(0, 2), (6, 2), (10, 3), (14, 2), (28, 2)]
span = (1,14) False True [(3, 14), (24, 32)]
--------------------------------
text_clean == BLAHHfoobar aaBLfoobarAH
text_clean.index('foobar')==5 len('foobar') == 6
text_orig == BLAHHfo##o##ba###r aaBLfoob##ar#AH
markers == [(7, 2), (10, 2), (14, 3), (27, 2), (31, 1)]
span = (5,14) True False [(5, 18), (23, 31)]
--------------------------------
text_clean == BLAHfoobar aaBLfoobar
text_clean.index('foobar')==4 len('foobar') == 6
text_orig == BLAH#fo##o##ba###r aaBLfoob##ar
markers == [(4, 1), (7, 2), (10, 2), (14, 3), (27, 2)]
span = (4,16) False False [(5, 18), (23, 31)]
--------------------------------
text_clean == BLAHfoobarBLfoobar
text_clean.index('foobar')==4 len('foobar') == 6
text_orig == BLA#Hfo##o##ba###r###BLfoob##ar
markers == [(3, 1), (7, 2), (10, 2), (14, 3), (18, 3), (27, 2)]
span = (5,14) True False [(5, 18), (23, 31)]
--------------------------------
text_clean == BLAHfoobarBLfoobar
text_clean.index('foobar')==4 len('foobar') == 6
text_orig == BLA#Hfo##o##ba###r#BL##foob##ar
markers == [(3, 1), (7, 2), (10, 2), (14, 3), (18, 1), (21, 2), (27, 2)]
span = (5,14) True False [(5, 18), (23, 31)]
--------------------------------
>>>
Second correction:
start += w
width = width - (s - start)
# must be changed to
width -= (s-start) # this line MUST BE before the following one
start = s + w      # because start += (s-start) + w
That is, the two lines become:
width -= (s - start)
start = s + w
Samples in which a marker begins exactly at start (s == start):
'#f#oo##ba###r##'                   : s,w==0,1 , 0==s==start==0
'ax##f#oo##ba###r##'                : s,w==2,2 , 2==s==start==2
'abxy###f#oo##ba###r##'             : s,w==4,3 , 4==s==start==4
'#f#oo##ba###r## aaaBL##foob##arAH' : s,w==0,1 , 0==s==start==0
'BLAH#fo##o##ba###r aaBLfoob##ar'   : s,w==4,1 , 4==s==start==4
'iji#hkh#f#oo##ba###r##'            : s,w==7,1 , 7==s==start==7
'mn##pps#f#oo##ba###r##'            : s,w==7,1 , 7==s==start==7
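Putting the two corrections together gives the following Python 3 sketch, with the display code stripped. It is not FMc's code verbatim. I use the broader test s <= start for the first branch, which is an assumption of mine and not the wording of the correction above: since w >= 1 it covers both s + w - 1 <= start and s == start, and it also covers a marker that begins before start but ends after it. The helper span_of is hypothetical, and the markers are assumed to be maximal runs as found by '#+'.

```python
import re

def compute_span(start, width, markers):
    """start and width refer to the cleaned text; markers is a list of
    (position, width) pairs in the original text, in increasing order."""
    span_start = start
    for s, w in markers:
        if s <= start:
            # marker to the left of the remaining text: shift rightward
            start += w
            span_start += w
        elif start + width - 1 < s:
            # marker fully to the right of the remaining text
            break
        else:
            # marker interrupts the text: consume what precedes it first,
            # then jump past the marker (second correction)
            width -= (s - start)
            start = s + w
    return (span_start, start + width)

def span_of(search, text_orig):
    """Hypothetical helper: span of the first occurrence of search."""
    markers = [(m.start(), len(m.group()))
               for m in re.finditer(r'#+', text_orig)]
    i = text_orig.replace('#', '').index(search)
    return compute_span(i, len(search), markers)

print(span_of('foobar', 'a##xf#oo##ba###r##'))
```

On the samples listed in this answer, the returned spans agree with the first span given by my finding function.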