Python: finding the largest matching region in a text file

python python-3.x

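A.txt contains lines like the ones below (or some subset of them). Essentially, the last field on each line is either 1 or 3. Suppose the file goes on like this for a very long time. What I need to do is find the largest number of consecutive lines that end in 1, while keeping the number of lines that end in 3 less than or equal to some number (for example, 2). For example, suppose A.txt as a whole looks like this: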

Green- Blue- 1
Red- Black- 3
Brown- Blue- 3
Black- Red- 3
Green- Blue- 1
Green- Purple- 1
Red- Black- 3
Brown- Blue- 3
Black- Red- 1
Blue- Blue- 3

The script would then write the following lines to another text file:

Green- Blue- 1
Green- Purple- 1
Red- Black- 3
Brown- Blue- 3
Black- Red- 1

How would I go about writing this code? Thanks in advance.

First, the leading strings are completely irrelevant. Second, there are probably 100 ways to solve this problem. I'll just lay out my favorite one.

We can assume that the start boundary is always either:

a) the beginning of the list, or
b) the position right after a 3.

We can also assume that the end boundary is always either:

a) the end of the list, or
b) the position right before a 3.

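So let's build a new list:

threes = [-1, ..., len(numbers)]

where ... stands for the 0-based line number of every line that ends in 3. I add -1 (the position just before the first line) and len(numbers) (the position just past the last line) to the list to "pretend" that our list is surrounded by two 3s, which simplifies the logic.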
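As a minimal sketch of how that boundary list could be built, assuming the trailing digits have already been collected into a string and that line numbers are 0-based (so the trailing sentinel equals the line count). The name boundary_indices is hypothetical, not something from the answer.

```python
def boundary_indices(digits):
    """Return the 0-based positions of every '3' in digits, wrapped in sentinels."""
    positions = [i for i, d in enumerate(digits) if d == "3"]
    # -1 and len(digits) act as imaginary 3s just outside the list.
    return [-1] + positions + [len(digits)]


threes = boundary_indices("1333113313")
print(threes)  # [-1, 1, 2, 3, 6, 7, 9, 10]
```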

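Since the problem statement doesn't say otherwise, we can also assume that the chosen range will always contain at least two 3s whenever possible, because that gives us the largest range.

Now all we need to do is find the widest range of line numbers between two boundary 3s that encloses no more than the allowed number of 3s: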

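max_threes = 2  # allowed number of 3s inside the range (the question's example value)

# threes: 0-based line numbers of the lines ending in 3, with -1 in front
# and the total line count at the end as sentinels.
max_range = -1  # number of lines in the best range
max_start = -1  # first line of the best range
max_end = -1    # last line of the best range (inclusive)

if len(threes) <= max_threes + 2:
    # Special case: the list holds at most max_threes 3s, so take the whole list.
    max_start = threes[0] + 1
    max_end = threes[-1] - 1
    max_range = max_end - max_start + 1
else:
    # General case: pair each boundary with the one max_threes + 1 entries later,
    # so that exactly max_threes 3s lie strictly between the two.
    for i in range(len(threes) - max_threes - 1):
        start = threes[i] + 1
        end = threes[i + max_threes + 1] - 1
        span = end - start + 1

        if span > max_range:
            max_start, max_end, max_range = start, end, span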

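There are a few edge cases to sort out here, but this should get you started.

The first edge case (not really defined in the question) is what happens if the list ends with [1, 1, 1, 3, 3]. Should I choose 0-3, 0-4, or 0-5? All of these seem to be valid solutions. In this code I chose 0-5, because the question doesn't specify and it keeps the code simpler.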
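To make the edge case concrete, here is a rough sketch that wraps the boundary idea in one function and runs it on both the question's digits and the [1, 1, 1, 3, 3] case. The function name widest_range and its max_threes parameter are assumptions for illustration. Bounds are 0-based and inclusive, so the whole five-line list shows up as (0, 4).

```python
def widest_range(digits, max_threes=2):
    """Return inclusive 0-based (start, end) of the longest stretch of digits
    containing at most max_threes '3' characters (first one wins on ties)."""
    threes = [-1] + [i for i, d in enumerate(digits) if d == "3"] + [len(digits)]
    step = max_threes + 1
    if len(threes) <= step + 1:
        # Not enough 3s to exceed the limit: the whole list qualifies.
        return 0, len(digits) - 1
    candidates = [
        (threes[i] + 1, threes[i + step] - 1)
        for i in range(len(threes) - step)
    ]
    return max(candidates, key=lambda bounds: bounds[1] - bounds[0])


print(widest_range("1333113313"))  # (4, 8)
print(widest_range("11133"))       # (0, 4)
```

With a limit of one 3, the same call on "11133" returns (0, 3), which drops the final 3 but keeps the first one.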

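You really have no choice but to walk through the whole file and keep track of the largest sequence. Here is my take, wrapped in a function. It uses a stack and iterates over the file line by line, so it should be memory-efficient for large input files: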

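def foo(in_file, out_file, max_count):
    # biggest: longest window seen so far; stack: window currently being built
    biggest, stack = [], []
    count = 0  # number of lines ending in '3' in the current window
    with open(in_file) as f:
        for line in f:
            # line[-1] is the newline, so line[-2] is the trailing digit
            if line[-2] == '3':
                count += 1
            if count > max_count:
                if len(stack) > len(biggest):
                    biggest = list(stack)
                # trim the window up to and including its first line that ends with '3'
                stack = stack[stack.index(next(x for x in stack if x[-2] == '3')) + 1:]
                count = max_count
            stack.append(line)

    with open(out_file, 'w') as f:
        # compare by length; a plain max() would compare the two lists lexicographically
        f.write(''.join(max(biggest, stack, key=len)))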

Note: this only works as expected if the file ends with a newline, and it assumes that max_count is always greater than 0 (otherwise the call to next raises an unhandled exception).
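If those two restrictions are a problem, one possible variant (a sketch, not the answerer's code) keeps the same sliding-window idea but strips the newline before testing the last character and shrinks the window with a collections.deque. The name longest_window is hypothetical, the input is any iterable of lines (an open file works), blank lines are assumed to be skippable, and max_threes is assumed to be 0 or greater.

```python
from collections import deque


def longest_window(lines, max_threes):
    """Return the longest run of consecutive lines in which at most
    max_threes lines end in '3' (the first such run wins on ties)."""
    window = deque()  # lines currently under consideration
    threes = 0        # how many of them end in '3'
    best = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue  # assumption: blank lines carry no data
        window.append(line)
        if line.endswith("3"):
            threes += 1
        # Shrink from the left until the window is valid again.
        while threes > max_threes:
            if window.popleft().endswith("3"):
                threes -= 1
        if len(window) > len(best):
            best = list(window)
    return best


sample = [
    "Green- Blue- 1", "Red- Black- 3", "Brown- Blue- 3", "Black- Red- 3",
    "Green- Blue- 1", "Green- Purple- 1", "Red- Black- 3", "Brown- Blue- 3",
    "Black- Red- 1", "Blue- Blue- 3",
]
print(longest_window(sample, 2))
```

Copying the window each time it grows costs extra work on long runs; tracking a start index and length instead would avoid that, at the price of keeping all lines in memory.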

You could consider using itertools.groupby, combined with stored line indices:

txt = '''Green- Blue- 1
Red- Black- 3
Brown- Blue- 3
Black- Red- 3
Green- Blue- 1
Green- Purple- 1
Red- Black- 3
Brown- Blue- 3
Black- Red- 1
Blue- Blue- 3'''

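import operator
from itertools import groupby

# Pair every line with its 0-based index.
str_lst = list(enumerate(txt.split('\n')))

# Group consecutive (index, last character) pairs that share the same last character.
grp_lst = [list(g) for k, g in groupby([(k, v[-1]) for k, v in str_lst], key=operator.itemgetter(1))]
# For each run of 1s, keep its first (index, digit) pair and the run length.
filter_lst = [(i[0], len(i)) for i in grp_lst if i[0][1] == '1']

# Locate the longest run of 1s within grp_lst.
longest = max(filter_lst, key=operator.itemgetter(1))[0]
for idx, grp in enumerate(grp_lst):
    if grp[0] == longest:
        break

# Print that run together with the two groups that follow it (a run of 3s, then a run of 1s).
for i in sum(grp_lst[idx:idx + 3], []):
    print(str_lst[i[0]][1])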

Output:

Green- Blue- 1
Green- Purple- 1
Red- Black- 3
Brown- Blue- 3
Black- Red- 1
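For readers unfamiliar with itertools.groupby, this small sketch shows what the grouping step produces when applied to just the trailing digits of the sample. The helper name run_lengths is made up for this illustration.

```python
from itertools import groupby


def run_lengths(digits):
    """Collapse a digit string into (digit, run length) pairs."""
    return [(digit, len(list(group))) for digit, group in groupby(digits)]


print(run_lengths("1333113313"))
# [('1', 1), ('3', 3), ('1', 2), ('3', 2), ('1', 1), ('3', 1)]
```

The longest run of 1s is the third pair, and the answer above takes that group plus the two groups after it.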

Here is my solution.

First, read the file and extract only the data you actually need, that is, the last digit of each line:

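# txt holds the file contents as one string, one record per line
x = ''
for line in txt.split('\n'):
    try:
        x += line[-1]  # keep only the trailing digit
    except IndexError:
        pass  # skip empty lines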

You end up with a string that contains all the 1s and 3s in the order in which they appear, line after line:

>>> print(x)
1333113313
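The extraction step above works on a string named txt. As a rough sketch of doing the same thing directly on an open file (or any iterable of lines), something like the following could be used. last_digits is a hypothetical helper, and io.StringIO merely stands in for a real file here.

```python
import io


def last_digits(lines):
    """Join the final non-whitespace character of every non-blank line."""
    return "".join(line.rstrip()[-1] for line in lines if line.strip())


sample = io.StringIO("Green- Blue- 1\nRed- Black- 3\n\nBrown- Blue- 3\n")
print(last_digits(sample))  # 133
```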

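At this point you can iterate over that string and collect every possible substring that contains no more than two 3s. For each one, keep track of the index of its first character and of its length: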

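results = {}  # start index -> length of the longest valid substring starting there
for i in range(len(x)):
    # idx runs up to len(x) inclusive so that substrings reaching the end are considered too
    for idx in range(i + 1, len(x) + 1):
        if x[i:idx].count('3') <= 2:
            results[i] = idx - i
        else:
            break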

You can use this information to write the output file. In this example you end up saving 5 lines starting at index 4 (the fifth line of the file):

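# m is assumed to be the (start index, length) pair with the greatest length
m = max(results.items(), key=lambda item: item[1])
with open('myfile.txt', 'r') as inp, open('out.txt', 'w') as out:
    for line in inp.readlines()[m[0]:m[0] + m[1]]:
        out.write(line)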

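"All I can think of is an extremely inefficient approach using count." Well, please add that attempt to the question. You are more likely to get help if you show that you made an effort to solve the problem. I would also suggest removing the module recommendation request from your main question, because recommendation questions are off-topic for this site.

I was giving pseudocode rather than a language-specific answer. But fair enough, it has been Pythonized.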