Python - Extract strings from a text file until two blank new lines

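I have an input file, and I have to extract groups of lines that are separated by 2 blank new lines.

For example, the text file looks like this: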

1. Sometext
Sometext 
Sometext

2. Sometext
Sometext
Sometext

3. Sometext
Sometext
Sometext

Sometext which is not needed
Sometext which is not needed
Sometext which is not needed
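I have to extract one substring that starts at "1." and runs up to just before "2.", a second substring from "2." up to just before "3.", and so on. I have the script below, which produces output, but it also picks up all the "not needed" text that I don't want. Please see the code below: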

file_path = open("filename", "r")
content = file_path.read()
size = len(content)
start = 0
a = 1
b = 2
end = 0
ext = 0

while start < size:
    if end != -1:
        subString = content[content.find(str(a) + "."):content.find("\n" + str(b) + ".")]
        print(subString)
        end = content.find(str(b) + ".", start)
        print("\n")
        a = int(a) + 1  # increment to find the next start number
        b = int(b) + 1  # increment to find the next end number
        start = end + 1  # continuing to search the next
    else:
        break
Please help, and let me know if you have any questions.
Thanks in advance.
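
A likely reason the script above also returns the unneeded text: `str.find` returns -1 when the substring is absent, and -1 used as a slice end means "up to the last character". The following is a minimal sketch of that effect on an inline sample string; the variable names and the sample are illustrative and not taken from the original script.

```python
# Illustrative sample: two numbered sections followed by trailing text.
content = "1. A\nB\n\n2. C\nD\n\nnot needed\nnot needed"

start = content.find("2.")   # index where section 2 begins
end = content.find("\n3.")   # there is no section 3, so this is -1

# Slicing with end == -1 runs almost to the end of the string,
# so the trailing text leaks into the result.
leaked = content[start:end]

# Cutting at the next blank line (and guarding against -1) avoids the leak.
blank = content.find("\n\n", start)
section = content[start:blank] if blank != -1 else content[start:]

print(repr(leaked))
print(repr(section))
```

Checking the result of `find` before slicing is one way to stop at the last numbered section.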

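I'm not sure I understood your question correctly, but the code at the end of this answer will output:

['Sometext', 'Sometext', 'Sometext']
['Sometext', 'Sometext', 'Sometext']
['Sometext', 'Sometext', 'Sometext']
based on the text in your question. If you instead want everything from 1 up to 2 to be one whole substring, like this:

['1. Sometext\nSometext\nSometext']
['2. Sometext\nSometext\nSometext']
['3. Sometext\nSometext\nSometext']
you should change the if statement to:

if is_number(i[0]):
    substring = []
    substring.append(i)
    print(substring)
Otherwise, you can use the code below as it is: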

def is_number(string):
    try:
        float(string)
        return True
    except ValueError:
        return False

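with open('testing.txt', 'r') as f:
    # Split on blank lines; drop empty chunks so i[0] below cannot fail.
    content = [chunk for chunk in f.read().split('\n\n') if chunk]
for i in content:
    if is_number(i[0]):  # keep only sections that start with a digit
        c = i.split('\n')
        # Drop the leading "N. " marker from lines that start with a digit.
        substring = [line[3:] if line and is_number(line[0]) else line for line in c]
        print(substring)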
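
The `line[3:]` slice above assumes a single-digit number followed by a dot and a space. For files whose sections may be numbered 10 or higher, a regular expression could detect and strip the prefix instead. This is only a sketch: the function name `numbered_sections` and the sample string are assumptions for illustration, and unlike the code above it strips the prefix from the first line of each section only.

```python
import re

# Matches a leading section number such as "1. " or "12. ".
PREFIX = re.compile(r"\d+\.\s*")

def numbered_sections(text):
    """Return one list of lines per blank-line-separated section that starts with a number."""
    sections = []
    for chunk in text.split("\n\n"):
        match = PREFIX.match(chunk)
        if not match:
            continue  # skip chunks that do not start with "<number>."
        # Drop the numeric prefix, then split the rest into lines.
        sections.append(chunk[match.end():].split("\n"))
    return sections

sample = "1. Sometext\nSometext\n\n12. Sometext\nSometext\n\nSometext which is not needed"
result = numbered_sections(sample)
print(result)
```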

You would have to filter out the unneeded lines at the end, but this will get you what you want:

from itertools import groupby
with open("in.txt") as f:
    grps = groupby(f, key=lambda x: bool(x.strip()))
    print([list(v) for k,v in grps if k])
Output:

[['1. Sometext\n', 'Sometext\n', 'Sometext\n'], ['2. Sometext\n', 'Sometext\n', 'Sometext\n'], ['3. Sometext\n', 'Sometext\n', 'Sometext\n'], ['Sometext which is not needed\n', 'Sometext which is not needed\n', 'Sometext which is not needed']]
Since all the sections you want to keep start with a digit:

from itertools import groupby, takewhile

with open("in.txt") as f:
    grps = groupby(f, key=lambda x: bool(x.strip()))
    print (list(takewhile(lambda x: x[0][0].isdigit(),(list(v) for k,v in grps if k))))
Output:

[['1. Sometext\n', 'Sometext\n', 'Sometext\n'],
 ['2. Sometext\n', 'Sometext\n', 'Sometext\n'],
 ['3. Sometext\n', 'Sometext\n', 'Sometext\n']]
If you know there are n groups, you can slice:

from itertools import groupby, islice
with open("in.txt") as f:
    grps = groupby(f, key=lambda x: bool(x.strip()))
    print (list(islice((list(v) for k,v in grps if k),3)))
Output:

[['1. Sometext\n', 'Sometext\n', 'Sometext\n'],
 ['2. Sometext\n', 'Sometext\n', 'Sometext\n'],
 ['3. Sometext\n', 'Sometext\n', 'Sometext\n']]
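
If each section is wanted as a single string without the trailing newlines, the `groupby` and `takewhile` steps above can be combined with a strip and a join. The sketch below reads from an `io.StringIO` stand-in rather than `in.txt`, and the function name `sections_as_strings` is hypothetical.

```python
import io
from itertools import groupby, takewhile

def sections_as_strings(lines):
    """Join each numbered, blank-line-separated group of lines into one string."""
    # Group consecutive lines by whether they contain non-whitespace text.
    groups = groupby(lines, key=lambda line: bool(line.strip()))
    # Keep only the non-blank groups, stripping whitespace from each line.
    blocks = ([line.strip() for line in group] for has_text, group in groups if has_text)
    # Stop at the first block whose first line does not start with a digit.
    numbered = takewhile(lambda block: block[0][0].isdigit(), blocks)
    return ["\n".join(block) for block in numbered]

# Stand-in for the file object; any iterable of lines works.
sample = io.StringIO(
    "1. Sometext\nSometext\n\n"
    "2. Sometext\nSometext\n\n\n"
    "Sometext which is not needed\n"
)
result = sections_as_strings(sample)
print(result)
```

Because `groupby` treats any run of blank lines as one separator, the same code works whether sections are separated by one, two, or three empty lines.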

How does the "Sometext which is not needed" text differ from the other lines? Is it at the end of the file or something?

Is your sample data formatted correctly? There appear to be 3 newlines between each line.

@GriMel: "Sometext which is not needed" does not start with a number after the 2 new lines.

@glibdud: Sorry, I tried to format it with two newlines, but the text ended up on a single line, so I had to edit it this way. Please consider this just an example with only 2 lines of separation.

See the edit I suggested. Does it accurately represent what the data looks like?

I'm sorry if I wasn't clear enough. The code you wrote works fine, but only when the line starts with "1". I basically want to find the first occurrence of "1" anywhere in the file and continue through the last number (in the example it is "3", but it can be any number), ending with two new lines. I'll try to modify your code a bit and see if that helps, but if you work it out, please let me know. Thanks a lot, and tell me if I'm still unclear.

Thank you, Padraic. Your code works, but it takes the whole file as one value. I need the values split up (3 values, per the example). It also keeps the \n at the end of each line. How would I remove it? Thanks a lot for your help.

@Sanjivi, are you sure there are always 3 sections? Also, it is split into three sections: each sublist is one section. To remove the newlines we just need to strip them.

Sorry for the delayed response. No, there can be many sections; I only showed 3 as an example. Yes, each sublist is a section separated by 3 blank new lines. I want to extract each sublist as a single substring and feed it to my system.
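
One possible reading of the follow-up request in the comments is: skip any text before section 1, accept any number of sections, and stop at the first block that does not carry the next expected number. The sketch below works under that assumption and tolerates two or more newlines between blocks; `consecutive_sections` is a hypothetical name, and the sample string is made up for illustration.

```python
import re

def consecutive_sections(text):
    """Collect blocks numbered 1, 2, 3, ... and stop when the sequence breaks."""
    # Split on one or more blank (or whitespace-only) lines.
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]
    sections = []
    expected = 1
    for block in blocks:
        if block.startswith(f"{expected}."):
            sections.append(block)
            expected += 1
        elif sections:
            break  # the numbered run has ended; ignore everything after it
        # Otherwise we are still in the text before "1."; keep scanning.
    return sections

sample = (
    "Header text\n\n"
    "1. Sometext\nSometext\n\n\n"
    "2. Sometext\nSometext\n\n"
    "Sometext which is not needed\n\n"
    "3. ignored because the run already ended"
)
result = consecutive_sections(sample)
print(result)
```

Each element of the returned list is one section as a single string, with no trailing newline, so it can be passed on directly.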