Removing paratext (or 'noise') from txt files using Python


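I am preparing a corpus of text files consisting of 170 Dutch novels. I am a literary scholar and relatively new to both Python and programming. What I am trying to do is write a Python script that removes everything from each .txt file that is not part of the actual content of the novel (i.e. the story). The things I want to remove include added author biographies, blurbs, and other information that came along when the ePub was converted to .txt.

My idea is to decide manually, for each .txt file, on which line the actual content of the novel starts and on which line it ends. I use the following code block to delete everything in the .txt file that does not fall between those two line numbers: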

def removeparatext(inputFilename, outputFilename):
    inputfile = open(inputFilename,'rt', encoding='utf-8')
    outputfile = open(outputFilename, 'w', encoding='utf-8')

    for line_number, line in enumerate(inputfile, 1):
        if line_number >= 80 and line_number <= 2741: 
            outputfile.write(inputfile.readline())

    inputfile.close()
    outputfile.close()

removeparatext(inputFilename, outputFilename)

enumerate already provides each line alongside its index, so there is no need to call readline on the file object again. Doing so gives unexpected results: every readline call consumes the line after the current one, so it is more like reading the file object at double speed:
for line_number, line in enumerate(inputfile, 1):
    if line_number >= 80 and line_number <= 2741: 
        outputfile.write(line)
#                        ^^^^
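
To see the skipping concretely, here is a small sketch that runs both versions of the loop on an in-memory file. The helper names copy_range_buggy and copy_range_fixed and the ten-line sample are made up for illustration; io.StringIO stands in for a real file opened in text mode.

```python
import io

def copy_range_buggy(infile, start, end):
    """Mimic the original loop: readline() is called inside the enumerate loop."""
    kept = []
    for line_number, line in enumerate(infile, 1):
        if start <= line_number <= end:
            # readline() consumes the *next* line, so `line` itself is dropped
            kept.append(infile.readline())
    return kept

def copy_range_fixed(infile, start, end):
    """Corrected loop: use the line that enumerate already handed over."""
    kept = []
    for line_number, line in enumerate(infile, 1):
        if start <= line_number <= end:
            kept.append(line)
    return kept

# Hypothetical sample: ten numbered lines
sample = "".join(f"line {i}\n" for i in range(1, 11))

buggy = copy_range_buggy(io.StringIO(sample), 3, 6)
fixed = copy_range_fixed(io.StringIO(sample), 3, 6)

print(buggy)  # ['line 4\n', 'line 6\n', 'line 8\n', 'line 10\n']
print(fixed)  # ['line 3\n', 'line 4\n', 'line 5\n', 'line 6\n']
```

In the buggy version the counter and the file position drift apart: each time the condition is true, one line is consumed by the loop and a second one by readline, so the output starts one line late and contains only every other line.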


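In addition, by using a context manager with the with statement, you can open the files and leave the closing/cleanup to Python.

Thank you very much! Using itertools.islice works well for me. I already knew about opening a file with the with statement, but I didn't know how to use it when opening two files instead of one.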
from itertools import islice

def removeparatext(inputFilename, outputFilename):
    inputfile = open(inputFilename,'rt', encoding='utf-8')
    outputfile = open(outputFilename, 'w', encoding='utf-8')

    # use writelines to write sliced sequence of lines 
    outputfile.writelines(islice(inputfile, 79, 2741)) # indices start from zero

    inputfile.close()
    outputfile.close()

from itertools import islice

def removeparatext(inputFilename, outputFilename):
    with open(inputFilename,'rt', encoding='utf-8') as inputfile,\
         open(outputFilename, 'w', encoding='utf-8') as outputfile:    
        # use writelines to write sliced sequence of lines 
        outputfile.writelines(islice(inputfile, 79, 2741))


removeparatext(inputFilename, outputFilename)
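
Since the question mentions choosing a start and end line by hand for each of the 170 files, one possible way to extend the islice approach is to pass the line numbers as arguments and keep them in a dictionary. This is only a sketch: the names keep_lines, clean_corpus, ranges and the directory and file names in the usage comment are assumptions, not part of the original answer.

```python
from itertools import islice
from pathlib import Path

def keep_lines(input_path, output_path, first, last):
    """Copy lines first..last (1-based, inclusive) from input_path to output_path."""
    with open(input_path, "rt", encoding="utf-8") as infile, \
         open(output_path, "w", encoding="utf-8") as outfile:
        # islice counts from zero and excludes the stop index,
        # so 1-based lines first..last become islice(infile, first - 1, last)
        outfile.writelines(islice(infile, first - 1, last))

def clean_corpus(ranges, input_dir, output_dir):
    """Apply keep_lines to every file listed in ranges.

    ranges maps a file name to a (first, last) tuple chosen by hand.
    """
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for filename, (first, last) in ranges.items():
        keep_lines(Path(input_dir) / filename, out_dir / filename, first, last)

# Hypothetical usage; the file and directory names are placeholders:
# ranges = {"novel_001.txt": (80, 2741)}
# clean_corpus(ranges, "raw_txt", "clean_txt")
```

Writing the cleaned text to a separate output directory leaves the original files untouched, which makes it easy to adjust a line range and rerun.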