Python: remove duplicates from a .txt file and create a new .txt file

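I have a .txt file filled with data that I want to filter (about 5,800 lines), because some lines are duplicates whose only difference is a timestamp exactly 2 hours later. Lines that are the later version of a duplicate (for example, the first line in the sample below) should be omitted. All other lines should be kept and written to a new .txt file.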

1_3_IMM 2016-07-19 16:11:56 00:00:40    2   Sensor Check   #   should go
1_3_IMM 2016-07-19 14:12:40 00:00:33    2   Sensor Check   #   should stay
1_3_IMM 2016-07-19 14:11:56 00:00:40    2   Sensor Check   #   should stay
1_3_IMM 2016-07-19 16:12:40 00:00:33    2   Sensor Check   #   should go
1_4_IMM 2016-07-19 17:23:25 00:00:20    2   Sensor Check   #   should stay
1_4_IMM 2016-07-19 19:23:25 00:00:20    2   Sensor Check   #   should go
1_4_IMM 2016-07-19 19:15:24 00:02:21    2   Sensor Check   #   should stay
1_4_IMM 2016-07-19 19:25:13 00:02:13    2   Sensor Check   #   should stay

I wrote some code in Python, but the output is a .txt file containing only a single line of text:

deleted

I can't seem to solve this. Can you help? Please see the code below.

import os

def filter_file():
    with open("output.txt", "w") as output: 
        #open the input file from a specified directory
        directory = os.path.normpath("C:/Users/sande_000/Documents/Python files")
        for subdir, dirs, files in os.walk(directory):
            for file in files:
                if file.startswith("input"):
                    input_file=open(os.path.join(subdir, file))
                    #iterate over each line of the file
                    for line in input_file:
                        machine = line[0:7]             #stores machine number
                        date = line[8:18]               #stores date stamp
                        time_1 = int(line[19:21])       #stores hour stamp
                        time_2 = int(line[22:24])       #stores minutes stamp
                        time_3 = int(line[25:27])       #stores second stamp
                        #check current line with other lines for duplicates by iterating over each line of the file
                        for otherline in input_file:
                            compare_machine = otherline[0:7]            
                            compare_date = otherline[8:18]
                            compare_time_1 = int(otherline[19:21])+2
                            compare_time_2 = int(otherline[22:24])
                            compare_time_3 = int(otherline[25:27])
                            #check whether machine number & date/hour+2/minutes/seconds stamp are similar.
                            #If yes, write 'deleted' to output.txt and stop comparing lines.
                            #If no, continue with comparing next line.
                            if compare_machine == machine and compare_date == date and compare_time_1 == time_1 and compare_time_2 == time_2 and compare_time_3 == time_3:
                                output.write("deleted"+"\n")
                                break
                            else:
                                continue
                            #If no overlap between one line with any other line from the file, write that line to output.txt since it is no duplicate.
                            output.write(line)

                    input_file.close()

if __name__ == "__main__":
    filter_file()
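A likely reason the code above produces a single "deleted" line is that both loops iterate over the same file object. A file object is its own iterator, so the inner loop does not restart at the top of the file; it consumes the lines the outer loop has not seen yet. In addition, the final output.write(line) sits after an if/else whose branches both leave the loop body (break or continue), so it is never reached. The sketch below reproduces the shared-iterator effect with io.StringIO standing in for the opened file; the four-line content is made up for illustration.

```python
import io

# Hypothetical stand-in for the opened input file: four short lines.
handle = io.StringIO("a\nb\nc\nd\n")

pairs = []
for line in handle:
    # The inner loop keeps consuming the very same iterator that the
    # outer loop is using, so it starts at "b", not at "a".
    for other in handle:
        pairs.append((line.strip(), other.strip()))

# Only the first line was ever compared with anything.
print(pairs)

# Reading the lines into a list first gives every loop its own full pass.
lines = io.StringIO("a\nb\nc\nd\n").read().splitlines()
all_pairs = [(x, y) for x in lines for y in lines if x != y]
print(len(all_pairs))
```

With real files the same effect can be avoided by reading the lines into a list, or by calling seek(0) on the file before iterating over it again.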

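I think this shorter code should do it. It uses two consecutive loops (one to collect the entries, one to check them) instead of nested loops, which will improve performance.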

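from datetime import datetime, timedelta

# os.walk etc. as in the question; `files`, `input_file` and `output` come from there
for file in files:
    if not file.startswith("input"):
        continue
    entries = set()
    # accumulate entries
    for line in input_file:
        machine = line[0:7]  # stores machine number
        date = datetime.strptime(line[8:27], "%Y-%m-%d %H:%M:%S")
        entries.add((machine, date))
    # rewind: the first loop has consumed the file iterator
    input_file.seek(0)
    # check entries
    for line in input_file:
        machine = line[0:7]  # stores machine number
        date = datetime.strptime(line[8:27], "%Y-%m-%d %H:%M:%S") - timedelta(hours=2)
        if (machine, date) in entries:
            output.write("deleted\n")
        else:
            output.write(line)
    output.flush()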
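For anyone who wants to try the two-pass idea without the directory walk, here is a minimal self-contained sketch. It is not the answer's exact code: the names parse_key, drop_late_duplicates and sample are made up for illustration, the fields are split on whitespace instead of fixed slices, and later duplicates are omitted (as the question text asks) rather than replaced with "deleted".

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"
OFFSET = timedelta(hours=2)  # duplicates are stamped exactly two hours later


def parse_key(line):
    """Return (machine, timestamp) for one log line (hypothetical helper)."""
    machine, date_s, time_s = line.split()[:3]
    return machine, datetime.strptime(f"{date_s} {time_s}", FMT)


def drop_late_duplicates(lines):
    """Keep a line unless the same machine logged exactly two hours earlier."""
    lines = [ln for ln in lines if ln.strip()]  # ignore blank lines
    seen = {parse_key(ln) for ln in lines}      # pass 1: collect all keys
    kept = []
    for ln in lines:                            # pass 2: filter
        machine, stamp = parse_key(ln)
        if (machine, stamp - OFFSET) not in seen:
            kept.append(ln)
    return kept


# Sample data taken from the question, without the trailing annotations.
sample = [
    "1_3_IMM 2016-07-19 16:11:56 00:00:40    2   Sensor Check",
    "1_3_IMM 2016-07-19 14:12:40 00:00:33    2   Sensor Check",
    "1_3_IMM 2016-07-19 14:11:56 00:00:40    2   Sensor Check",
    "1_3_IMM 2016-07-19 16:12:40 00:00:33    2   Sensor Check",
    "1_4_IMM 2016-07-19 17:23:25 00:00:20    2   Sensor Check",
    "1_4_IMM 2016-07-19 19:23:25 00:00:20    2   Sensor Check",
    "1_4_IMM 2016-07-19 19:15:24 00:02:21    2   Sensor Check",
    "1_4_IMM 2016-07-19 19:25:13 00:02:13    2   Sensor Check",
]

for ln in drop_late_duplicates(sample):
    print(ln)
```

Because the lines are held in a list, the input order is preserved and the log does not need to be sorted by date.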

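I believe the code below works. Note that, because datetime does not support resolution finer than microseconds, this code will not work if your records vary in the smallest three time components (milliseconds, microseconds, nanoseconds). In your example this makes no difference.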

import os
from datetime import datetime, timedelta

INPUT_DIR = r'C:\Temp'  # raw string so the backslash is not read as an escape
OUTPUT_FILE = 'output.txt'


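def parse_data(data):
    """Yield (line, timestamp) for every non-blank line."""
    for line in data.splitlines():
        if not line.strip():
            continue  # skip blank lines, e.g. between concatenated files
        date_s = ' '.join(line.split()[1:3])
        date = datetime.strptime(date_s, '%Y-%m-%d %H:%M:%S')
        yield line, date


def filter_duplicates(data):
    """Yield lines that have no counterpart stamped exactly two hours earlier."""
    duplicate_offset = timedelta(hours=2)

    parsed_data = list(parse_data(data))
    # A set gives constant-time lookups and also copes with empty input.
    dates = {date for _, date in parsed_data}

    for line, date in parsed_data:
        if (date - duplicate_offset) not in dates:
            yield line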


def get_input_data_from_dir(directory):
    data = ''
    for sub_dir, _, files in os.walk(directory):
        for file in files:
            if file.startswith('input'):
                with open(os.path.join(sub_dir, file)) as f:
                    data += f.read() + '\n'

    return data


if __name__ == '__main__':
    data = get_input_data_from_dir(INPUT_DIR)
    with open(OUTPUT_FILE, 'w') as f_out:
        content = '\n'.join(filter_duplicates(data))
        f_out.write(content)

Tested with an input directory that has the following structure:

me@my-computer /cygdrive/c/Temp
$ tree
.
├── input_1.txt
└── input_2.txt

input_1.txt:

1_3_IMM 2016-07-19 16:11:56 00:00:40    2   Sensor Check
1_3_IMM 2016-07-19 14:12:40 00:00:33    2   Sensor Check
1_3_IMM 2016-07-19 14:11:56 00:00:40    2   Sensor Check
1_3_IMM 2016-07-19 16:12:40 00:00:33    2   Sensor Check

input_2.txt:

1_4_IMM 2016-07-19 17:23:25 00:00:20    2   Sensor Check
1_4_IMM 2016-07-19 19:23:25 00:00:20    2   Sensor Check
1_4_IMM 2016-07-19 19:15:24 00:02:21    2   Sensor Check
1_4_IMM 2016-07-19 19:25:13 00:02:13    2   Sensor Check

After execution, output.txt:

1_3_IMM 2016-07-19 14:12:40 00:00:33    2   Sensor Check
1_3_IMM 2016-07-19 14:11:56 00:00:40    2   Sensor Check
1_4_IMM 2016-07-19 17:23:25 00:00:20    2   Sensor Check
1_4_IMM 2016-07-19 19:15:24 00:02:21    2   Sensor Check
1_4_IMM 2016-07-19 19:25:13 00:02:13    2   Sensor Check

For convenience, the expected output from the question is reproduced below:

1_3_IMM 2016-07-19 16:11:56 00:00:40    2   Sensor Check   #   should go
1_3_IMM 2016-07-19 14:12:40 00:00:33    2   Sensor Check   #   should stay
1_3_IMM 2016-07-19 14:11:56 00:00:40    2   Sensor Check   #   should stay
1_3_IMM 2016-07-19 16:12:40 00:00:33    2   Sensor Check   #   should go
1_4_IMM 2016-07-19 17:23:25 00:00:20    2   Sensor Check   #   should stay
1_4_IMM 2016-07-19 19:23:25 00:00:20    2   Sensor Check   #   should go
1_4_IMM 2016-07-19 19:15:24 00:02:21    2   Sensor Check   #   should stay
1_4_IMM 2016-07-19 19:25:13 00:02:13    2   Sensor Check   #   should stay

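Surely 14:12:40 should stay and 16:12:40 should go, right? Otherwise there is no logic in your text file: in one case the earlier line should go, in the other the earlier line should stay. Also, the log file is not sorted by date. Does the order matter?

That was indeed a mistake; the text file should be logical now. To be clear: the earlier line should stay and the later line should go. The log file is indeed not sorted by date, and there is nothing I can do about that.

I tried it several times and it looks clean, but it fails to write anything to output.txt. Just an empty file…

You may need to call output.flush() after writing.

Unfortunately, this code has a problem when the two-hour offset crosses midnight. You should use datetime and timedelta to compare.

@hansaplast You are right. I added this (and also .flush()).

I wrote almost the same Python code and was about 80% done when I saw your post. Upvoted. One issue, though: the author asks to write "deleted" when a duplicate is seen.

@hansaplast "The author asks to write 'deleted' when a duplicate is seen": where did you see that? The OP wrote: "Lines that are the later version of a duplicate (for example, the first line in the sample) should be omitted."

@hansaplast Oh, I see. I read the question body but not the comments in his code. I will try to update my answer.

Great code. One thing: this prints the output, but I would like the output written to a new output.txt file.

@tagc Yes, you are right, I was wrong. It was in the code, so I assumed that was the intended behaviour.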