Python: How to Make the Contents of a Large Text File Unique

I have a text file with 34,686,770 lines. Every line is between 50 and 250 characters long. Some lines occur more than once, and I want to make all of the lines unique.

I can't store all of these lines in a list in order to deduplicate them. How can I do this? The file contains lines like:

Only has limited access to OBDII data stream unless you pay more money to upgrade the software.
I thought the author should have used more dialogue. It reads like a history book.
I thought the author should have used more dialogue. It reads like a history book.
I have to end up with a file containing only the unique lines:

Only has limited access to OBDII data stream unless you pay more money to upgrade the software.
I thought the author should have used more dialogue. It reads like a history book.
How can I do this?

Using shell tools:

$ cat in.txt 
Only has limited access to OBDII data stream unless you pay more money to upgrade the software.
I thought the author should have used more dialogue. It reads like a history book.
I thought the author should have used more dialogue. It reads like a history book.
$ sort < in.txt | uniq
I thought the author should have used more dialogue. It reads like a history book.
Only has limited access to OBDII data stream unless you pay more money to upgrade the software.
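
Note that sort -u in.txt > unique.txt does both steps in one command. GNU sort spills to temporary files on disk (an external merge sort), so this approach also works for files larger than RAM.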
Without storing all of the text in memory:

seen = set()
with open('text.txt') as text, open('unique.txt', 'w') as output:
    for line in text:
        # store only the line's hash, not the line itself
        line_hash = hash(line)
        if line_hash not in seen:
            output.write(line)
            seen.add(line_hash)

Instead of the lines themselves, we store a much smaller hash of each line. Of course, hash collisions are possible, and in that case this code would skip a unique line that should have been kept.
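
If that collision risk is a concern, a minimal variant (a sketch, not part of the original answer) is to store a fixed-size digest from the standard-library hashlib module instead of the built-in hash(); at 16 bytes of SHA-256 per line, an accidental collision is astronomically unlikely:

import hashlib

seen = set()
with open('text.txt', 'rb') as text, open('unique.txt', 'wb') as output:
    for line in text:
        # keep 16 bytes of the SHA-256 digest per line seen so far
        digest = hashlib.sha256(line).digest()[:16]
        if digest not in seen:
            output.write(line)
            seen.add(digest)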

If you can't load the file into memory, why not split it smartly into smaller files and work on those? You only need to know that identical lines will end up in the same file, and you actually want some collisions so that you don't end up with a huge number of files.

Here is a script that takes a prefix of each line (this can obviously be changed) and puts each line into the file corresponding to its prefix.

This is effectively a hash map, just not in memory, because your RAM can't handle the amount of data you're trying to process.

The result is many smaller files (buckets, if you will...) in which all occurrences of a given line are grouped into the same file (same prefix). Each file can be deduplicated individually, and the files then merged into the result file.

Here's how it's done:

Initialize the program to read the input.txt file, hash/split with a prefix size of 2, and write the result to output.txt:

import os

input_file_name = 'input.txt'
split_folder = 'splits'
prefix_size = 2
Create the folder for the split files that will hold similar and identical lines:

# create hash files folder
if not os.path.exists(split_folder):
    os.makedirs(split_folder)
A line-distribution function that puts a line into the given file:

# a function to put a line in a file
def put_in_file(file_name, line):
    with open(os.path.join(split_folder, file_name), 'a') as f:
        f.write(line)
A hash function that deliberately produces some collisions (that's good here) while guaranteeing that identical lines end up in the same file:

def prefix_hash(line):
    return line[:prefix_size]
Now, distribute the lines into their smaller files (hash "buckets", if you like):
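
with open(input_file_name) as f:
    # convenience method
    def putter(line):
        put_in_file(prefix_hash(line), line)

    for line in f:
        putter(
            line + (os.linesep if not line.endswith(os.linesep) else '')
        )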

Generate a list of the file names that were created:

# a list rather than a lazy map iterator, because it is iterated twice below
split_file_names = [
    os.path.join(split_folder, x) for x in os.listdir(split_folder)
]
Remove the duplicate lines within each of the smaller files:

for split_file_name in split_file_names:
    # dedup each file
    with open(split_file_name, 'r') as f:
        unique_lines = set(f.readlines())
    with open(split_file_name, 'w') as f:
        f.write(''.join(unique_lines))
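
Note that set() discards the original line order inside each bucket. If first-occurrence order matters, one option (an assumption, not part of the original answer) is dict.fromkeys(), whose keys keep insertion order in Python 3.7+:

with open(split_file_name) as f:
    # dict keys are insertion-ordered, so first occurrences stay in order
    unique_lines = list(dict.fromkeys(f))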
Join the smaller files into the result file:

output_file = "output.txt"
with open(output_file, 'w') as of:
    for split_file_name in split_file_names:
        with open(split_file_name, 'r') as f:
            of.write(f.read())
The whole thing:

import os

input_file_name = 'input.txt'
split_folder = 'splits'
prefix_size = 2

# create hash files folder
if not os.path.exists(split_folder):
    os.makedirs(split_folder)

# a function to put a line in a file
def put_in_file(file_name, line):
    with open(os.path.join(split_folder, file_name), 'a') as f:
        f.write(line)

def prefix_hash(line):
    return line[:prefix_size]

with open(input_file_name) as f:
    # convenience method
    def putter(line):
        put_in_file(prefix_hash(line), line)

    for line in f:
        putter(
            line + (os.linesep if not line.endswith(os.linesep) else '')
        )

# a list rather than a lazy map iterator, because it is iterated twice below
split_file_names = [
    os.path.join(split_folder, x) for x in os.listdir(split_folder)
]

for split_file_name in split_file_names:
    # dedup each file
    with open(split_file_name, 'r') as f:
        unique_lines = set(f.readlines())
    with open(split_file_name, 'w') as f:
        f.write(''.join(unique_lines))

output_file = "output.txt"
with open(output_file, 'w') as of:
    for split_file_name in split_file_names:
        with open(split_file_name, 'r') as f:
            of.write(f.read())

Note: to speed this up, you should keep the file handles open the whole time, and possibly have a few threads pass lines between them through queues (this avoids waiting on I/O as well as on opening and closing files). I can add that later if anyone needs it.
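
A minimal sketch of the keep-the-handles-open idea (my reading of the note above, not code from the answer), reusing split_folder and os from the script: cache one open handle per bucket in a dict and close them all at the end. With a prefix size of 2, the bucket count stays far below typical per-process file-descriptor limits.

handles = {}

def put_in_file_cached(file_name, line):
    # open each bucket file once and reuse the handle afterwards
    if file_name not in handles:
        handles[file_name] = open(os.path.join(split_folder, file_name), 'a')
    handles[file_name].write(line)

# ... distribute lines with put_in_file_cached instead of put_in_file ...

for handle in handles.values():
    handle.close()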

Can't you just put them all in memory at once?
I can't. The text file is 5 GB in size; my RAM won't allow it.
You could hash each line and check whether the hash already exists in a set.
@EdChum, I think your idea will help. Can you give me some sample code?