Fast and furious random reading of huge files



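I know this question is not new, but I have not found anything useful. In my case, I have a 20 GB file from which I need to read random lines. So far I have a simple file index that holds line numbers and the corresponding seek offsets. I also disabled buffering when reading, so that only the needed line is read.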

Here is my code:

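import io
import numpy as np

def create_random_file_gen(file_path, batch_size=0, dtype=np.float32, delimiter=','):
    # load_file_index and __get_features_from_line are helpers defined elsewhere.
    index = load_file_index(file_path)  # line number -> seek offset

    # A batch size of 0 (or one larger than the index) means "as many samples as lines".
    if (batch_size > len(index)) or (batch_size == 0):
        batch_size = len(index)

    # randint excludes the upper bound, so every value is a valid line number
    # (random_integers included it and could overrun the index by one).
    lines_indices = np.random.randint(0, len(index), batch_size)

    # buffering=0 opens the file as a raw, unbuffered stream.
    with io.open(file_path, 'rb', buffering=0) as f:
        for line_index in lines_indices:
            f.seek(index[line_index])
            line = f.readline(2048)  # read at most 2048 bytes of the line
            yield __get_features_from_line(line, delimiter, dtype)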
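For reference, the kind of index described above (line number to seek offset) could be built with a single sequential pass. This is only a minimal sketch: `build_line_index` is a hypothetical name, not the `load_file_index` helper used in my code, and the demo file is a throwaway.

```python
import os
import tempfile

def build_line_index(file_path):
    """Return a list where entry i is the byte offset at which line i starts."""
    offsets = []
    position = 0
    with open(file_path, 'rb') as f:
        for line in f:  # buffered sequential scan over the whole file
            offsets.append(position)
            position += len(line)  # len() of a bytes line is its size in bytes
    return offsets

# Tiny demonstration on a temporary file with three lines.
with tempfile.NamedTemporaryFile('wb', suffix='.csv', delete=False) as tmp:
    tmp.write(b"1.0,2.0\n3.5,4.5\n6,7\n")
    demo_path = tmp.name

demo_index = build_line_index(demo_path)

# Jump straight to the third line using its recorded offset.
with open(demo_path, 'rb') as f:
    f.seek(demo_index[2])
    third_line = f.readline()

os.remove(demo_path)
```

Opening the file in binary mode matters here: the offsets are byte positions, so they stay valid for a later `seek` regardless of text encoding.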

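The problem is that it is extremely slow: reading 5000 lines takes 89 seconds on my Mac (and that is on an SSD drive). I use the following code to test it: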

features_gen = tedlium_random_speech_gen(5000) # just a wrapper for function given above

i = 0
for feature, cls in features_gen:
    if i % 1000 == 0:
        print("Got %d features" % i)

    i += 1

print("Total %d features" % i)

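I have read some articles about memory-mapping files, but I do not really understand it: how does the mapping work under the hood, and would it speed up this process?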
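As far as I understand it, a memory map exposes the file as a bytes-like object and the OS loads pages on demand, so a random read becomes a slice instead of an explicit seek plus read. A minimal sketch using the standard `mmap` module is below. `read_lines_mmap` is a hypothetical helper name, the offsets are assumed to come from an index like mine, and I have not measured whether this is actually faster on the 20 GB file.

```python
import mmap
import os
import tempfile

def read_lines_mmap(file_path, offsets):
    """Yield the line starting at each byte offset, using a read-only memory map."""
    with open(file_path, 'rb') as f:
        # Length 0 maps the whole file; pages are only loaded when touched.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for offset in offsets:
                end = mm.find(b'\n', offset)  # locate the end of this line
                if end == -1:                 # last line without a trailing newline
                    end = len(mm)
                yield mm[offset:end]          # slicing copies the bytes out

# Tiny demonstration on a temporary file.
with tempfile.NamedTemporaryFile('wb', suffix='.csv', delete=False) as tmp:
    tmp.write(b"1.0,2.0\n3.5,4.5\n6,7\n")
    demo_path = tmp.name

# Offsets 16 and 0 are the starts of the third and first lines.
lines_out = list(read_lines_mmap(demo_path, [16, 0]))

os.remove(demo_path)
```

The returned lines do not include the trailing newline, unlike `readline`.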

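So the main question is: what are the possible ways to speed this up? The only approach I see right now is to read at random not individual lines but blocks of lines.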
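To make the block idea concrete, here is a rough sketch of what I have in mind: pick a random starting line, then fetch a run of consecutive lines with one seek and one sized read. `read_random_blocks` and its parameters are hypothetical, and `index` is assumed to be a list of byte offsets, one per line.

```python
import os
import random
import tempfile

def read_random_blocks(file_path, index, block_size, num_blocks, rng=random):
    """Yield lists of consecutive lines; each block costs one seek and one read."""
    file_size = os.path.getsize(file_path)
    last_start = max(len(index) - block_size, 0)
    with open(file_path, 'rb') as f:
        for _ in range(num_blocks):
            start = rng.randint(0, last_start)  # random.randint includes both ends
            stop = start + block_size
            # The block ends where the line after it begins, or at end of file.
            end_offset = index[stop] if stop < len(index) else file_size
            f.seek(index[start])
            chunk = f.read(end_offset - index[start])
            yield chunk.splitlines()

# Tiny demonstration: four 4-byte lines, blocks of two lines each.
with tempfile.NamedTemporaryFile('wb', suffix='.csv', delete=False) as tmp:
    tmp.write(b"a,1\nb,2\nc,3\nd,4\n")
    demo_path = tmp.name

demo_index = [0, 4, 8, 12]
blocks = list(read_random_blocks(demo_path, demo_index, 2, 3, rng=random.Random(0)))

os.remove(demo_path)
```

The obvious trade-off is that lines inside a block are neighbours in the file, so the samples are no longer independent of each other. A single sized read per block should also avoid the many small reads that `readline` probably issues on an unbuffered file.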