Reading files in parallel with Python

python file

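I have a bunch of files (almost 100) which contain data in the format: (number of people)\t(average age)

These files were generated from a random walk conducted on a population of a certain demographic. Each file has 100,000 lines, corresponding to the average age of populations of size 1 to 100,000. Each file corresponds to a different locality in a third-world country. We will be comparing these values to the average ages of similar-sized localities in a developed country.

What I want to do is: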

for each i (i ranges from 1 to 100,000):
  Read in the first 'i' values of average-age
  perform some statistics on these values

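This means that for each run i (i ranging from 1 to 100,000), I read in the first i values of average age, add them to a list, and run a few tests (such as Kolmogorov-Smirnov or chi-square).

To open all these files in parallel, I figured the best way would be a dictionary of file objects. But I am stuck trying to do the operations above.

Is my approach the best one (complexity-wise)?

Is there a better approach?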

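Actually, it would be possible to hold 10,000,000 lines in memory.

Make a dictionary where the keys are the number of people and the values are lists of average age, where each element of a list comes from a different file. So, if there are 100 files, each of your lists will have 100 elements.

This way, you don't need to store the file objects in a dict.

Hope this helps.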
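The dictionary described in this answer could be built roughly as follows. This is only a minimal sketch: `sample_files` and `load_by_population` are hypothetical names, `io.StringIO` objects stand in for the real files, and the third sample is deliberately shorter to mimic a file with fewer lines.

```python
from collections import defaultdict
from io import StringIO

# Hypothetical stand-ins for the real data files: "population\taverage age".
sample_files = [
    StringIO("1\t10\n2\t11\n3\t12\n"),
    StringIO("1\t13\n2\t14\n3\t15\n"),
    StringIO("1\t16\n2\t17\n"),  # a shorter file
]

def load_by_population(file_objects):
    """Map each population size to the average ages found for it across files."""
    ages_by_population = defaultdict(list)
    for fh in file_objects:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            population, average_age = line.split("\t")
            ages_by_population[int(population)].append(float(average_age))
    return dict(ages_by_population)

ages = load_by_population(sample_files)
print(ages[2])  # the ages recorded for population size 2
```

With real data, each `StringIO` would be replaced by an `open(...)` call, and only one file needs to be open at a time.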

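I don't know that I like this approach, but it might work for you. It has the potential to consume large amounts of memory, but may do what you need. I am assuming that your data files are numbered. If that is not the case, this may need adapting.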

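# Open the files (assumes they are named file-1.txt ... file-100.txt).
handles = [open('file-%d.txt' % i) for i in range(1, 101)]

# Loop over the number of lines.
for line in range(100000):
  # readline() returns '' once a file has no more lines.
  lines = [fh.readline() for fh in handles]

  # Some sort of processing for the list of lines.

# Close the files when done.
for fh in handles:
  fh.close()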

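This might be close to what you need, but again, I don't know that I like it. If you have any files that don't have the same number of lines, this might run into trouble.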
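One way to cope with files of unequal length when reading them in lock-step is sketched below. It relies on `readline()` returning an empty string at end of file. The helper name `read_in_lockstep` and the `StringIO` stand-ins are hypothetical, and the sketch assumes the files contain no blank lines before their end.

```python
from io import StringIO

# Hypothetical stand-ins for open file handles of unequal length.
handles = [
    StringIO("1\t10\n2\t11\n3\t12\n"),
    StringIO("1\t13\n2\t14\n"),
]

def read_in_lockstep(handles):
    """Yield one list per line number with the ages from files that still have data."""
    while True:
        lines = [fh.readline() for fh in handles]
        # readline() returns '' at end of file, so exhausted files drop out here.
        live = [ln for ln in lines if ln.strip()]
        if not live:
            return  # every file is exhausted
        yield [float(ln.split("\t")[1]) for ln in live]

rows = list(read_in_lockstep(handles))
print(rows)
```

Each yielded list only contains values from the files that reach that line, so later rows may be shorter than earlier ones.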

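Why not take a simple approach:

Open each file sequentially and read its lines to fill an in-memory data structure
Perform the statistics on the in-memory data structure

Here is a self-contained example with 3 "files", each containing 3 lines. It uses StringIO for convenience instead of actual files: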

#!/usr/bin/env python3
# coding: utf-8

from io import StringIO

# for this example, each "file" has 3 lines instead of 100000
f1 = '1\t10\n2\t11\n3\t12'
f2 = '1\t13\n2\t14\n3\t15'
f3 = '1\t16\n2\t17\n3\t18'

files = [f1, f2, f3]

# data is a list of dictionaries mapping population to average age
# i.e. data[0][10000] contains the average age in location 0 (files[0]) with
# population of 10000.
data = []

for i, filename in enumerate(files):
    f = StringIO(filename)
    # f = open(filename, 'r')
    data.append(dict())

    for line in f:
        population, average_age = (int(s) for s in line.split('\t'))
        data[i][population] = average_age

print(data)

# gather custom statistics on the data

# i.e. here's how to calculate the average age across all locations where
# population is 2 (integer division, so the result is a whole number of years):
num_locations = len(data)
pop2_avg = sum(data[loc][2] for loc in range(num_locations)) // num_locations
print('Average age with population 2 is', pop2_avg, 'years old')

The output is:

[{1: 10, 2: 11, 3: 12}, {1: 13, 2: 14, 3: 15}, {1: 16, 2: 17, 3: 18}]
Average age with population 2 is 14 years old
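Once the values are in memory, statistics over "the first i values" do not require re-reading anything. As a rough sketch, a statistic that can be updated incrementally, such as the mean, can be computed for every prefix in one pass. The values and the `running_means` helper below are hypothetical, and tests that need the whole prefix at once (Kolmogorov-Smirnov, for example) would instead operate on slices like `ages[:i]`.

```python
from itertools import accumulate

# Average ages for one location, ordered by population size 1, 2, 3, ...
# (hypothetical values in the style of the example above).
ages = [10, 11, 12, 16]

def running_means(values):
    """Return the mean of the first i values for i = 1 .. len(values)."""
    totals = accumulate(values)  # running sums: 10, 21, 33, 49
    return [total / i for i, total in enumerate(totals, start=1)]

means = running_means(ages)
print(means)
```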

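"Read in the first i average ages of all files (put them in a list or something)"? What does that mean? Does it mean: for i in range(100): read i lines from the files? If so, please update your algorithm.

If the files are small, you will add overhead by accessing all of them at the same time, because of the GIL and because the files are on the same hard disk.

There are 100,000 lines in each file. I want to read the first i values, for i ranging from 1 to 100,000.

I don't think he is using the word "parallel" in the threading sense (are you, Craig?).

This might not be the answer you are looking for, but this is exactly the kind of question relational databases are designed to answer. I would set up some sort of SQL DB, do a quick load of everything into it, and then you will have much more success, without the overhead of repeatedly loading and reading the files (even if you do it in a rolling manner).

Yes, the files have differing numbers of lines. This happens because the random walk failed in some runs.

@Craig - I just ran a quick test, and it looks like readline() returns an empty string when it reaches the end of the file. That should make it easy to test for during your processing, and it doesn't look like it raises an exception.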
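The relational-database suggestion from the comments could look roughly like this, using the `sqlite3` module from the standard library. The table name, column names, sample rows, and the `first_i_ages` helper are all assumptions made for illustration; nothing here comes from the original poster's setup.

```python
import sqlite3

# Hypothetical rows: (location id, population size, average age).
rows = [
    (0, 1, 10.0), (0, 2, 11.0), (0, 3, 12.0),
    (1, 1, 13.0), (1, 2, 14.0), (1, 3, 15.0),
]

conn = sqlite3.connect(":memory:")  # use a file path to persist the data
conn.execute(
    "CREATE TABLE ages (location INTEGER, population INTEGER, average_age REAL)"
)
conn.executemany("INSERT INTO ages VALUES (?, ?, ?)", rows)
conn.execute("CREATE INDEX idx_population ON ages (population)")

def first_i_ages(conn, i):
    """Return all average ages with population size <= i, across locations."""
    cursor = conn.execute(
        "SELECT average_age FROM ages WHERE population <= ? "
        "ORDER BY location, population",
        (i,),
    )
    return [age for (age,) in cursor]

print(first_i_ages(conn, 2))
```

The data is loaded once, and each value of i then becomes a single query instead of another pass over the files.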