Efficient file reading in Python when the file needs to be split on '\n'


I have been reading files like this:

file = open(fullpath, "r")
allrecords = file.read()
delimited = allrecords.split('\n')
for record in delimited[1:]:
    record_split = record.split(',')

But when I process these files in multiprocessing, I seem to get a MemoryError. How can I best read the files line by line, when the text files I am reading need to be split on '\n'?

Here is the multiprocessing code:

from multiprocessing import Pool
import time

pool = Pool()
fixed_args = (targetdirectorytxt, value_dict)
varg = ((filename,) + fixed_args for filename in readinfiles)
op_list = pool.map_async(PPD_star, list(varg), chunksize=1)
while not op_list.ready():
    print("Number of files left to process: {}".format(op_list._number_left))
    time.sleep(60)
op_list = op_list.get()
pool.close()
pool.join()
Here is the error log:

Exception in thread Thread-3:
Traceback (most recent call last):
  File "C:\Python27\lib\threading.py", line 810, in __bootstrap_inner
    self.run()
  File "C:\Python27\lib\threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "C:\Python27\lib\multiprocessing\pool.py", line 380, in _handle_results
    task = get()
MemoryError
I am trying to install pathos as Mike has kindly suggested, but I am running into some problems. Here is my install command:

pip install https://github.com/uqfoundation/pathos/zipball/master --allow-external pathos --pre
But here is the error message I am getting:

Downloading/unpacking https://github.com/uqfoundation/pathos/zipball/master
  Running setup.py (path:c:\users\xxx\appdata\local\temp\2\pip-1e4saj-build\setup.py) egg_info for package from https://github.com/uqfoundation/pathos/zipball/master

Downloading/unpacking ppft>=1.6.4.5 (from pathos==0.2a1.dev0)
  Running setup.py (path:c:\users\xxx\appdata\local\temp\2\pip_build_jptyuser\ppft\setup.py) egg_info for package ppft

    warning: no files found matching 'python-restlib.spec'
Requirement already satisfied (use --upgrade to upgrade): dill>=0.2.2 in c:\python27\lib\site-packages\dill-0.2.2-py2.7.egg (from pathos==0.2a1.dev0)
Requirement already satisfied (use --upgrade to upgrade): pox>=0.2.1 in c:\python27\lib\site-packages\pox-0.2.1-py2.7.egg (from pathos==0.2a1.dev0)
Downloading/unpacking pyre==0.8.2.0-pathos (from pathos==0.2a1.dev0)
  Could not find any downloads that satisfy the requirement pyre==0.8.2.0-pathos (from pathos==0.2a1.dev0)
  Some externally hosted files were ignored (use --allow-external pyre to allow).
Cleaning up...
No distributions at all found for pyre==0.8.2.0-pathos (from pathos==0.2a1.dev0)

Storing debug log for failure in C:\Users\xxx\pip\pip.log
I am running 64-bit Windows 7. In the end I managed to install it with easy_install.

But now I am failing because I cannot open that many files:

Finished reading in Exposures...
Reading Samples from:  C:\XXX\XXX\XXX\
Traceback (most recent call last):
  File "events.py", line 568, in <module>
    mdrcv_dict = ReadDamages(damage_dir, value_dict)
  File "events.py", line 185, in ReadDamages
    res = thpool.amap(mppool.map, [rstrip]*len(readinfiles), files)
  File "C:\Python27\lib\site-packages\pathos-0.2a1.dev0-py2.7.egg\pathos\multipr
ocessing.py", line 230, in amap
    return _pool.map_async(star(f), zip(*args)) # chunksize
  File "events.py", line 184, in <genexpr>
    files = (open(name, 'r') for name in readinfiles[0:])
IOError: [Errno 24] Too many open files: 'C:\\xx.csv'

How can I perform the same functionality with pathos.multiprocessing?

Just iterate over the lines instead of reading in the whole file. Like this:

import os

with open(os.path.join(txtdatapath, pathfilename), "r") as data:
    for dataline in data:
        split_line = dataline.split(',')
        if len(split_line) > 1:
            # process each record here
            pass
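
Since the records are comma-separated, the csv module (also suggested in the comments below) can do the splitting for you. A minimal sketch, assuming the same txtdatapath and pathfilename variables as above:

import csv
import os

with open(os.path.join(txtdatapath, pathfilename), "rb") as data:
    # csv.reader yields one list of fields per line, splitting on commas
    for split_line in csv.reader(data):
        if len(split_line) > 1:
            # process each record here
            pass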
Try this:

for line in file('file.txt'):
    print line.rstrip()
Of course, instead of printing them you could also append them to a list or perform some other operation on them.
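
For example, a minimal sketch that collects the stripped lines into a list instead of printing them:

# build a list of stripped lines without reading the whole file at once
lines = [line.rstrip() for line in open('file.txt')]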

Let's say we have file1.txt:

hello35
1234123
1234123
hello32
2492wow
1234125
1251234
1234123
1234123
2342bye
1234125
1251234
1234123
1234123
1234125
1251234
1234123

file2.txt:

1234125
1251234
1234123
hello35
2492wow
1234125
1251234
1234123
1234123
hello32
1234125
1251234
1234123
1234123
1234123
1234123
2342bye

and so on, through file5.txt:

1234123
1234123
1234125
1251234
1234123
1234123
1234123
1234125
1251234
1234125
1251234
1234123
1234123
hello35
hello32
2492wow
2342bye

I suggest using a hierarchical parallel map to read your files quickly. A fork of multiprocessing (called pathos.multiprocessing) can do it:

>>> import pathos
>>> thpool = pathos.multiprocessing.ThreadingPool()
>>> mppool = pathos.multiprocessing.ProcessingPool()
>>> 
>>> def rstrip(line):
...     return line.rstrip()
... 
>>> # get your list of files
>>> fnames = ['file1.txt', 'file2.txt', 'file3.txt', 'file4.txt', 'file5.txt']
>>> # open the files
>>> files = (open(name, 'r') for name in fnames)
>>> # read each file in asynchronous parallel
>>> # while reading and stripping each line in parallel
>>> res = thpool.amap(mppool.map, [rstrip]*len(fnames), files)
>>> # get the result when it's done
>>> res.ready()
True
>>> data = res.get()
>>> # if not using a files iterator -- close each file by uncommenting the next line
>>> # files = [file.close() for file in files]
>>> data[0]
['hello35', '1234123', '1234123', 'hello32', '2492wow', '1234125', '1251234', '1234123', '1234123', '2342bye', '1234125', '1251234', '1234123', '1234123', '1234125', '1251234', '1234123']
>>> data[1]
['1234125', '1251234', '1234123', 'hello35', '2492wow', '1234125', '1251234', '1234123', '1234123', 'hello32', '1234125', '1251234', '1234123', '1234123', '1234123', '1234123', '2342bye']
>>> data[-1]
['1234123', '1234123', '1234125', '1251234', '1234123', '1234123', '1234123', '1234125', '1251234', '1234125', '1251234', '1234123', '1234123', 'hello35', 'hello32', '2492wow', '2342bye']
However, if you want to check how many files are left to finish, you may want to use an "iterated" map (imap) instead of the "asynchronous" map (amap). See this post for details:
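
A minimal sketch of the idea, reusing the fnames list from above (count_lines is a hypothetical stand-in for whatever per-file processing you need):

import pathos

pool = pathos.multiprocessing.ProcessingPool()

def count_lines(name):
    # each worker opens its own file and iterates line by line,
    # so whole files never have to be shipped between processes
    with open(name, 'r') as f:
        return sum(1 for _ in f)

# imap yields results as they complete, so progress can be reported
for done, result in enumerate(pool.imap(count_lines, fnames), 1):
    print("Finished {} of {} files".format(done, len(fnames)))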


Get pathos here:
Comments:

Please fix your indentation, and show us the multiprocessing code you are using.

Just iterate over the open file; splitting at end of line is the default behavior. Also, it looks like you are parsing CSV files, have you seen the csv module?

Could you post the code related to the multiprocessing/multithreading (which one is it?)

@PauloScardine I tried the csv module but ran into the same problem.

Note that multithreading is indicated if the task appears to be I/O bound, and multiprocessing for CPU-bound tasks. You can read from several files at once without multithreading; in fact, just opening several files and iterating over them in the same thread is probably better. From the docs: when opening a file, it's preferable to use open() instead of invoking the file constructor directly; file is more suited to type testing (for example, writing isinstance(f, file)).

Thanks, but I want to do more than just strip the lines in parallel. Will this help me avoid the memory problems, i.e. should I simply use this code to generate the list data from each file and then do the multiprocessing?

If you want to do more than just split lines, then you just need to modify the rstrip function I provided. You can replace my rstrip function with your data-processing function. The point is that this code reads lines from multiple files in parallel, one line at a time... how you extend it afterwards is up to you. If the collective data in the files is very large, then you can't read the data in like I do... you should augment rstrip in the map call to process the data, or apply a reducer (like sum, reduce_my_data, or whatever).

But I ran into the following problem: No distributions at all found for pyre==0.8.2.0-pathos (from pathos==0.2a1.dev0).

If you are limited in the number of files you can have open at once, then you can make a simple modification that splits fnames and files into blocks of 500 or 100 files. You can put my code above inside a for loop or a blocking map function (see the sketch below).

@Navonod how did you solve the pyre==0.8.2.0-pathos (from pathos==0.2a1.dev0) problem? Revisiting my own question: pyre==0.8.2.0-pathos got installed during the failed pip install, so it was already present; leaving this here for the reader's convenience.
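
A minimal sketch of that chunking idea, reusing the thpool, mppool, and rstrip names from the answer above (the block size of 100 is arbitrary):

blocksize = 100
data = []
for i in range(0, len(fnames), blocksize):
    block = fnames[i:i + blocksize]
    # open only this block's files, so at most blocksize files are open at once
    files = [open(name, 'r') for name in block]
    res = thpool.amap(mppool.map, [rstrip] * len(block), files)
    data.extend(res.get())  # res.get() blocks until this block is finished
    for f in files:
        f.close()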