Python: How to have different threads read different parts of a file

python multithreading

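I have a file with about 3 million lines. Each line contains some data that I want to parse and post to a remote service call.

If I read the file sequentially, the whole program takes a very long time to finish.

I am thinking of starting a thread pool in which each thread iterates over different lines of the file (for example, thread 1 reads lines 1 to 10, thread 2 reads lines 11 to 20, and so on). This is a typical map/reduce problem. Is there a quick way to do this in Python, and is there any library that can help?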

If you read the file line by line, splitting the work across threads in Python is not easy, because seek() needs to know the byte offset of each line.
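One way around the byte-offset problem is to pick approximate split points and then move each one forward to the next newline, so every range starts and ends on a line boundary. The following is a minimal Python 3 sketch of that idea, not code from the answer; the names chunk_ranges and read_range are hypothetical, and it assumes a UTF-8 file with newline-terminated lines.

```python
import os

def chunk_ranges(path, num_chunks):
    """Return contiguous (start, end) byte ranges aligned to line boundaries."""
    size = os.path.getsize(path)
    if size == 0:
        return []
    approx = max(1, size // num_chunks)  # target chunk size in bytes
    ranges = []
    with open(path, "rb") as f:
        start = 0
        while start < size:
            # Jump near the target split point, then finish the current line.
            f.seek(min(start + approx, size))
            f.readline()
            end = f.tell()
            ranges.append((start, end))
            start = end
    return ranges

def read_range(path, start, end):
    """Yield the decoded lines stored in bytes [start, end) of the file."""
    with open(path, "rb") as f:
        f.seek(start)
        while f.tell() < end:
            yield f.readline().decode("utf-8").rstrip("\n")
```

Each worker can then be handed one (start, end) pair and call read_range on its own file handle, so no two workers share a file position.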

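Another approach is to split the file first, for example with the split command on Linux, and then start several threads that each process one of the resulting files.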
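If an external tool is not available, the split step can also be done in Python. The sketch below (Python 3) writes fixed-size groups of lines to numbered part files and then processes the parts on a thread pool. The helper names, the part_ prefix, and the line-counting worker are all placeholders chosen for illustration; the real worker would parse each line and post it to the service.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

def split_file(path, lines_per_part, prefix="part_"):
    """Write consecutive groups of lines to prefix0000, prefix0001, ..."""
    part_names = []
    with open(path, "rb") as src:
        for index in itertools.count():
            batch = list(itertools.islice(src, lines_per_part))
            if not batch:
                break  # source file exhausted
            name = "%s%04d" % (prefix, index)
            with open(name, "wb") as dst:
                dst.writelines(batch)
            part_names.append(name)
    return part_names

def process_part(name):
    """Placeholder worker: count the lines in one part file."""
    with open(name, "rb") as f:
        return sum(1 for _ in f)

def process_parts(part_names, max_workers=4):
    """Run process_part on every part file using a pool of threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_part, part_names))
```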

Because of the GIL, multithreading may not help you. I think you could try multiprocessing instead:

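from multiprocessing import Pool

def process(lines):
    # Placeholder worker: handle one slice of the file's lines.
    for line in lines:
        print(line.rstrip('\n'))

if __name__ == '__main__':
    total = 2  # number of worker processes
    with open('file') as fh:
        lines = fh.readlines()
    # Ceiling division so that `total` slices cover every line.
    step = max(1, -(-len(lines) // total))
    chunks = [lines[i:i + step] for i in range(0, len(lines), step)]
    pool = Pool(total)
    pool.map(process, chunks)
    pool.close()
    pool.join()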
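Reading 3 million lines into memory with readlines() may be unnecessary. Since the asker says result order does not matter, a variant is to stream the open file into the pool with imap_unordered, which hands out lines in batches of chunksize. This is a Python 3 sketch under assumptions: parse_line stands in for the real parsing and posting, and process_lines and main are hypothetical names.

```python
from multiprocessing import Pool

def parse_line(line):
    """Split one tab-separated line into fields (placeholder for the real work)."""
    return line.rstrip("\n").split("\t")

def process_lines(pool, lines, chunksize=1000):
    """Parse lines on any pool that offers imap_unordered; order is arbitrary."""
    return list(pool.imap_unordered(parse_line, lines, chunksize=chunksize))

def main(path, workers=4):
    """Stream the file at `path` through a process pool without readlines()."""
    with open(path) as f, Pool(workers) as pool:
        return process_lines(pool, f)

# On platforms that spawn worker processes, call main() only under
# `if __name__ == "__main__":` so workers can import this module safely.
```

A larger chunksize reduces inter-process overhead per line; a smaller one spreads uneven work more evenly across workers.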

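You can write your own MapReduce to suit your needs.

See the code below. It reads multiple files and produces output. You can modify it so that each worker processes a different segment of a single file.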

import collections
import itertools
import multiprocessing

class SimpleMapReduce(object):

    def __init__(self, map_func, reduce_func, num_workers=None):
        """
        map_func

          Function to map inputs to intermediate data. Takes as
          argument one input value and returns a tuple with the key
          and a value to be reduced.

        reduce_func

          Function to reduce partitioned version of intermediate data
          to final output. Takes as argument a key as produced by
          map_func and a sequence of the values associated with that
          key.

        num_workers

          The number of workers to create in the pool. Defaults to the
          number of CPUs available on the current host.
        """
        self.map_func = map_func
        self.reduce_func = reduce_func
        self.pool = multiprocessing.Pool(num_workers)

    def partition(self, mapped_values):
        """Organize the mapped values by their key.
        Returns an unsorted sequence of tuples with a key and a sequence of values.
        """
        partitioned_data = collections.defaultdict(list)
        for key, value in mapped_values:
            partitioned_data[key].append(value)
        return partitioned_data.items()

    def __call__(self, inputs, chunksize=1):
        """Process the inputs through the map and reduce functions given.

        inputs
          An iterable containing the input data to be processed.

        chunksize=1
          The portion of the input data to hand to each worker.  This
          can be used to tune performance during the mapping phase.
        """
        map_responses = self.pool.map(self.map_func, inputs, chunksize=chunksize)
        partitioned_data = self.partition(itertools.chain(*map_responses))
        reduced_values = self.pool.map(self.reduce_func, partitioned_data)
        return reduced_values


import multiprocessing
import string

from multiprocessing_mapreduce import SimpleMapReduce

def file_to_words(filename):
    """Read a file and return a sequence of (word, occurrences) values.
    """
    STOP_WORDS = set([
            'a', 'an', 'and', 'are', 'as', 'be', 'by', 'for', 'if', 'in', 
            'is', 'it', 'of', 'or', 'py', 'rst', 'that', 'the', 'to', 'with',
            ])
    TR = string.maketrans(string.punctuation, ' ' * len(string.punctuation))

    print multiprocessing.current_process().name, 'reading', filename
    output = []

    with open(filename, 'rt') as f:
        for line in f:
            if line.lstrip().startswith('..'): # Skip rst comment lines
                continue
            line = line.translate(TR) # Strip punctuation
            for word in line.split():
                word = word.lower()
                if word.isalpha() and word not in STOP_WORDS:
                    output.append( (word, 1) )
    return output


def count_words(item):
    """Convert the partitioned data for a word to a
    tuple containing the word and the number of occurrences.
    """
    word, occurrences = item
    return (word, sum(occurrences))


if __name__ == '__main__':
    import operator
    import glob

    input_files = glob.glob('*.rst')

    mapper = SimpleMapReduce(file_to_words, count_words)
    word_counts = mapper(input_files)
    word_counts.sort(key=operator.itemgetter(1))
    word_counts.reverse()

    print '\nTOP 20 WORDS BY FREQUENCY\n'
    top20 = word_counts[:20]
    longest = max(len(word) for word, count in top20)
    for word, count in top20:
        print '%-*s: %5s' % (longest+1, word, count)


$ python multiprocessing_wordcount.py

PoolWorker-1 reading basics.rst
PoolWorker-3 reading index.rst
PoolWorker-4 reading mapreduce.rst
PoolWorker-2 reading communication.rst

TOP 20 WORDS BY FREQUENCY

process         :    80
starting        :    52
multiprocessing :    40
worker          :    37
after           :    33
poolworker      :    32
running         :    31
consumer        :    31
processes       :    30
start           :    28
exiting         :    28
python          :    28
class           :    27
literal         :    26
header          :    26
pymotw          :    26
end             :    26
daemon          :    22
now             :    21
func            :    20

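Are the lines all the same length, or at least of a predictable length?

Can I take it that the order of the results does not need to match the order of the lines in the file?

They vary in length, but each line is terminated by a newline character. It is a TSV file with 3 fields.

@thefourtheye, that's right, order doesn't matter here. Assume all of these lines need to be stored in a database.

I suggest profiling your existing code before trying to optimize it. Depending on where it spends most of its time, multithreading may have no effect at all, or may even make it slower.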
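The profiling suggestion in the last comment could be tried with the standard library's cProfile module. The sketch below (Python 3) wraps any callable and returns a short report sorted by cumulative time; profile_call is a hypothetical helper name, and the function you pass in would be whatever currently reads and posts the file.

```python
import cProfile
import io
import pstats

def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Show the 10 entries with the highest cumulative time.
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()
```

If most of the cumulative time sits in network calls, a thread pool is likely to help; if it sits in parsing, separate processes are the more promising route.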