Multiprocessing issues in Python with BeautifulSoup 4

I am trying to use most or all of my cores to process files faster; that could mean reading multiple files at once, or using multiple cores to read a single file.

I would prefer using multiple cores to read a single file before moving on to the next one.

I tried the code below, but I can't seem to get it to use all the cores.

The code below basically retrieves the *.txt files in the current directory; each file contains HTML wrapped in JSON.
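For illustration, judging from the json.loads call and the eachHtml['result'] lookup in the code, each file is assumed to hold a JSON array of objects with an HTML string under a 'result' key, something like:

    [
        {"result": "<html>...<input id=\"GD_NO\" value=\"12345\">...</html>"},
        {"result": "<html>...</html>"}
    ]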

    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    import requests
    import json
    import urlparse
    import os
    from bs4 import BeautifulSoup
    from multiprocessing.dummy import Pool  # This is a thread-based Pool
    from multiprocessing import cpu_count

    def crawlTheHtml(htmlsource):
        htmlArray = json.loads(htmlsource)
        for eachHtml in htmlArray:
            soup = BeautifulSoup(eachHtml['result'], 'html.parser')
            if all(['another text to search' not in str(soup),
                   'text to search' not in str(soup)]):
                try:
                    gd_no = ''
                    try:
                        gd_no = soup.find('input', {'id': 'GD_NO'})['value']
                    except:
                        pass

                    r = requests.post('domain api address', data={
                        'gd_no': gd_no,
                        })
                except:
                    pass


    if __name__ == '__main__':
        pool = Pool(cpu_count() * 2)
        print(cpu_count())
        fileArray = []
        for filename in os.listdir(os.getcwd()):
            if filename.endswith('.txt'):
                fileArray.append(filename)
        for file in fileArray:
            with open(file, 'r') as myfile:
                htmlsource = myfile.read()
                results = pool.map(crawlTheHtml(htmlsource), f)
On top of that, I'm not sure what f refers to.

Question 1:

What am I doing wrong that keeps me from fully utilizing all the cores/threads?

Question 2:


Is there a better way to use try: except:? Sometimes the value is not present on the page, and that causes the script to stop. With multiple variables to handle, I end up writing a lot of try & except statements.

To answer question 1, your problem is this line:

from multiprocessing.dummy import Pool  # This is a thread-based Pool
Quoting from another answer:

When you use multiprocessing.dummy, you are using threads, not processes:

multiprocessing.dummy replicates the API of multiprocessing but is no more than a wrapper around the threading module.

That means you are restricted to one thread executing CPU-bound operations at a time, which keeps you from fully utilizing your CPUs. If you want full parallelism across all available cores, you need to solve the pickling problems you ran into with multiprocessing.Pool.

What you need to do is:

from multiprocessing import Pool
from multiprocessing import freeze_support
and then you need to do:

if __name__ == '__main__':
    freeze_support()

and then you can carry on with the rest of your script.
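Putting that together with the code from the question, a minimal sketch of the restructured script could look like this (it reuses the imports and the crawlTheHtml function from the question; the iterable handed to pool.map takes the place of the mysterious f, and pool.map receives the function itself rather than the result of calling it):

from multiprocessing import Pool, freeze_support, cpu_count

if __name__ == '__main__':
    freeze_support()
    pool = Pool(cpu_count())
    # Gather the contents of every *.txt file first ...
    htmlSources = []
    for filename in os.listdir(os.getcwd()):
        if filename.endswith('.txt'):
            with open(filename, 'r') as myfile:
                htmlSources.append(myfile.read())
    # ... then hand the worker function plus the iterable to pool.map,
    # which distributes one file per worker process.
    results = pool.map(crawlTheHtml, htmlSources)
    pool.close()
    pool.join()

Note that crawlTheHtml stays at module level, so it can be pickled and sent to the workers. This parallelizes across files, one file per process; if you would rather use several cores on a single file, json.loads the file in the parent instead and pool.map a per-entry worker over htmlArray.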

"Solve the pickling problems" - can you elaborate on that? :)

Because processes do not share memory, they need a way to exchange information, and they do it with the pickle module through a mechanism called serialization: a Python object is serialized into a byte stream in one process before being sent to another process, which deserializes the byte stream back into a Python object.
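As a tiny illustration of that round-trip (a sketch using the standard pickle API, not code from the original answer):

import pickle

payload = {'gd_no': '12345'}     # any picklable Python object
blob = pickle.dumps(payload)     # serialize the object into a byte stream
restored = pickle.loads(blob)    # deserialize the byte stream back into an object
assert restored == payload

The fuller example below shows a process Pool whose workers consume items from a shared Queue: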
from  multiprocessing import Pool, Queue
from os import getpid
from time import sleep
from random import random

MAX_WORKERS=10

class Testing_mp(object):
    def __init__(self):
        """
        Initiates a queue, a pool and a temporary buffer, used only
        when the queue is full.
        """
        self.q = Queue()
        self.pool = Pool(processes=MAX_WORKERS, initializer=self.worker_main,)
        self.temp_buffer = []

    def add_to_queue(self, msg):
        """
        If the queue is full, put the message in a temporary buffer.
        If the queue is not full, add the message to the queue.
        If the buffer is not empty and the queue is not full,
        move messages from the buffer back into the queue.
        """
        if self.q.full():
            self.temp_buffer.append(msg)
        else:
            self.q.put(msg)
            if len(self.temp_buffer) > 0:
                self.add_to_queue(self.temp_buffer.pop())

    def write_to_queue(self):
        """
        This function writes some messages to the queue.
        """
        for i in range(50):
            self.add_to_queue("First item for loop %d" % i)
            # Not really needed, just to show that some elements can be added
            # to the queue whenever you want!
            sleep(random()*2)
            self.add_to_queue("Second item for loop %d" % i)
            # Not really needed, just to show that some elements can be added
            # to the queue whenever you want!
            sleep(random()*2)

    def worker_main(self):
        """
        Waits indefinitely for an item to be written in the queue.
        Finishes when the parent process terminates.
        """
        print "Process {0} started".format(getpid())
        while True:
            # If queue is not empty, pop the next element and do the work.
            # If queue is empty, wait indefinitely until an element gets in the queue.
            item = self.q.get(block=True, timeout=None)
            print "{0} retrieved: {1}".format(getpid(), item)
            # simulate some random length operations
            sleep(random())

# Warning from Python documentation:
# Functionality within this package requires that the __main__ module be
# importable by the children. This means that some examples, such as the
# multiprocessing.Pool examples will not work in the interactive interpreter.
if __name__ == '__main__':
    mp_class = Testing_mp()
    mp_class.write_to_queue()
    # Waits a bit for the child processes to do some work
    # because when the parent exits, children are terminated.
    sleep(5)
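As for question 2: instead of wrapping every lookup in try/except, you can rely on BeautifulSoup's find() returning None when there is no match and test for it explicitly. A sketch (my suggestion, not part of the original answer) of the lookup inside crawlTheHtml:

gd_no = ''
gd_no_input = soup.find('input', {'id': 'GD_NO'})
if gd_no_input is not None and gd_no_input.has_attr('value'):
    gd_no = gd_no_input['value']

This keeps one small check per variable and avoids bare except: clauses, which swallow unrelated errors and make failures hard to debug.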