Python requests and multiprocessing


So I'm trying to use requests and BeautifulSoup on multiple websites at the same time, but for some reason I can't get it to work. Here is a complete example:

import multiprocessing as mp
import requests
from bs4 import BeautifulSoup
from random import randint

# Define an output queue


class Spider(object):
    """docstring for Spider"""
    def __init__(self):
        super(Spider, self).__init__()

    # define an example function
    def rand_string(length, output):
        random_post=randint(1000000,9999999)
        response=requests.get('https://stackoverflow.com/questions/'+str(random_post))
        soup=BeautifulSoup(response.content,'lxml')
        try:
            title=soup.find('a',{'class':'question-hyperlink'}).string
        except:
            title="not found"

        output.put(title)

    # Setup a list of processes that we want to run
    def run(self):

        output = mp.Queue()
        processes = [mp.Process(target=Spider.rand_string, args=(x, output)) for x in range(10)]

        for p in processes:
            p.start()

        # Exit the completed processes

        for p in processes:
            p.join()

        # Get process results from the output queue

        results = [output.get() for p in processes]
        print(results)

# Run processes

if __name__ == '__main__':

    spider=Spider()
    spider.run()

I added a bunch of debug print statements to follow your process flow, and came to some conclusions:

  • You can sometimes hit bs4's recursion depth limit
  • The answer you linked earlier (in the comments) really is related to your problem
  • Windows not having fork() is a huge pain
  • Your main bug is in rand_string(), in this line:

    title=soup.find('a',{'class':'question-hyperlink'}).string
    
    This returns a bs4.element.NavigableString, not a str. When that gets passed to mp.Queue.put(), the attempt to pickle it so it can be sent through the queue's internal pipe fails with a recursion error, and the queue stalls. I'm not sure whether a bs4 element can be sent through a pickling pipe at all (maybe if you converted the reference cycles to weakrefs?), but it's always easier to send simple Python objects. I also moved the queue's creation into the main context (outside
    spider.run()
    ), though that isn't strictly necessary since it's only executed by the main thread. Here's my debug code from the final iteration, so you can follow my testing approach:

    from multiprocessing import Process, Queue, current_process
    import requests
    from bs4 import BeautifulSoup
    from random import randint
    import sys
    #sys.setrecursionlimit(1000)
    
    class Spider(object):
        """docstring for Spider"""
    
        # define an example function
        @staticmethod
        def rand_string(length, output):
    
            print("{} entry point".format(current_process().name))
            random_post=randint(1000000,9999999)
            response=requests.get('https://stackoverflow.com/questions/'+str(random_post))
            print("{} got request response".format(current_process().name))
            soup=BeautifulSoup(response.content,'lxml')
            try:
                title = soup.find('a',{'class':'question-hyperlink'}).string
            except:
                title = "not found"
    
            print("{} got title: '{}' of type: {}".format(current_process().name, title, type(title)))
    
            ###### This did it ######
            title = str(title) #fix or fake news?
    
            output.put([title,current_process().name])
            output.close()
            print("{} exit point".format(current_process().name))
    
    
        # Setup a list of processes that we want to run
    #    @staticmethod
        def run(self, outq):
            processes = []
            for x in range(5):
                    processes.append(Process(target=self.rand_string, name="process_{}".format(x), args=(x, outq,),) )
                    print("creating process_{}".format(x))
    
            for p in processes:
                p.start()
                print("{} started".format(p.name))
    
            # Exit the completed processes
            for p in processes:
                p.join()
                print("successuflly joined {}".format(p.name))
    
            # Get process results from the output queue
            print("joined all workers")
    #        return None
            out = []
            while not outq.empty():
                result = outq.get()
                print("got {}".format(result))
                out.append(result)
            return out
    
    # Run processes
    if __name__ == '__main__':
        outq = Queue()
        spider=Spider()
        out = spider.run(outq)
        print("done")
    And here's the output from running that code:

    creating process_0
    creating process_1
    creating process_2
    creating process_3
    creating process_4
    process_0 started
    process_1 started
    process_2 started
    process_3 started
    process_4 started
    process_2 entry point
    process_2 got request response
    process_2 got title: 'not found' of type: <class 'str'>
    process_2 exit point
    process_0 entry point
    process_0 got request response
    process_0 got title: 'Starting Activity when video is finished playing' of type: <class 'bs4.element.NavigableString'>
    process_0 exit point
    successuflly joined process_0
    process_3 entry point
    process_3 got request response
    process_3 got title: 'Just don't understand the point of these typedefs' of type: <class 'bs4.element.NavigableString'>
    process_3 exit point
    process_1 entry point
    process_1 got request response
    process_1 got title: 'Import button + File browse field in admin product grid in magento' of type: <class 'bs4.element.NavigableString'>
    process_1 exit point
    process_4 entry point
    process_4 got request response
    process_4 got title: 'How can I do a query with subselect' of type: <class 'bs4.element.NavigableString'>
    process_4 exit point
    successuflly joined process_1
    successuflly joined process_2
    successuflly joined process_3
    successuflly joined process_4
    joined all workers
    got ['not found', 'process_2']
    got ['Starting Activity when video is finished playing', 'process_0']
    got ["Just don't understand the point of these typedefs", 'process_3']
    got ['Import button + File browse field in admin product grid in magento', 'process_1']
    got ['How can I do a query with subselect', 'process_4']
    done
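The recursion error isn't really a bs4 quirk: pickle hits the interpreter's recursion limit on any object whose attributes chain deep enough, and a NavigableString drags its whole parent chain along. Here's a minimal, stdlib-only sketch of both the failure and the str() fix; LinkedStr is a hypothetical stand-in for bs4's parented strings, not an actual bs4 class:

```python
import pickle
import sys

# Hypothetical stand-in for bs4's NavigableString: a str subclass that,
# like a bs4 node, keeps a reference chain back up through its parents.
class LinkedStr(str):
    def __new__(cls, value, parent=None):
        obj = super().__new__(cls, value)
        obj.parent = parent
        return obj

# Build a parent chain deeper than the recursion limit, like a big parse tree.
node = None
for _ in range(sys.getrecursionlimit() * 2):
    node = LinkedStr("x", parent=node)

try:
    # pickle walks the parent chain recursively and blows the stack
    pickle.dumps(node)
except RecursionError:
    print("RecursionError while pickling")

# Converting to a plain str drops the parent chain, so it pickles fine
data = pickle.dumps(str(node))
print(pickle.loads(data))
```

mp.Queue.put() pickles its argument in exactly this way before pushing it through the pipe, which is why title = str(title) unsticks the workers.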
    Have you tried making sure everything runs fine single-threaded? Multiprocessing can make debugging the child processes difficult. Otherwise, there's nothing obviously wrong in the code you posted, so your problem is elsewhere.
    @Aaron, single-threaded it runs fine, but I'm not happy with the execution time. None of the functions use the same resources/files.
    Hmm, I can't see anything wrong with the code you posted. Are all the functions similar enough that you could condense them into a single function and just supply an extra argument (the site to scrape)? If so, that's what I'd personally try, using the already-built multiprocessing.Pool.
    @Aaron I think it has to do with me running on a Windows machine, but I haven't figured out how to change the code yet... And they say heroes don't exist!! Thx a lot, I tried 1231441 things, and I even remember reading about pickling Queue objects now, but I didn't think of bs4.Tag
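For reference, the multiprocessing.Pool idea from those comments could look something like this sketch; fetch_title() is a hypothetical stand-in for the requests + BeautifulSoup work, mocked out here so the pooling pattern is the focus. Pool.map() does the queue and pickle plumbing for you, but the workers still have to return plain picklable objects:

```python
from multiprocessing import Pool

# Hypothetical worker standing in for the requests + BeautifulSoup code;
# the real version would fetch the page, parse it, and return str(title).
def fetch_title(post_id):
    return "title for post {}".format(post_id)

if __name__ == '__main__':
    post_ids = [1000000 + n for n in range(5)]
    # Pool pickles arguments out to the workers and results back; plain
    # str return values avoid the NavigableString problem entirely.
    with Pool(processes=5) as pool:
        results = pool.map(fetch_title, post_ids)
    print(results)
```

On Windows (no fork()) this still needs the if __name__ == '__main__': guard, since each worker re-imports the main module under the spawn start method.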