Python requests and multiprocessing
Tags: python, beautifulsoup, queue, multiprocessing
I am trying to use requests and BeautifulSoup on several websites at the same time, but for some reason I cannot get it to work.
Here is a complete example:
import multiprocessing as mp
import requests
from bs4 import BeautifulSoup
from random import randint

# Define an output queue
class Spider(object):
    """docstring for Spider"""
    def __init__(self):
        super(Spider, self).__init__()

    # define a example function
    def rand_string(length, output):
        random_post=randint(1000000,9999999)
        response=requests.get('https://stackoverflow.com/questions/'+str(random_post))
        soup=BeautifulSoup(response.content,'lxml')
        try:
            title=soup.find('a',{'class':'question-hyperlink'}).string
        except:
            title="not found"
        output.put(title)

    # Setup a list of processes that we want to run
    def run(self):
        output = mp.Queue()
        processes = [mp.Process(target=Spider.rand_string, args=(x, output)) for x in range(10)]
        for p in processes:
            p.start()
        # Exit the completed processes
        for p in processes:
            p.join()
        # Get process results from the output queue
        results = [output.get() for p in processes]
        print(results)

# Run processes
if __name__ == '__main__':
    spider=Spider()
    spider.run()
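I added a bunch of debugging print statements to trace your process flow and came to a few conclusions:
fork() is a huge pain (Windows does not have it).
In rand_string():
    title=soup.find('a',{'class':'question-hyperlink'}).string
This returns a bs4.element.NavigableString rather than a str. When it is passed to mp.Queue.put(), the attempt to pickle it so that it can be sent through the internal pipe fails with a recursion error, which causes the queue to hang. I'm not sure whether bs4 elements can be sent through a pickling pipe at all (perhaps if you converted the reference loops to weakrefs?), but it is always much easier to send simple Python objects. I also moved the creation of the queue into the main context (outside spider.run()), although that should not be strictly necessary, since it is only executed by the main thread. Here is my debugging code in its final iteration, so you can follow my testing method: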
from multiprocessing import Process, Queue, current_process
import requests
from bs4 import BeautifulSoup
from random import randint
import sys

#sys.setrecursionlimit(1000)

class Spider(object):
    """docstring for Spider"""

    # define a example function
    @staticmethod
    def rand_string(length, output):
        print("{} entry point".format(current_process().name))
        random_post=randint(1000000,9999999)
        response=requests.get('https://stackoverflow.com/questions/'+str(random_post))
        print("{} got request response".format(current_process().name))
        soup=BeautifulSoup(response.content,'lxml')
        try:
            title = soup.find('a',{'class':'question-hyperlink'}).string
        except:
            title = "not found"
        print("{} got title: '{}' of type: {}".format(current_process().name, title, type(title)))

        ###### This did it ######
        title = str(title) #fix or fake news?

        output.put([title,current_process().name])
        output.close()
        print("{} exit point".format(current_process().name))

    # Setup a list of processes that we want to run
    # @staticmethod
    def run(self, outq):
        processes = []
        for x in range(5):
            processes.append(Process(target=self.rand_string, name="process_{}".format(x), args=(x, outq,),) )
            print("creating process_{}".format(x))

        for p in processes:
            p.start()
            print("{} started".format(p.name))

        # Exit the completed processes
        for p in processes:
            p.join()
            print("successuflly joined {}".format(p.name))

        # Get process results from the output queue
        print("joined all workers")
        # return None
        out = []
        while not outq.empty():
            result = outq.get()
            print("got {}".format(result))
            out.append(result)
        return out

# Run processes
if __name__ == '__main__':
    outq = Queue()
    spider=Spider()
    out = spider.run(outq)
    print("done")
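The str(title) conversion in the worker can be generalised a little. The following is only a minimal sketch of that idea, not part of the original debugging code: to_plain and FakeNavigableString are hypothetical names, and FakeNavigableString is merely a stand-in for bs4.element.NavigableString (which is a str subclass) so that the example runs without bs4. The helper collapses str subclasses to a plain str and then pickles the payload once inside the worker, so that an unpicklable value raises there with a traceback instead of being lost on the way through the queue.

```python
import pickle


def to_plain(value):
    """Return a payload that is safe to hand to Queue.put(), or raise early.

    Only top-level str subclasses are converted; nested values are checked
    by the trial pickle but not rewritten.
    """
    if isinstance(value, str) and type(value) is not str:
        value = str(value)  # collapse the subclass to an exact str
    pickle.dumps(value)     # raises here, in the caller, if unpicklable
    return value


class FakeNavigableString(str):
    """Hypothetical stand-in for bs4.element.NavigableString."""


if __name__ == "__main__":
    title = to_plain(FakeNavigableString("How can I do a query with subselect"))
    print(type(title), title)
```

In the worker this would be used as output.put(to_plain(title)) in place of the bare output.put(title).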
And here is the output from running the Spider debugging code:
creating process_0
creating process_1
creating process_2
creating process_3
creating process_4
process_0 started
process_1 started
process_2 started
process_3 started
process_4 started
process_2 entry point
process_2 got request response
process_2 got title: 'not found' of type: <class 'str'>
process_2 exit point
process_0 entry point
process_0 got request response
process_0 got title: 'Starting Activity when video is finished playing' of type: <class 'bs4.element.NavigableString'>
process_0 exit point
successuflly joined process_0
process_3 entry point
process_3 got request response
process_3 got title: 'Just don't understand the point of these typedefs' of type: <class 'bs4.element.NavigableString'>
process_3 exit point
process_1 entry point
process_1 got request response
process_1 got title: 'Import button + File browse field in admin product grid in magento' of type: <class 'bs4.element.NavigableString'>
process_1 exit point
process_4 entry point
process_4 got request response
process_4 got title: 'How can I do a query with subselect' of type: <class 'bs4.element.NavigableString'>
process_4 exit point
successuflly joined process_1
successuflly joined process_2
successuflly joined process_3
successuflly joined process_4
joined all workers
got ['not found', 'process_2']
got ['Starting Activity when video is finished playing', 'process_0']
got ["Just don't understand the point of these typedefs", 'process_3']
got ['Import button + File browse field in admin product grid in magento', 'process_1']
got ['How can I do a query with subselect', 'process_4']
done
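One caveat about the collection step: the multiprocessing documentation describes Queue.empty() as unreliable, and it warns that joining a process before everything it put on a queue has been consumed can deadlock. A hedged alternative, sketched below, is to read a known number of results with a timeout first and join afterwards. The names collect and worker are hypothetical, and worker is a dummy that stands in for the scraping function so that the sketch needs no network access.

```python
import multiprocessing as mp
import queue


def collect(outq, expected, timeout=10.0):
    """Read up to `expected` items, waiting at most `timeout` seconds for each."""
    results = []
    for _ in range(expected):
        try:
            results.append(outq.get(timeout=timeout))
        except queue.Empty:
            # A worker died or hung: stop waiting instead of blocking forever.
            break
    return results


def worker(index, outq):
    # Dummy stand-in for the scraper; it sends only plain Python objects.
    outq.put(["title {}".format(index), mp.current_process().name])


if __name__ == "__main__":
    outq = mp.Queue()
    procs = [mp.Process(target=worker, args=(i, outq)) for i in range(5)]
    for p in procs:
        p.start()
    results = collect(outq, expected=len(procs))  # drain the queue first
    for p in procs:
        p.join()                                  # then join the workers
    print(results)
```

If fewer items than expected come back, at least one worker failed to deliver, which is a clearer signal than a hang.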
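Comments:
Have you tried making sure everything works when run single-threaded? Multiprocessing can make child processes hard to debug. Otherwise, nothing in the code you posted is obviously wrong, so your problem lies elsewhere.
@Aaron, it runs fine single-threaded, but I am not happy with the execution time. None of the functions use the same resources/files.
Hmm, I can't see anything wrong with the code you posted. Are all the functions similar enough that you could condense them into a single function and pass it one more argument (the website to search)? If so, I would personally try that and use the ready-made multiprocessing.pool.
@Aaron I think it has to do with me running on a Windows machine, but I don't know how to change the code yet...
And they said heroes don't exist!! Thx a lot. I tried like 1231441 things, and I even remember now reading about pickling queue objects, but I never thought of bs4.tag.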
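The suggestion in the comments, one function that takes the website as an argument and is handed to a ready-made pool, might look roughly like the sketch below. Two things here are assumptions rather than anything from the thread: it uses multiprocessing.pool.ThreadPool, the thread-based pool from that module, instead of a process pool (threads need no pickling, and the work is network-bound), and the names scrape_all and fake_fetch are hypothetical. fake_fetch stands in for the requests.get plus BeautifulSoup step so that the example runs offline.

```python
from multiprocessing.pool import ThreadPool


def scrape_all(urls, fetch, workers=4):
    """Call fetch(url) for every URL concurrently.

    Returns a list of (url, result) pairs in the same order as `urls`.
    """
    with ThreadPool(processes=workers) as pool:
        results = pool.map(fetch, urls)  # blocks until every call has returned
    return list(zip(urls, results))


def fake_fetch(url):
    # Stand-in for: download the page, parse it, return the title as a str.
    return "title of " + url.rsplit("/", 1)[-1]


if __name__ == "__main__":
    pages = ["https://example.com/a", "https://example.com/b"]
    for url, title in scrape_all(pages, fake_fetch, workers=2):
        print(url, "->", title)
```

A real fetch function would still do well to return plain str values, so that switching to a process pool later does not bring the pickling problem back.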