python: urlopen & threading too slow? Is there a faster way?

Tags: python, multithreading, performance, http, io

I'm writing a client that loads and parses multiple pages at once and sends data from each page to a server. If I only run one page processor at a time, things go reasonably well:

********** Round-trip (with 0 sends/0 loads) for (+0/.0/-0) was total 1.98s (1.60s load html, 0.24s parse, 0.00s on queue, 0.14s to process) **********
********** Round-trip (with 0 sends/0 loads) for (+0/.0/-0) was total 1.87s (1.59s load html, 0.25s parse, 0.00s on queue, 0.03s to process) **********
********** Round-trip (with 0 sends/0 loads) for (+0/.0/-0) was total 2.79s (1.78s load html, 0.28s parse, 0.00s on queue, 0.72s to process) **********
********** Round-trip (with 0 sends/1 loads) for (+0/.0/-0) was total 2.18s (1.70s load html, 0.34s parse, 0.00s on queue, 0.15s to process) **********
********** Round-trip (with 0 sends/1 loads) for (+0/.0/-0) was total 1.91s (1.47s load html, 0.21s parse, 0.00s on queue, 0.23s to process) **********
********** Round-trip (with 0 sends/1 loads) for (+0/.0/-0) was total 1.84s (1.59s load html, 0.22s parse, 0.00s on queue, 0.03s to process) **********
********** Round-trip (with 0 sends/0 loads) for (+0/.0/-0) was total 1.90s (1.67s load html, 0.21s parse, 0.00s on queue, 0.02s to process) **********
However, when running about 20 at once (each in its own thread), the HTTP traffic becomes incredibly slow:

********** Round-trip (with 2 sends/7 loads) for (+0/.0/-0) was total 23.37s (16.39s load html, 0.30s parse, 0.00s on queue, 6.67s to process) **********
********** Round-trip (with 2 sends/5 loads) for (+0/.0/-0) was total 20.99s (14.00s load html, 1.99s parse, 0.00s on queue, 5.00s to process) **********
********** Round-trip (with 4 sends/4 loads) for (+0/.0/-0) was total 17.89s (9.17s load html, 0.30s parse, 0.12s on queue, 8.31s to process) **********
********** Round-trip (with 3 sends/5 loads) for (+0/.0/-0) was total 26.22s (15.34s load html, 1.63s parse, 0.01s on queue, 9.24s to process) **********
The "load html" bit is the time it takes to read the HTML of the webpage I'm processing (resp = self.mech.open(url) through resp.read(); resp.close()). The "to process" bit is the round-trip time from this client to the server that processes it (fp = urllib2.urlopen(...); fp.read(); fp.close()). The "X sends/Y loads" bit is how many simultaneous sends to the server and loads from webpages were in flight when the request was issued.

I'm most concerned with the "to process" bit. The actual processing on the server only takes about 0.2s, and only 400 bytes are sent, so it's not a matter of using too much bandwidth. Interestingly, if, while all this simultaneous sending/loading and parsing is going on, I run a separate program that opens 5 threads and repeatedly does just the "to process" bit, it goes blazingly fast:

1 took 0.04s
1 took 1.41s in total
0 took 0.03s
0 took 1.43s in total
4 took 0.33s
2 took 0.49s
2 took 0.08s
2 took 0.01s
2 took 1.74s in total
3 took 0.62s
4 took 0.40s
3 took 0.31s
4 took 0.33s
3 took 0.05s
3 took 2.18s in total
4 took 0.07s
4 took 2.22s in total
In this standalone program, each "to process" takes only 0.01s-0.50s, far less than the 6-10s seen in the full version, and it isn't using fewer sending threads (it uses 5, and the full version is capped at 5).
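That standalone sender can be sketched along these lines. This is a rough reconstruction, not the poster's code: the timing and printing scheme is inferred from the log above, urllib.request stands in for Python 2's urllib2, and the URL/payload are placeholders.

```python
import threading
import time
from urllib.request import urlopen  # urllib2.urlopen in the original (Python 2)

def round_trip(url, payload):
    # The "to process" bit: send ~400 bytes and read the reply.
    fp = urlopen(url, payload)
    fp.read()
    fp.close()

def benchmark(n_threads, n_requests, do_request):
    """Run do_request() n_requests times in each of n_threads threads,
    recording how long every individual call took."""
    timings = []
    lock = threading.Lock()

    def worker(tid):
        start = time.time()
        for _ in range(n_requests):
            t0 = time.time()
            do_request()
            with lock:
                timings.append((tid, time.time() - t0))
        print("%d took %.2fs in total" % (tid, time.time() - start))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return timings
```

A call like benchmark(5, 5, lambda: round_trip(server_url, payload)) would produce per-thread totals in the style of the log above.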

That is, while the full version is running, a separate version sending the same 400-byte (+0/.0/-0) requests takes only 0.31s per request. So it's not that the machine I'm running on is maxed out... rather, the multiple simultaneous loads in other threads are slowing down the should-be-fast (and, in other programs running on the same machine, actually fast) sends in the other threads.

The sends are done via urllib2.urlopen, while the reads are done with mechanize (which ultimately uses a fork of urllib2.urlopen).

Is there a way to get the full program to run as fast as this mini standalone version, at least when it's sending the same things? I'm thinking of writing another program that just takes in what to send via a named pipe or something, so the sends are done in a separate process, but that seems silly somehow. Any suggestions are welcome.
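For what it's worth, the "sends in another process" idea doesn't require a named pipe; a multiprocessing.Queue feeding a dedicated sender process does the same job. A rough sketch, where send_one is a hypothetical stand-in for the urllib2.urlopen round trip:

```python
from multiprocessing import Process, Queue

def send_one(payload):
    # Hypothetical stand-in for: fp = urllib2.urlopen(...); fp.read(); fp.close()
    return len(payload)

def sender_loop(queue, results):
    # Dedicated sender process: drain payloads until the None sentinel arrives.
    while True:
        payload = queue.get()
        if payload is None:
            break
        results.put(send_one(payload))

def start_sender():
    # Returns the input queue, the results queue, and the sender process;
    # the main program just queue.put()s payloads, and the sends happen
    # outside its interpreter entirely.
    queue, results = Queue(), Queue()
    proc = Process(target=sender_loop, args=(queue, results))
    proc.start()
    return queue, results, proc
```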

Any suggestions on how to load many pages simultaneously more quickly (so the times look more like 1-3s instead of 10-20s) would also be welcome.


Edit: An additional note: I rely on mechanize's cookie-handling features, so any answer would ideally also provide a way to deal with those.
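On the cookie point, one possible approach (sketched here under the assumption that the jar can be handed across a process boundary via a file) is to save and reload a standard LWP cookie jar; http.cookiejar is Python 3's name for cookielib, which mechanize's cookie handling builds on, and mechanize's Browser can be pointed at such a jar with set_cookiejar. The path below is a placeholder.

```python
import http.cookiejar  # "cookielib" under Python 2

def save_cookies(jar, path):
    # LWPCookieJar serializes to a plain-text file another process can read.
    jar.save(path, ignore_discard=True, ignore_expires=True)

def load_cookies(path):
    jar = http.cookiejar.LWPCookieJar()
    jar.load(path, ignore_discard=True, ignore_expires=True)
    return jar

# In the mechanize-owning process, roughly (not run here):
#   jar = http.cookiejar.LWPCookieJar()
#   browser = mechanize.Browser()
#   browser.set_cookiejar(jar)
#   ... browse ...
#   save_cookies(jar, "/tmp/cookies.lwp")
# and the sending process picks the state up with load_cookies().
```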


Edit: I set the same thing up with a different configuration, opening only one page at a time while adding 10-20 things to the queue. Those got processed like a knife through butter; for example, here's the tail end of adding a whole batch:

********** Round-trip (with 4 sends/0 loads) for (+0/.0/-0) was total 1.17s (1.14s wait, 0.04s to process) **********
********** Round-trip (with 4 sends/0 loads) for (+0/.0/-0) was total 1.19s (1.16s wait, 0.03s to process) **********
********** Round-trip (with 4 sends/0 loads) for (+0/.0/-0) was total 1.26s (0.80s wait, 0.46s to process) **********
********** Round-trip (with 4 sends/0 loads) for (+0/.0/-0) was total 1.35s (0.77s wait, 0.58s to process) **********
********** Round-trip (with 4 sends/0 loads) for (+2/.4/-0) was total 1.44s (0.24s wait, 1.20s to process) **********
(I added the "wait" timing, which is how long the info sat on the queue before being sent.) Note that "to process" is as fast as in the standalone program. The problem only manifests itself when webpages are being constantly read and parsed. (Note that the parsing itself takes a lot of CPU.)



Edit: Some preliminary tests indicate that I should use a separate process for each webpage load... will post an update once that's up and running.

It's probably the Global Interpreter Lock (GIL). Have you tried the multiprocessing module (largely a drop-in replacement for threading, IIRC)?
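For concreteness, a minimal sketch of that suggestion, assuming the load + parse step can be wrapped in a single function; fetch_and_parse here is a hypothetical placeholder, not the poster's code:

```python
from multiprocessing import Pool

def fetch_and_parse(url):
    # Hypothetical stand-in for the real work: load the page (the
    # "load html" bit) and parse it, returning the extracted data.
    # Running it in a worker process sidesteps the GIL entirely.
    return {"url": url, "size": len(url)}

def process_pages(urls, workers=5):
    # One OS process per worker instead of one thread per page, so the
    # CPU-bound parsing no longer serializes behind one interpreter lock.
    with Pool(workers) as pool:
        return pool.map(fetch_and_parse, urls)
```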


See also

Is there any shared resource among all the different threads? Maybe acquiring and releasing a lock/mutex is causing the long waits? @JustinDanielson: Short answer: no. The "to process" time is the time spent from the urllib2.urlopen call until after the response is closed; it's strictly I/O round-trip time, with no blocking involved. That's the number I want to minimize. I've added some info to my answer, please take a look. Multiprocessing solved everything! Page load time, parse time, and server round-trip time all dropped by a factor of 10. All credit to multiprocessing! (Or alternatively: wow, Python threading really sucks!)