
Multiprocessing in a python/beautifulsoup problem

Tags: python, python-2.7, multiprocessing, mechanize, python-multiprocessing

Hi guys, I'm new to Python. What I'm trying to do is move my old code over to multiprocessing, but I'm running into some errors that I hope someone can help me with. My code checks a few thousand links, supplied in a text file, for certain tags, and outputs a link to me once a tag is found. Since there are a few thousand links to check, speed is an issue, hence the need to move to multiprocessing.

Update: I'm getting HTTP 503 errors back. Am I making too many requests, or am I missing something?
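If the 503s are the site rate-limiting you, backing off between retries usually helps. A minimal sketch, assuming the same mechanize Browser as below; open_with_backoff and its parameters are illustrative, not part of the original code:

import time
import urllib2  # mechanize raises urllib2.HTTPError subclasses on 4xx/5xx

def open_with_backoff(br, url, attempts=3, base_delay=2):
    # Hypothetical helper: retry on HTTP 503 with exponential backoff.
    for attempt in range(attempts - 1):
        try:
            return br.open(url, timeout=15)
        except urllib2.HTTPError as e:
            if e.code != 503:
                raise                               # a different HTTP error: give up now
            time.sleep(base_delay * 2 ** attempt)   # wait 2s, 4s, ... before retrying
    return br.open(url, timeout=15)                 # final attempt, errors propagate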

Multiprocessing code:

from mechanize import Browser
from bs4 import BeautifulSoup
import sys
import socket
from multiprocessing.dummy import Pool  # This is a thread-based Pool
from multiprocessing import cpu_count

br = Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

no_stock = []

def main(lines):
    done = False
    tries = 1
    while tries and not done:
        try:
            r = br.open(lines, timeout=15)
            r = r.read()
            soup = BeautifulSoup(r,'html.parser')
            done = True # exit the loop
        except socket.timeout:
            print('Failed socket retrying')
            tries -= 1 # to exit when tries == 0
        except Exception as e: 
            print '%s: %s' % (e.__class__.__name__, e)
            print sys.exc_info()[0]
            tries -= 1 # to exit when tries == 0
    if not done:
        print('Failed for {}\n'.format(lines))
    table = soup.find_all('div', {'class' : "empty_result"})
    results = soup.find_all('strong', style = 'color: red;')
    if table or results:
        no_stock.append(lines)

if __name__ == "__main__":
    r = br.open('http://www.randomweb.com/') #avoid redirection
    fileName = "url.txt"
    pool = Pool(processes=2)
    with open(fileName, "r+") as f:
        lines = pool.map(main, f)
    with open('no_stock.txt','w') as f :
        f.write('No. of out of stock items : '+str(len(no_stock))+'\n'+'\n')
    for i in no_stock:
        f.write(i + '\n')
Traceback:

Traceback (most recent call last):
  File "test2.py", line 43, in <module>
    lines = pool.map(main, f)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 567, in get
    raise self._value
UnboundLocalError: local variable 'soup' referenced before assignment
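
The UnboundLocalError itself is straightforward: soup is only assigned inside the try, so when every attempt fails the function still falls through to soup.find_all(...) with soup never bound. A stripped-down illustration of the same failure mode:

def parse():
    tries = 1
    while tries:
        try:
            raise IOError('open failed')  # stands in for br.open() timing out
            soup = 'parsed page'          # never reached
        except IOError:
            tries -= 1                    # attempts exhausted, loop exits
    return soup.upper()                   # UnboundLocalError: 'soup' was never assigned

parse()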

pool.map takes two arguments: the first is a function, which in your code is main; the other is an iterable, and every item of the iterable becomes an argument to that function, which here means every line of the file.
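
As a tiny, self-contained illustration of that calling convention (the data here is made up):

from multiprocessing.dummy import Pool  # thread-based, same API as multiprocessing.Pool

def shout(word):                 # the function: called once per item
    return word.upper()

pool = Pool(2)
print pool.map(shout, ['a', 'b', 'c'])  # each item becomes one call's argument
# -> ['A', 'B', 'C']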

Comments:
- You need to add the traceback to your main post.
- Hi Morgan, thanks for the reply, but I don't quite understand what you mean. Could you explain a bit more?
- You need to edit your main post to add the error.
- @AidanKane the problem is the OP's indentation; don't modify it. I rolled back the indentation change in the edit. Whitespace matters in Python; editing the asker's indentation makes it a different question. To be sure: editing the whitespace in a question is the exact opposite of helping.
- Thanks for the tip. I've made some fixes to the iterable in the other part, but the code still doesn't seem to work. This is the error it gives me.
- Can you post the full traceback? And why do you need tqdm? The value of line passed into main is one line of the file, which corresponds to a website, so you can just pass the line straight to br.open.
- I use tqdm just to let me know how things are progressing, because sometimes my program gets stuck in the middle, so it gives me a point of reference.
- I've updated the traceback in my main post using the code provided.
- I set the variable soup to global, and it came back with another error, which I've since lost.
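
On the tqdm point: a hedged sketch of wiring a progress bar over the pool, assuming tqdm is installed and a main like the one above; pool.imap yields results as workers finish, so the bar advances live:

from multiprocessing.dummy import Pool
from tqdm import tqdm

pool = Pool(4)
with open('url.txt') as f:
    urls = f.read().splitlines()
for _ in tqdm(pool.imap(main, urls), total=len(urls)):
    pass  # main() works by side effect; we only consume the iterator to drive the bar

For reference, url.txt holds one URL per line: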
http://www.randomweb.com/item.htm?uuid=44733096229
http://www.randomweb.com/item.htm?uuid=4473309622789
http://www.randomweb.com/item.htm?uuid=447330962291
...etc

Updated code:
from mechanize import Browser
from bs4 import BeautifulSoup
import sys
import socket
from multiprocessing.dummy import Pool  # This is a thread-based Pool
from multiprocessing import cpu_count

br = Browser()

no_stock = []

def main(line):
    line = line.strip()  # drop the trailing newline read from the file
    done = False
    tries = 3
    while tries and not done:
        try:
            r = br.open(line, timeout=15)
            r = r.read()
            soup = BeautifulSoup(r, 'html.parser')
            done = True  # exit the loop
        except socket.timeout:
            print('Failed socket retrying')
            tries -= 1  # to exit when tries == 0
        except:
            print('Random fail retrying')
            print sys.exc_info()[0]
            tries -= 1  # to exit when tries == 0
    if not done:
        print('Failed for {}\n'.format(line))
        return  # soup was never assigned, so there is nothing to parse
    table = soup.find_all('div', {'class': "empty_result"})
    results = soup.find_all('strong', style='color: red;')
    if table or results:
        no_stock.append(line)

if __name__ == "__main__":
    fileName = "url.txt"
    pool = Pool(cpu_count() * 2)  # Creates a Pool with cpu_count * 2 threads.
    with open(fileName, "rb") as f:
        pool.map(main, f)
    with open('no_stock.txt', 'w') as f:
        f.write('No. of out of stock items : ' + str(len(no_stock)) + '\n\n')
        for line in no_stock:  # keep writing while the file is still open
            f.write(line + '\n')
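
One last design note, offered tentatively: appending to the shared no_stock list only works because multiprocessing.dummy uses threads in a single process; with a real process Pool each worker would mutate its own copy. The portable pattern is to return a value from the worker and collect the results of pool.map. A sketch under that assumption (check and its stand-in flag are illustrative):

from multiprocessing.dummy import Pool
from multiprocessing import cpu_count

def check(line):
    line = line.strip()
    # ... open the page and parse it with BeautifulSoup, as in main() ...
    out_of_stock = True          # stand-in for the table/results check
    return line if out_of_stock else None

if __name__ == "__main__":
    pool = Pool(cpu_count() * 2)
    with open("url.txt") as f:
        flagged = [url for url in pool.map(check, f) if url is not None]
    with open("no_stock.txt", "w") as out:
        out.write("No. of out of stock items : %d\n\n" % len(flagged))
        for url in flagged:
            out.write(url + "\n")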