Python: Connection refused when using Scrapy with Selenium

I'm trying to use Scrapy along with Selenium to scrape a page with dynamically generated JavaScript content (). I keep getting connection refused, but I'm not sure whether it's something I'm doing or the server itself (it's in China, so maybe there's some kind of firewall issue?).

Here's what I'm getting:

Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 4, in <module>
    execute()
  File "/usr/local/lib/python2.7/dist-packages/scrapy/cmdline.py", line 142, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/cmdline.py", line 88, in _run_print_help
    func(*a, **kw)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/cmdline.py", line 149, in _run_command
    cmd.run(args, opts)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/commands/crawl.py", line 48, in run
    spider = crawler.spiders.create(spname, **opts.spargs)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermanager.py", line 48, in create
    return spcls(**spider_kwargs)
  File "/opt/bitnami/apps/wordpress/htdocs/data/sina_crawler/sina_crawler/spiders/sspider.py", line 18, in __init__
    self.selenium.start()
  File "/usr/local/lib/python2.7/dist-packages/selenium/selenium.py", line 197, in start
    result = self.get_string("getNewBrowserSession", start_args)
  File "/usr/local/lib/python2.7/dist-packages/selenium/selenium.py", line 231, in get_string
    result = self.do_command(verb, args)
  File "/usr/local/lib/python2.7/dist-packages/selenium/selenium.py", line 220, in do_command
    conn.request("POST", "/selenium-server/driver/", body, headers)
  File "/usr/lib/python2.7/httplib.py", line 958, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib/python2.7/httplib.py", line 992, in _send_request
    self.endheaders(body)
  File "/usr/lib/python2.7/httplib.py", line 954, in endheaders
    self._send_output(message_body)
  File "/usr/lib/python2.7/httplib.py", line 814, in _send_output
    self.send(msg)
  File "/usr/lib/python2.7/httplib.py", line 776, in send
    self.connect()
  File "/usr/lib/python2.7/httplib.py", line 757, in connect
    self.timeout, self.source_address)
  File "/usr/lib/python2.7/socket.py", line 571, in create_connection
    raise err
socket.error: [Errno 111] Connection refused
Exception socket.error: error(111, 'Connection refused') in <bound method SeleniumSpider.__del__ of <SeleniumSpider 'SeleniumSpider' at 0x1e246d0>> ignored

For reference, I'm following this example code:
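The linked example itself is missing above, but the traceback shows a spider that starts the legacy Selenium RC client inside its __init__. Here is a minimal sketch of that kind of setup, assuming the old selenium.selenium RC API; the URL is a placeholder, and this is a reconstruction from the traceback, not the original code:

from scrapy.spider import BaseSpider
from selenium import selenium

class SeleniumSpider(BaseSpider):
    name = "SeleniumSpider"
    start_urls = ["http://example.com/"]  # placeholder; the real URL was elided

    def __init__(self):
        BaseSpider.__init__(self)
        # The RC client does not drive a browser directly: it POSTs commands to a
        # separately launched Selenium RC server (java -jar selenium-server.jar),
        # assumed here to be listening on localhost:4444.
        self.selenium = selenium("localhost", 4444, "*firefox", "http://example.com/")
        # start() asks the RC server for a new browser session; if nothing is
        # listening on that port, it fails with [Errno 111] Connection refused,
        # exactly like the traceback above.
        self.selenium.start()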

That's most likely server-side behavior. Have you visited their site? I just fetched the page with urllib without any problem (obviously without the JavaScript), so I doubt they're using sophisticated methods to detect bots.
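To reproduce that check yourself under Python 2.7 (matching the traceback), a plain fetch without any JavaScript execution could look like this; the URL is a placeholder for the elided link in the question:

import urllib

# Plain HTTP GET, no JavaScript execution; example.com stands in for the
# page URL that was elided from the question.
response = urllib.urlopen("http://example.com/")
print response.getcode(), len(response.read())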

My guess is that you're making too many requests in a short period of time. The way I handle this is to catch the ConnectionError, then take a break with time.sleep(600), and retry the connection afterwards. You could also count the attempts and, after 4 or 5 tries, raise the ConnectionError and give up. It would look something like this:

import logging
import time

def parse(self, url, retry=0, max_retry=5):
    try:
        req = self.sel.open(url)  # self.sel: the Selenium client
    except ConnectionError:
        if retry > max_retry:
            raise  # give up after max_retry attempts
        logging.error('Connection error, resting...')
        time.sleep(100)
        self.parse(url, retry + 1, max_retry)
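One caveat with that sketch: ConnectionError only exists as a builtin on Python 3 (it was added in 3.3). On the Python 2.7 stack shown in the traceback, you would catch socket.error, or whatever exception class your HTTP library raises, instead.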

At first I also thought it was server-side, but by now I'm fairly sure the problem is in my code. I still don't know where, though. I'm almost certain it's not from making too many requests to their site, since so far I've only run the spider 15 times, and at the moment the script loads just a single page once.