Various urllib2 errors when running Selenium WebDriver on a VPS

python selenium-webdriver web-scraping


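I'm using Selenium with the Python bindings to scrape AJAX content from a web page with headless Firefox. It works perfectly when run on my local machine. When I run the exact same script on my VPS, errors are thrown on seemingly random (yet consistent) lines. My local and remote systems have exactly the same OS and architecture, so I'm guessing the difference is VPS-related.

For each of these tracebacks, the line runs 4 times before an error is thrown.

I most often get this URLError when executing JavaScript to scroll an element into view: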

  File "google_scrape.py", line 18, in _get_data
    driver.execute_script("arguments[0].scrollIntoView(true);", e)
  File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 396, in execute_script
    {'script': script, 'args':converted_args})['value']
  File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 162, in execute
    response = self.command_executor.execute(driver_command, params)
  File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 355, in execute
    return self._request(url, method=command_info[0], data=data)
  File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 402, in _request
    response = opener.open(request)
  File "/usr/lib64/python2.7/urllib2.py", line 404, in open
    response = self._open(req, data)
  File "/usr/lib64/python2.7/urllib2.py", line 422, in _open
    '_open', req)
  File "/usr/lib64/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 1214, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib64/python2.7/urllib2.py", line 1184, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error [Errno 111] Connection refused>

A couple of times I've gotten a socket error:

  File "google_scrape.py", line 19, in _get_data
    if e.text.strip():
  File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webelement.py", line 55, in text
    return self._execute(Command.GET_ELEMENT_TEXT)['value']
  File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webelement.py", line 233, in _execute
    return self._parent.execute(command, params)
  File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 162, in execute
    response = self.command_executor.execute(driver_command, params)
  File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 355, in execute
    return self._request(url, method=command_info[0], data=data)
  File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 402, in _request
    response = opener.open(request)
  File "/usr/lib64/python2.7/urllib2.py", line 404, in open
    response = self._open(req, data)
  File "/usr/lib64/python2.7/urllib2.py", line 422, in _open
    '_open', req)
  File "/usr/lib64/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 1214, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib64/python2.7/urllib2.py", line 1187, in do_open
    r = h.getresponse(buffering=True)
  File "/usr/lib64/python2.7/httplib.py", line 1045, in getresponse
    response.begin()
  File "/usr/lib64/python2.7/httplib.py", line 409, in begin
    version, status, reason = self._read_status()
  File "/usr/lib64/python2.7/httplib.py", line 365, in _read_status
    line = self.fp.readline(_MAXLINE + 1)
  File "/usr/lib64/python2.7/socket.py", line 476, in readline
    data = self._sock.recv(self._rbufsize)
socket.error: [Errno 104] Connection reset by peer

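I'm scraping from Google without a proxy, so my first thought was that my IP address is recognized as a VPS and put under something like a 5-time page manipulation limit. But my initial research indicates that these errors would not arise from being blocked.

Any insight into what these errors mean collectively, or into the necessary considerations when making HTTP requests from a VPS, would be much appreciated.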

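Update: After a little thinking and reading about what a webdriver really is (automated browser input), I should have been confused about why remote_connection.py is making urllib2 requests at all. It would seem that the text method of the WebElement class is an "extra" feature of the Python bindings that isn't part of the Selenium core. That doesn't explain the errors above, but it may indicate that the text method shouldn't be used for scraping.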
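For readers puzzled by the same thing: the tracebacks above show remote_connection.py sending each command (execute_script, reading text) as an HTTP request through urllib2, so an error such as [Errno 111] Connection refused is raised while connecting to whichever host and port the driver's command endpoint uses. The sketch below is not part of the original script. It only reproduces that errno with the standard library by connecting to a port that nothing is listening on; the helper name probe and the placeholder socket are hypothetical.

```python
import errno
import socket


def probe(host, port, timeout=2.0):
    """Attempt a TCP connect; return None on success or an errno value on failure."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect((host, port))
        return None
    except socket.timeout:
        # Must be caught before socket.error, which is its parent class.
        return errno.ETIMEDOUT
    except socket.error as exc:
        return exc.errno
    finally:
        sock.close()


# Reserve a local port but never call listen(), so no process accepts connections.
placeholder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
placeholder.bind(("127.0.0.1", 0))
port = placeholder.getsockname()[1]

result = probe("127.0.0.1", port)
placeholder.close()

# Print the symbolic name of the errno that the connect attempt produced.
print(errno.errorcode.get(result, result))
```

On Linux, errno 111 is ECONNREFUSED, which is what a refused connect reports. It says the endpoint itself rejected the connection; on its own it does not say why nothing was listening there.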

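Update 2: I realized that, for my purposes, Selenium's only job is getting the AJAX content to load. So after the page loads, I parse the source with lxml rather than getting elements with Selenium, i.e.:

html = lxml.html.fromstring(driver.page_source)

However, page_source is yet another call that results in a urllib2 request, and I consistently get a BadStatusLine error the second time I use it. Minimizing urllib2 requests is definitely a step in the right direction.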

Update 3: Eliminating urllib2 requests by grabbing the source with JavaScript is better yet:

html = lxml.html.fromstring(driver.execute_script("return window.document.documentElement.outerHTML"))
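A minimal sketch of the "load once, snapshot, then stop talking to the browser" idea follows. It assumes driver is an already created Selenium WebDriver instance and uses only its get, execute_script and quit methods; the function name snapshot_and_quit is hypothetical, and waiting for the AJAX content to finish loading is deliberately left out.

```python
# JavaScript snippet taken from the line above: returns the rendered DOM as a string.
OUTER_HTML_JS = "return window.document.documentElement.outerHTML"


def snapshot_and_quit(driver, url):
    """Load url, grab the rendered DOM with one script call, then close the browser.

    Only three driver commands are issued in total, which keeps the number of
    HTTP round trips made by the bindings as small as possible.
    """
    try:
        driver.get(url)
        return driver.execute_script(OUTER_HTML_JS)
    finally:
        # Always shut the browser down, even if loading or the script call fails.
        driver.quit()
```

The returned string can then be handed to lxml.html.fromstring exactly as in the line above, with no further calls to the driver.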

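Conclusion: These errors can be avoided by doing a time.sleep(10) between every few requests. The best explanation I've come up with is that Google's firewall recognizes my IP as a VPS and therefore puts it under a stricter set of blocking rules.

This was my initial thought, but I still find it hard to believe, because my web searches turned up no indication that the errors above could be caused by a firewall.

If this is the case, though, I would think the stricter rules could be circumvented with a proxy, although that proxy might have to be a local system or Tor to avoid the same restrictions.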

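As per our conversation, you discovered that Google has anti-scraping blocks in place even for a small amount of daily scraping. The solution is to delay by a few seconds between each fetch.

In the general case, since you are technically transferring a non-recoupable cost onto a third party, it is good practice to minimize the extra resource load you place on the remote server. Without pauses between HTTP fetches, a fast server and connection can cause a remote denial of service, especially for scrape targets that do not have Google's server resources.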
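One way to sketch the "pause between fetches" advice is a small throttle object. This is only an illustration: the class name Throttle, the jitter option and the injectable clock and sleep arguments are assumptions added here so that the behaviour can be checked without actually waiting, and the delay values are up to you.

```python
import random
import time


class Throttle(object):
    """Keep successive calls to wait() at least min_delay seconds apart."""

    def __init__(self, min_delay, jitter=0.0, clock=time.time, sleep=time.sleep):
        self.min_delay = min_delay  # minimum spacing between fetches, in seconds
        self.jitter = jitter        # upper bound of extra random delay, in seconds
        self._clock = clock
        self._sleep = sleep
        self._last = None           # time of the previous wait() call, if any

    def wait(self):
        """Sleep only for the part of the delay that has not already elapsed."""
        now = self._clock()
        if self._last is not None:
            target = self.min_delay + random.uniform(0.0, self.jitter)
            remaining = target - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Typical use (the loop body is a placeholder for whatever fetches a page):
#     throttle = Throttle(min_delay=5.0, jitter=2.0)
#     for url in urls:
#         throttle.wait()
#         ...fetch url...
```

Compared with an unconditional time.sleep, this only sleeps for the time still owed, so slow page loads are not penalized twice.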

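If you're scraping Google search results, a headless browser is (imo) an overly complex approach. Consider a non-JavaScript scraper such as Scrapy; Google works fine without client-side scripting. Better yet, can you use the Google Search API?

Thanks for your attention. I'm actually not scraping search results, which is why I was reluctant to mention that Google is the domain. I'm scraping AJAX content, so I need something to load the JavaScript. That said, once the content is loaded there is no reason to keep using Selenium, so I'm currently modifying my script to read the webdriver's page_source immediately after the page loads, close the driver, and then parse the source with lxml.

"I'm scraping AJAX content, so I need something to load the JavaScript." Can you connect directly to that URL? If so, you may not need to run JavaScript at all, unless the response actually contains JavaScript. If it just contains JSON/HTML/XML, and generating the URL doesn't require JavaScript, then you can use Scrapy.

OK, it's certainly worth delaying by a few seconds between each request, just in case.

Yes @halfer, pauses between requests ultimately solved my whole problem, although minimizing the requests in the first place was very helpful. If you put that in an answer, I'll accept it.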
