Python: scraping web page data with urllib using headers and a proxy


python proxy web-scraping


I can already fetch the web page data, but now I want to fetch it through a proxy. How can I do that?

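import urllib.request      # "import urllib" alone does not load urllib.request
import lxml.html as lh     # assumption: lh in the original refers to lxml.html

# URL and headers are defined elsewhere in the asker's script
def get_main_html():
    request = urllib.request.Request(URL, headers=headers)
    doc = lh.parse(urllib.request.urlopen(request))
    return doc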

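From the documentation:

urllib will auto-detect your proxy settings and use those. This is done through the ProxyHandler, which is part of the normal handler chain when a proxy setting is detected. Normally that is a good thing, but there are occasions when it may not be helpful. One way to deal with this is to set up our own ProxyHandler with no proxies defined. This is done using steps similar to setting up a Basic Authentication handler.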
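The step described in that passage can be sketched roughly as follows. This is a minimal illustration, not code from the documentation; the helper name build_direct_opener is hypothetical.

```python
import urllib.request

def build_direct_opener():
    """Return an opener that ignores any auto-detected proxy settings."""
    # An empty dict means "no proxies"; passing it explicitly stops urllib
    # from reading proxy settings out of the environment (e.g. http_proxy).
    no_proxy_handler = urllib.request.ProxyHandler({})
    return urllib.request.build_opener(no_proxy_handler)

opener = build_direct_opener()
# Optional: make every later urllib.request.urlopen() call use this opener.
urllib.request.install_opener(opener)
```

Passing a non-empty dict to ProxyHandler instead would route requests through the proxies it names.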

Check this out.

Use:

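import urllib.request
proxies = {'http': 'http://myproxy.example.com:1234'}
print("Using HTTP proxy %s" % proxies['http'])
# Python 3: urlopen() has no proxies= argument, so pass them through a ProxyHandler
opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxies))
opener.open("http://yoursite")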
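For the question's case, where custom headers are sent as well, one possible way to combine the two is sketched below. The function name make_proxy_opener and the header value are hypothetical and used only for illustration; the proxy address is the placeholder from the answer above.

```python
import urllib.request

def make_proxy_opener(proxy_url, headers):
    """Build an opener that sends HTTP and HTTPS requests through proxy_url
    and attaches the given headers to each request."""
    proxy_handler = urllib.request.ProxyHandler(
        {"http": proxy_url, "https": proxy_url}
    )
    opener = urllib.request.build_opener(proxy_handler)
    # addheaders is a list of (name, value) tuples; it replaces urllib's
    # default User-agent entry.
    opener.addheaders = list(headers.items())
    return opener

# Hypothetical values, for illustration only
PROXY_URL = "http://myproxy.example.com:1234"
HEADERS = {"User-Agent": "Mozilla/5.0"}

opener = make_proxy_opener(PROXY_URL, HEADERS)
# response = opener.open("http://yoursite")  # performs the actual request
# html = response.read()
```

The response returned by opener.open() is a file-like object, so it could be handed to a parser in the same way as the result of urlopen() in the question.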

You can use the socks module.

In your case:

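import urllib.request
import lxml.html as lh     # assumption: lh in the original refers to lxml.html
import socks               # third-party module (SocksiPy / PySocks)

# Set the proxy information: a SOCKS5 proxy listening on localhost:9050
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, 'localhost', 9050)
# Wrap the submodule that imports socket; in Python 3 the bare urllib package does not
socks.wrapmodule(urllib.request)

# URL and headers are defined elsewhere in the asker's script
def get_main_html():
    request = urllib.request.Request(URL, headers=headers)
    doc = lh.parse(urllib.request.urlopen(request))
    return doc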
