Proxy: how to use a proxy file in Scrapy?
I got a proxy list with proxybroker:
sudo pip install proxybroker
proxybroker grab --countries US --limit 100 --outfile proxies.txt
I used grep to change the entries to the plain 104.131.6.78:80 format (the pattern needs to be quoted so the shell does not mangle it):

grep -oP '([0-9]+\.){3}[0-9]+:[0-9]+' proxies.txt > proxy.csv

All proxies in proxy.csv now have the following format:
cat proxy.csv
104.131.6.78:80
104.197.16.8:3128
104.131.94.221:8080
63.110.242.67:3128
I wrote my spider based on the web page; test.py is my skeleton. Running the spider with
scrapy runspider test.py produces the error message:

Connection was refused by other side: 111: Connection refused.

Using the same proxies obtained from proxybroker, I downloaded the URL set in my own way instead of with Scrapy.
For simplicity, all broken proxy IPs are kept rather than removed.
The snippet below only tests whether the proxy IPs are usable; it does not download the full URL set.
The program structure is as follows:
import time
import csv, os, urllib.request

data_dir = "/tmp/"
urls = []  # omit how to get it.
csvfile = open(data_dir + 'proxy.csv')
reader = csv.reader(csvfile)
ippool = [row[0] for row in reader]
csvfile.close()
ip_len = len(ippool)
ipth = 0
for ith, item in enumerate(urls):
    time.sleep(2)
    flag = 1
    if ipth >= ip_len:
        ipth = 0
    while ipth < ip_len and flag == 1:
        try:
            handler = urllib.request.ProxyHandler({'http': ippool[ipth]})
            opener = urllib.request.build_opener(handler)
            urllib.request.install_opener(opener)
            response = urllib.request.urlopen(item).read().decode("utf8")
            fh = open(data_dir + str(ith), "w")
            fh.write(response)
            fh.close()
            flag = 0
            print(item + " downloaded")
        except Exception:
            print("can not download " + item)
        ipth = ipth + 1  # advance to the next proxy whether or not this one worked
Try using scrapy_proxies. In settings.py you can make the following changes:
# Retry many times since proxies often fail
RETRY_TIMES = 10
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]
DOWNLOADER_MIDDLEWARES = {
'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
'scrapy_proxies.RandomProxy': 100,
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}
# Proxy list containing entries like
# http://host1:port
# http://username:password@host2:port
# http://host3:port
# ...
PROXY_LIST = '/path/to/proxy/list.txt'
# Proxy mode
# 0 = Every request has a different proxy
# 1 = Take only one proxy from the list and assign it to every request
# 2 = Put a custom proxy to use in the settings
PROXY_MODE = 0
# If proxy mode is 2, uncomment this line:
#CUSTOM_PROXY = "http://host1:port"
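Note that scrapy_proxies expects each PROXY_LIST entry to carry a scheme (http://host:port), while proxy.csv holds bare host:port pairs. A small helper to bridge the two formats (the function name is illustrative):

```python
def to_scrapy_proxy_list(entries):
    """Prepend the http:// scheme that scrapy_proxies' PROXY_LIST expects."""
    return ["http://" + e.strip() for e in entries if e.strip()]


# Feed each line of proxy.csv through this helper and write the
# result to the file named in PROXY_LIST.
```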
Hope this helps; it solved my problem as well.

Comments: This clearly means the proxies are not working... Did you buy them? If these are free proxies, don't expect them to work; most of them won't. I have been using Scrapy with proxies for more than two years, and in my experience this error means your proxy does not allow you to connect.
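Since most free proxies are indeed dead, it can help to filter the pool before feeding it to Scrapy. A rough liveness probe using only the standard library (the test URL and timeout are arbitrary choices, not part of the original setup):

```python
import urllib.request


def proxy_alive(proxy, test_url="http://httpbin.org/ip", timeout=5):
    """Return True if test_url can be fetched through the given host:port proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy})
    opener = urllib.request.build_opener(handler)
    try:
        opener.open(test_url, timeout=timeout)
        return True
    except Exception:
        return False
```

Running each entry of proxy.csv through this check and keeping only the survivors should remove most of the "Connection refused" failures up front.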