Proxy: How to use a proxy file in Scrapy?


I got a list of proxies with proxybroker:

sudo pip install proxybroker
proxybroker grab --countries US --limit 100 --outfile proxies.txt
Then I used grep to cut each line down to a bare ip:port pair, such as 104.131.6.78:80:

 grep -oP '([0-9]+\.){3}[0-9]+:[0-9]+' proxies.txt > proxy.csv
All of the proxies in proxy.csv are in this format:

cat proxy.csv
104.131.6.78:80
104.197.16.8:3128
104.131.94.221:8080
63.110.242.67:3128
I wrote my scraper based on the web pages.

Here is the structure of my scraper -- test.py.
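The spider code itself did not carry over here, but the usual pattern is to load the proxy pool and attach one proxy per request through request.meta['proxy'], which Scrapy's built-in HttpProxyMiddleware picks up. A minimal sketch of that pattern (the spider name, start URL, and file path are placeholders, not the original test.py):

import csv
import random

import scrapy

class TestSpider(scrapy.Spider):
    name = "test"  # hypothetical name
    start_urls = ["http://example.com/"]  # placeholder target

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # load the ip:port pool produced by the grep step above
        with open("/tmp/proxy.csv") as f:
            self.ippool = [row[0] for row in csv.reader(f)]

    def start_requests(self):
        for url in self.start_urls:
            # HttpProxyMiddleware honours request.meta['proxy']
            proxy = "http://" + random.choice(self.ippool)
            yield scrapy.Request(url, meta={"proxy": proxy})

    def parse(self, response):
        self.logger.info("downloaded %s", response.url)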

Running the spider with scrapy runspider test.py produces this error message:

Connection was refused by other side: 111: Connection refused.

Using the same proxies obtained from proxybroker, I can download the URL set with my own code instead of Scrapy.
For simplicity, all the broken proxy IPs are kept in the pool rather than removed.
The snippet below only tests whether the proxy IPs can be used; it does not download the complete URL set.
The program is structured as follows:

import csv
import time
import urllib.request

data_dir = "/tmp/"

urls = []  # a list, not a set, so it can be indexed; how to get it is omitted

with open(data_dir + 'proxy.csv') as csvfile:
    ippool = [row[0] for row in csv.reader(csvfile)]
ip_len = len(ippool)
ipth = 0

for ith, url in enumerate(urls):
    time.sleep(2)
    flag = 1
    if ipth >= ip_len:
        ipth = 0
    while ipth < ip_len and flag == 1:
        try:
            # send this request through the current proxy
            handler = urllib.request.ProxyHandler({'http': ippool[ipth]})
            opener = urllib.request.build_opener(handler)
            urllib.request.install_opener(opener)
            response = urllib.request.urlopen(url).read().decode("utf8")
            with open(data_dir + str(ith), "w") as fh:
                fh.write(response)
            ipth = ipth + 1
            flag = 0
            print(url + " downloaded")
        except Exception:
            # advance to the next proxy, otherwise the same dead one loops forever
            ipth = ipth + 1
            print("can not download " + url)
Try using scrapy_proxies. In settings.py you can make the following changes:

# Retry many times since proxies often fail
RETRY_TIMES = 10
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

# Proxy list containing entries like
# http://host1:port
# http://username:password@host2:port
# http://host3:port
# ...
PROXY_LIST = '/path/to/proxy/list.txt'

# Proxy mode
# 0 = Every request uses a different proxy
# 1 = Take only one proxy from the list and assign it to every request
# 2 = Use a custom proxy set in the settings
PROXY_MODE = 0

# If proxy mode is 2, uncomment this line:
#CUSTOM_PROXY = "http://host1:port"
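
The RandomProxy middleware above comes from the scrapy_proxies package, so it needs to be installed first. Note also that PROXY_LIST entries carry an http:// scheme, while the proxy.csv from the question holds bare ip:port pairs; one way to convert it (the output path is whatever you point PROXY_LIST at):

pip install scrapy_proxies
sed 's|^|http://|' proxy.csv > /path/to/proxy/list.txt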

I hope this helps, since it solved my problem as well.

That clearly means the proxies are not working... did you buy them? If they are free, don't expect them to work; most of them don't. I have been using Scrapy with proxies for more than two years, and in my experience this error means your proxy won't let you connect.