Fetching and parsing proxies from URLs in Python 3


I'm trying to fetch and parse proxies from different proxy list websites.
Here is what I've come up with so far:

#!/usr/bin/python3

from tqdm import tqdm
import requests
import time
import sys
import re

proxies = []

def fetchAndParseProxies(url, custom_regex):
    # Download the list, substitute the %ip%/%port% placeholders with real
    # capture groups, and collect every "ip:port" match.
    n = 0
    try:
        proxylist = requests.get(url, timeout=15).text
        proxylist = proxylist.replace('null', 'N/A')
        custom_regex = custom_regex.replace('%ip%', '([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3})')
        custom_regex = custom_regex.replace('%port%', '([0-9]{1,5})')
        for proxy in re.findall(re.compile(custom_regex), proxylist):
            proxies.append(proxy[0] + ":" + proxy[1])
            n += 1
    except:
        sys.stdout.write("{0:>5} proxies fetched from {1}\n".format('0', url))

proxysources = [
    ["http://spys.one/en", "tr>%ip%%port%(.*?){2}(.*?)"],
    #["http://www.httptunnel.ge/ProxyListForFree.aspx", ' target="_new">%ip%:%port%'],
    #["https://www.us-proxy.org/", "%ip%%port%(.*){2}(.*)"],
    #["https://free-proxy-list.net/", "%ip%%port%(.*){2}(.*)"],
    #["https://www.sslproxies.org/", "%ip%%port%(.*){2}(.*)"],
    #["https://www.proxy-list.download/api/v0/get?l=en&t=https", '"IP": "%ip%", "PORT": "%port%",'],
    #["https://api.proxyscrape.com/?request=getproxies&proxytype=http&timeout=5000&country=all&anonymity=elite&ssl=all", "%ip%:%port%"],
    #["http://free-proxy.cz/en/proxylist/country/all/http/ping/level1", "%ip%%port%(.*){2}(.*)"],
    ["https://www.proxy-list.download/HTTPS", "%ip%%port%(.*){2}(.*)"],
    ["https://www.proxy-list.download/HTTP", "%ip%%port%(.*){2}(.*)"],
    ["http://www.freeproxylists.net/", "%ip%%port%(.*){2}(.*)"],
    ["https://www.proxynova.com/proxy-server-list/", "%ip%%port%(.*){2}(.*)"],
    ["http://www.freeproxylists.net/", "%ip%%port%(.*){2}(.*)"],
]

loop = tqdm(total=len(proxysources), position=0, leave=False)
for source in proxysources:
    loop.set_description('fetching...')
    fetchAndParseProxies(source[0], source[1])
    loop.update(1)
loop.close()

print(len(proxies), "Proxies Fetched.")
My output:

0  Proxies Fetched.
As you can see, the problem is that it reports 0 proxies fetched for the uncommented lines, even though the structure of those sites looks the same to me. I must have made a mistake somewhere in the regexes, but I can't find where.

I would really appreciate your help.
In the meantime, I'll keep looking into it and will post an update if I find anything.
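
One way to narrow this down is to test a single source and a single pattern in isolation, and to print what the substituted regex actually looks like before searching with it. A minimal sketch, assuming free-proxy-list.net as the test source and a plain "<td>ip</td><td>port</td>" table layout (both are assumptions for illustration, not taken from the script above):

import re
import requests

# Hypothetical single-source check: substitute the placeholders exactly as
# fetchAndParseProxies() does, then print the resulting pattern and its matches.
url = 'https://free-proxy-list.net/'            # assumed test source
custom_regex = '<td>%ip%</td><td>%port%</td>'   # assumed table markup

custom_regex = custom_regex.replace('%ip%', r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})')
custom_regex = custom_regex.replace('%port%', r'(\d{1,5})')
print('testing:', custom_regex)                 # inspect the final pattern

html = requests.get(url, timeout=15).text
matches = re.findall(custom_regex, html)
print(len(matches), 'matches, first few:', matches[:3])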

This script will fetch proxies from http://spys.one/en, but a similar approach can be used for other proxy lists:

import requests
from bs4 import BeautifulSoup


ports_url = 'http://spys.one/proxy-port/'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}

# Load the port-overview page; xpp=5 is the form value for the largest page size.
soup = BeautifulSoup(requests.post(ports_url, headers=headers, data={'xpp': 5}).content, 'html.parser')
for f in soup.select('td[colspan="2"] > a > font.spy6'):
    # Each font.spy6 link is a port number; follow it to that port's proxy table.
    u = 'http://spys.one/proxy-port/' + f.text + '/'
    s = BeautifulSoup(requests.post(u, headers=headers, data={'xpp': 5}).content, 'html.parser')
    for ff in s.select('tr > td:nth-child(1) > font.spy14'):
        print(ff.text)  # the first cell of each row holds the ip:port string
Prints:

81.17.131.61:8080
200.108.183.2:8080
105.209.182.128:8080
45.77.63.202:8080
94.158.152.54:8080
50.233.228.147:8080
142.44.148.56:8080
52.138.1.43:8080
68.183.202.221:8080
103.52.135.60:8080
104.238.174.173:8080
181.129.219.133:8080
183.89.147.40:8080
51.38.71.101:8080
103.112.61.162:8080
131.221.228.9:8080
49.0.65.246:8080
45.32.176.57:8080
104.238.185.153:8080
155.138.146.210:8080
203.76.124.35:8080
182.253.6.234:8080
36.90.93.20:8080
207.182.135.52:8080
165.16.109.50:8080
202.142.178.98:8080
103.123.246.66:8080
185.36.157.30:8080
103.104.213.227:8080
68.188.63.149:8080
136.244.113.206:3128
54.39.91.84:3128
198.13.36.75:3128
93.153.173.102:3128
161.35.110.112:3128

... and so on.
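
For other, simpler table-based lists the same idea carries over. A minimal sketch, assuming free-proxy-list.net serves a plain HTML table whose first two cells per row are the IP and port (the site and its layout are assumptions here, not something the answer above verifies):

import requests
from bs4 import BeautifulSoup

# Assumed layout: each table row's first two <td> cells hold IP and port.
url = 'https://free-proxy-list.net/'
headers = {'User-Agent': 'Mozilla/5.0'}

soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
for row in soup.select('table tbody tr'):
    cells = row.find_all('td')
    if len(cells) >= 2:
        print(cells[0].text + ':' + cells[1].text)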


Can you parse the proxy sites with beautifulsoup?
@AndrejKesely I've never used beautifulsoup before, so I don't know whether that's possible, but I'll take a look, thanks.
Thanks for your answer, it runs smoothly. Is it possible to change the number of proxies shown in the "show" dropdown with BeautifulSoup?
@Unicyclist I've updated my answer; it now shows 500 proxies per port.
Thank you very much! So xpp corresponds to the option in that selector, if I understand correctly?
@Unicyclist Yes, it's POST data sent to the server. You can watch what is sent to the server by opening Firefox Developer Tools -> Network tab (Chrome has a similar feature).
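
Following up on the xpp exchange above, a minimal sketch of replaying that POST for a single port page, assuming (as the comments state) that xpp is the form field behind the "show" dropdown and that 5 selects the largest page size; the 8080 port path is just an example:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}

# xpp mirrors the page-size dropdown; per the comments above, 5 requests
# the largest page (up to 500 rows). The port path 8080 is an example.
resp = requests.post('http://spys.one/proxy-port/8080/', headers=headers, data={'xpp': 5})
soup = BeautifulSoup(resp.content, 'html.parser')
rows = soup.select('tr > td:nth-child(1) > font.spy14')
print(len(rows), 'proxies returned')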