Fetching and parsing proxies from URLs in Python 3


I'm trying to fetch and parse proxies from different proxy list websites.
Here is what I've come up with so far:

#!/usr/bin/python3

from tqdm import tqdm
import requests
import time
import sys
import re

proxies = []

def fetchAndParseProxies(url, custom_regex):
    # Download the list, substitute the %ip%/%port% placeholders with real
    # capture groups, and collect every "ip:port" match.
    n = 0
    try:
        proxylist = requests.get(url, timeout=15).text
        proxylist = proxylist.replace('null', 'N/A')
        custom_regex = custom_regex.replace('%ip%', '([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3})')
        custom_regex = custom_regex.replace('%port%', '([0-9]{1,5})')
        for proxy in re.findall(re.compile(custom_regex), proxylist):
            proxies.append(proxy[0] + ":" + proxy[1])
            n += 1
    except:
        sys.stdout.write("{0:>5} proxies fetched from {1}\n".format('0', url))

proxysources = [
    ["http://spys.one/en", "tr>%ip%%port%(.*?){2}(.*?)"],
    #["http://www.httptunnel.ge/ProxyListForFree.aspx", ' target="_new">%ip%:%port%'],
    #["https://www.us-proxy.org/", "%ip%%port%(.*){2}(.*)"],
    #["https://free-proxy-list.net/", "%ip%%port%(.*){2}(.*)"],
    #["https://www.sslproxies.org/", "%ip%%port%(.*){2}(.*)"],
    #["https://www.proxy-list.download/api/v0/get?l=en&t=https", '"IP": "%ip%", "PORT": "%port%",'],
    #["https://api.proxyscrape.com/?request=getproxies&proxytype=http&timeout=5000&country=all&anonymity=elite&ssl=all", "%ip%:%port%"],
    #["http://free-proxy.cz/en/proxylist/country/all/http/ping/level1", "%ip%%port%(.*){2}(.*)"],
    ["https://www.proxy-list.download/HTTPS", "%ip%%port%(.*){2}(.*)"],
    ["https://www.proxy-list.download/HTTP", "%ip%%port%(.*){2}(.*)"],
    ["http://www.freeproxylists.net/", "%ip%%port%(.*){2}(.*)"],
    ["https://www.proxynova.com/proxy-server-list/", "%ip%%port%(.*){2}(.*)"],
    ["http://www.freeproxylists.net/", "%ip%%port%(.*){2}(.*)"],
]

loop = tqdm(total=len(proxysources), position=0, leave=False)
for source in proxysources:
    loop.set_description('fetching...')
    fetchAndParseProxies(source[0], source[1])
    loop.update(1)
loop.close()

print(len(proxies), "Proxies Fetched.")
My output:

0  Proxies Fetched.
As you can see, the problem is that it reports 0 proxies fetched for the uncommented lines, even though the structure of those sites looks the same to me. I must have made a mistake somewhere in the regexes, but I can't find where.

I would really appreciate your help.
In the meantime, I'll keep looking into it and will post an update if I find anything.
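
One way to narrow this down is to test a single source and a single pattern in isolation, and to print what the substituted regex actually looks like before searching with it. A minimal sketch, assuming free-proxy-list.net as the test source and a plain "<td>ip</td><td>port</td>" table layout (both are assumptions for illustration, not taken from the script above):

import re
import requests

# Hypothetical single-source check: substitute the placeholders exactly as
# fetchAndParseProxies() does, then print the resulting pattern and its matches.
url = 'https://free-proxy-list.net/'            # assumed test source
custom_regex = '<td>%ip%</td><td>%port%</td>'   # assumed table markup

custom_regex = custom_regex.replace('%ip%', r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})')
custom_regex = custom_regex.replace('%port%', r'(\d{1,5})')
print('testing:', custom_regex)                 # inspect the final pattern

html = requests.get(url, timeout=15).text
matches = re.findall(custom_regex, html)
print(len(matches), 'matches, first few:', matches[:3])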

This script will fetch proxies from http://spys.one/en, but a similar approach can be used for other proxy lists:

import requests
from bs4 import BeautifulSoup


ports_url = 'http://spys.one/proxy-port/'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}

# Load the port-overview page; xpp=5 is the form value for the largest page size.
soup = BeautifulSoup(requests.post(ports_url, headers=headers, data={'xpp': 5}).content, 'html.parser')
for f in soup.select('td[colspan="2"] > a > font.spy6'):
    # Each font.spy6 link is a port number; follow it to that port's proxy table.
    u = 'http://spys.one/proxy-port/' + f.text + '/'
    s = BeautifulSoup(requests.post(u, headers=headers, data={'xpp': 5}).content, 'html.parser')
    for ff in s.select('tr > td:nth-child(1) > font.spy14'):
        print(ff.text)  # the first cell of each row holds the ip:port string
Prints:

81.17.131.61:8080
200.108.183.2:8080
105.209.182.128:8080
45.77.63.202:8080
94.158.152.54:8080
50.233.228.147:8080
142.44.148.56:8080
52.138.1.43:8080
68.183.202.221:8080
103.52.135.60:8080
104.238.174.173:8080
181.129.219.133:8080
183.89.147.40:8080
51.38.71.101:8080
103.112.61.162:8080
131.221.228.9:8080
49.0.65.246:8080
45.32.176.57:8080
104.238.185.153:8080
155.138.146.210:8080
203.76.124.35:8080
182.253.6.234:8080
36.90.93.20:8080
207.182.135.52:8080
165.16.109.50:8080
202.142.178.98:8080
103.123.246.66:8080
185.36.157.30:8080
103.104.213.227:8080
68.188.63.149:8080
136.244.113.206:3128
54.39.91.84:3128
198.13.36.75:3128
93.153.173.102:3128
161.35.110.112:3128

... and so on.
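
For other, simpler table-based lists the same idea carries over. A minimal sketch, assuming free-proxy-list.net serves a plain HTML table whose first two cells per row are the IP and port (the site and its layout are assumptions here, not something the answer above verifies):

import requests
from bs4 import BeautifulSoup

# Assumed layout: each table row's first two <td> cells hold IP and port.
url = 'https://free-proxy-list.net/'
headers = {'User-Agent': 'Mozilla/5.0'}

soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
for row in soup.select('table tbody tr'):
    cells = row.find_all('td')
    if len(cells) >= 2:
        print(cells[0].text + ':' + cells[1].text)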


Can you parse the proxy sites with beautifulsoup?
@AndrejKesely I've never used beautifulsoup before, so I don't know whether that's possible, but I'll take a look, thanks.
Thanks for your answer, it runs smoothly. Is it possible to change the number of proxies shown in the "show" dropdown with BeautifulSoup?
@Unicyclist I've updated my answer; it now shows 500 proxies per port.
Thank you very much! So xpp corresponds to the option in that selector, if I understand correctly?
@Unicyclist Yes, it's POST data sent to the server. You can watch what is sent to the server by opening Firefox Developer Tools -> Network tab (Chrome has a similar feature).
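
Following up on the xpp exchange above, a minimal sketch of replaying that POST for a single port page, assuming (as the comments state) that xpp is the form field behind the "show" dropdown and that 5 selects the largest page size; the 8080 port path is just an example:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}

# xpp mirrors the page-size dropdown; per the comments above, 5 requests
# the largest page (up to 500 rows). The port path 8080 is an example.
resp = requests.post('http://spys.one/proxy-port/8080/', headers=headers, data={'xpp': 5})
soup = BeautifulSoup(resp.content, 'html.parser')
rows = soup.select('tr > td:nth-child(1) > font.spy14')
print(len(rows), 'proxies returned')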