Python: how do I skip duplicate links?
How can I make the test section below avoid re-checking duplicate links? I tried to do this myself but could not, and the script runs slowly.
import re

import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.google.com')
html = r.text
soup = BeautifulSoup(html, 'lxml')
links = soup.find_all('a', attrs={'href': re.compile(r'^https?://')})

for i in links:
    href = i['href']
    # Test Section
    req = requests.get(href)
    resp = req.status_code
    if resp in [400, 403, 404, 408, 409, 501, 502, 503]:
        print(str(resp) + '=' + req.reason + '===>' + href)
        with open('Document_ERROR.txt', 'a') as arq:
            arq.write(href)
            arq.write('\n')
            arq.write(req.reason)
    else:
        print('Response is {} ===> {}'.format(resp, href))
        with open('Document_OK.txt', 'a') as arq:
            arq.write(href)
            arq.write('\n')
If I understand correctly, you want to skip the test section for links you have already tested. You can keep a set called seen_links that holds every link tested so far:
import re

import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.google.com')
soup = BeautifulSoup(r.content, 'lxml')
links = soup.find_all('a', attrs={'href': re.compile(r'^https?://')})

seen_links = set()  # <-- set that will hold all seen links so far

for i in links:
    href = i['href']
    # have we seen the link before?
    if href in seen_links:
        continue  # yes, skip it
    # no, add it to seen_links
    seen_links.add(href)
    req = requests.get(href)
    resp = req.status_code
    if resp in [400, 403, 404, 408, 409, 501, 502, 503]:
        print(str(resp) + '=' + req.reason + '===>' + href)
        with open('Document_ERROR.txt', 'a') as arq:
            print(href, file=arq)
            print(req.reason, file=arq)
    else:
        print('Response is {} ===> {}'.format(resp, href))
        with open('Document_OK.txt', 'a') as arq:
            print(href, file=arq)
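One caveat: a plain set only catches byte-for-byte duplicates, so the same page reached as http://Example.com/page and http://example.com/page#top would still be fetched twice. If that matters for your pages, you could normalize each href before adding it to seen_links. A minimal sketch using only the standard library (the example URL is made up):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(href):
    # Drop the fragment and lowercase the scheme and host, so that
    # 'http://Example.com/page#top' and 'http://example.com/page'
    # count as the same link.
    parts = urlsplit(href)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path, parts.query, ''))

print(normalize('http://Example.com/page#top'))
# → http://example.com/page
```

You would then use `if normalize(href) in seen_links` and `seen_links.add(normalize(href))` in the loop above.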
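If you prefer to deduplicate the links up front instead of checking inside the loop, dict.fromkeys preserves first-seen order (guaranteed since Python 3.7). A minimal sketch with made-up URLs:

```python
hrefs = [
    'https://example.com/a',
    'https://example.com/b',
    'https://example.com/a',  # duplicate of the first entry
]

# dict keys are unique and keep insertion order, so each link keeps
# its first occurrence and later repeats are dropped.
unique_hrefs = list(dict.fromkeys(hrefs))
print(unique_hrefs)  # → ['https://example.com/a', 'https://example.com/b']
```

You could apply this to [i['href'] for i in links] before the for loop; the set-based version above is equivalent but also works when links arrive one at a time.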