
Python: how do I skip duplicate lines?

python, python-3.x, regex, beautifulsoup

How can I make the section shown below avoid testing duplicate links? I tried to do it myself, but I couldn't get it to work, and the script runs very slowly.

import re
from bs4 import BeautifulSoup
import requests
import urllib.request

r = requests.get('http://www.google.com')
html = r.text
soup = BeautifulSoup(html, 'lxml')
links = soup.find_all('a', attrs={'href': re.compile(r'^https?://')})

for i in links:
    href = i['href']

    # Test section
    req = requests.get(href)
    resp = req.status_code
    if resp in [400, 403, 404, 408, 409, 501, 502, 503]:
        print('{}={} ===> {}'.format(resp, req.reason, href))
        with open('Document_ERROR.txt', 'a') as arq:
            arq.write(href)
            arq.write('\n')
            arq.write(req.reason)
    else:
        print('Response is {} ===> {}'.format(resp, href))
        with open('Document_OK.txt', 'a') as arq:
            arq.write(href)
            arq.write('\n')

If I understand correctly, you want to skip testing links that you have already tested.

You can keep a set called seen_links that holds all the links tested so far:

import re
from bs4 import BeautifulSoup
import requests
import urllib.request


r = requests.get('http://www.google.com')
soup = BeautifulSoup(r.content, 'lxml')
links = soup.find_all('a',attrs={'href': re.compile( r'^https?://' )})


seen_links = set()  # <-- set that will hold all seen links so far

for i in links :
    href = i['href']

    # have we seen the link before?
    if href in seen_links:
        continue    # yes, continue the loop

    # no, add it to seen_links
    seen_links.add(href)

    req = requests.get(href)
    resp = req.status_code
    if resp in [400, 403, 404, 408, 409, 501, 502, 503]:
        print('{}={} ===> {}'.format(resp, req.reason, href))
        with open( 'Document_ERROR.txt' , 'a' ) as arq :
            print(href, file=arq)
            print(resp.reason, file=arq)
    else :
        print('Response is {} ===> {}'.format(resp, href))
        with open( 'Document_OK.txt' , 'a' ) as arq :
            print(href, file=arq)
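
Since the question also mentions that the script is very slow, here is a minimal sketch of one possible speed-up on top of the seen_links idea: reusing a single requests.Session so TCP connections are kept alive, and passing a timeout so one unresponsive host cannot stall the whole loop. The Session, the timeout value, and the try/except around the request are assumptions added for illustration, not part of the answer above.

# A minimal sketch, assuming a requests.Session and a timeout are acceptable.
import re
from bs4 import BeautifulSoup
import requests

ERROR_CODES = {400, 403, 404, 408, 409, 501, 502, 503}

session = requests.Session()              # reused for every request
r = session.get('http://www.google.com')
soup = BeautifulSoup(r.content, 'lxml')
links = soup.find_all('a', attrs={'href': re.compile(r'^https?://')})

seen_links = set()
for a in links:
    href = a['href']
    if href in seen_links:                # skip links already tested
        continue
    seen_links.add(href)

    try:
        req = session.get(href, timeout=10)   # timeout value is an assumption
    except requests.RequestException as exc:  # DNS failures, timeouts, connection errors
        with open('Document_ERROR.txt', 'a') as arq:
            print(href, file=arq)
            print(str(exc), file=arq)
        continue

    resp = req.status_code
    if resp in ERROR_CODES:
        print('{}={} ===> {}'.format(resp, req.reason, href))
        with open('Document_ERROR.txt', 'a') as arq:
            print(href, file=arq)
            print(req.reason, file=arq)
    else:
        print('Response is {} ===> {}'.format(resp, href))
        with open('Document_OK.txt', 'a') as arq:
            print(href, file=arq)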