
Python: how do I skip duplicate lines?

python, python-3.x, regex, beautifulsoup

How can I make the section shown below avoid testing duplicate links? I tried to do it myself, but I couldn't get it to work, and the script runs very slowly.

import re
from bs4 import BeautifulSoup
import requests
import urllib.request

r = requests.get('http://www.google.com')
html = r.text
soup = BeautifulSoup(html, 'lxml')
links = soup.find_all('a', attrs={'href': re.compile(r'^https?://')})

for i in links:
    href = i['href']

    # Test section
    req = requests.get(href)
    resp = req.status_code
    if resp in [400, 403, 404, 408, 409, 501, 502, 503]:
        print('{}={} ===> {}'.format(resp, req.reason, href))
        with open('Document_ERROR.txt', 'a') as arq:
            arq.write(href)
            arq.write('\n')
            arq.write(req.reason)
    else:
        print('Response is {} ===> {}'.format(resp, href))
        with open('Document_OK.txt', 'a') as arq:
            arq.write(href)
            arq.write('\n')

If I understand correctly, you want to skip testing links that you have already tested.

You can keep a set called seen_links that holds all the links tested so far:

import re
from bs4 import BeautifulSoup
import requests
import urllib.request


r = requests.get('http://www.google.com')
soup = BeautifulSoup(r.content, 'lxml')
links = soup.find_all('a',attrs={'href': re.compile( r'^https?://' )})


seen_links = set()  # <-- set that will hold all seen links so far

for i in links :
    href = i['href']

    # have we seen the link before?
    if href in seen_links:
        continue    # yes, continue the loop

    # no, add it to seen_links
    seen_links.add(href)

    req = requests.get(href)
    resp = req.status_code
    if resp in [400, 403, 404, 408, 409, 501, 502, 503]:
        print('{}={} ===> {}'.format(resp, req.reason, href))
        with open( 'Document_ERROR.txt' , 'a' ) as arq :
            print(href, file=arq)
            print(resp.reason, file=arq)
    else :
        print('Response is {} ===> {}'.format(resp, href))
        with open( 'Document_OK.txt' , 'a' ) as arq :
            print(href, file=arq)
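
Since the question also mentions that the script is very slow, here is a minimal sketch of one possible speed-up on top of the seen_links idea: reusing a single requests.Session so TCP connections are kept alive, and passing a timeout so one unresponsive host cannot stall the whole loop. The Session, the timeout value, and the try/except around the request are assumptions added for illustration, not part of the answer above.

# A minimal sketch, assuming a requests.Session and a timeout are acceptable.
import re
from bs4 import BeautifulSoup
import requests

ERROR_CODES = {400, 403, 404, 408, 409, 501, 502, 503}

session = requests.Session()              # reused for every request
r = session.get('http://www.google.com')
soup = BeautifulSoup(r.content, 'lxml')
links = soup.find_all('a', attrs={'href': re.compile(r'^https?://')})

seen_links = set()
for a in links:
    href = a['href']
    if href in seen_links:                # skip links already tested
        continue
    seen_links.add(href)

    try:
        req = session.get(href, timeout=10)   # timeout value is an assumption
    except requests.RequestException as exc:  # DNS failures, timeouts, connection errors
        with open('Document_ERROR.txt', 'a') as arq:
            print(href, file=arq)
            print(str(exc), file=arq)
        continue

    resp = req.status_code
    if resp in ERROR_CODES:
        print('{}={} ===> {}'.format(resp, req.reason, href))
        with open('Document_ERROR.txt', 'a') as arq:
            print(href, file=arq)
            print(req.reason, file=arq)
    else:
        print('Response is {} ===> {}'.format(resp, href))
        with open('Document_OK.txt', 'a') as arq:
            print(href, file=arq)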