Removing duplicate URLs in Python, including URLs that end in a forward slash
The program below gives me output that includes URLs both with and without a trailing forward slash (for example, ask.census.gov and ask.census.gov/). I need to eliminate one of each such pair. Thanks in advance for your help.
from bs4 import BeautifulSoup as mySoup
from urllib.parse import urljoin as myJoin
from urllib.request import urlopen as myRequest

my_url = "https://www.census.gov/programs-surveys/popest.html"

# call on packages
html_page = myRequest(my_url)
raw_html = html_page.read()
html_page.close()
page_soup = mySoup(raw_html, "html.parser")

f = open("censusTest.csv", "w")
hyperlinks = page_soup.findAll('a')
set_urls = set()

for checked in hyperlinks:
    found_link = checked.get("href")
    result_set = myJoin(my_url, found_link)
    if result_set and result_set not in set_urls:
        set_urls.add(result_set)
        f.write(str(result_set) + "\n")

f.close()
This code checks whether the last character of the string is "/" and, if it is, removes it.
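The snippet this answer refers to is not reproduced here. A minimal sketch of such a check might look like the following; the helper name strip_trailing_slash is hypothetical and is not taken from the original answer.

```python
def strip_trailing_slash(url):
    """Return url without one trailing "/" (hypothetical helper, for illustration)."""
    # Only the last character is inspected; a URL without a trailing slash is unchanged.
    if url.endswith("/"):
        return url[:-1]
    return url


with_slash = strip_trailing_slash("https://ask.census.gov/")
without_slash = strip_trailing_slash("https://ask.census.gov")
```

Both calls should produce the same string, so the two variants collapse into one entry when added to a set.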
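A good example of Python string manipulation:
You can always call rstrip("/"): if the URL ends with a slash it is removed, and if it does not, nothing changes: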
result_set = myJoin(my_url, found_link).rstrip("/")
After this code runs, will my_url end up equal to "/"?
alecxe is right, I have fixed my mistake. Thank you very much.
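To see the rstrip("/") approach work without fetching the page, here is a small offline sketch. The function name unique_links and the sample href list are assumptions made for illustration; only urljoin and str.rstrip come from the thread. Unlike the question's loop, this sketch skips anchors that have no href instead of resolving them to the base URL.

```python
from urllib.parse import urljoin


def unique_links(base_url, hrefs):
    """Resolve hrefs against base_url and drop duplicates that differ
    only by trailing slashes (hypothetical helper, for illustration)."""
    seen = set()
    ordered = []
    for href in hrefs:
        if not href:
            # <a> tags without an href yield None; skip them here.
            continue
        # rstrip("/") removes every trailing "/" and leaves other URLs untouched.
        normalized = urljoin(base_url, href).rstrip("/")
        if normalized and normalized not in seen:
            seen.add(normalized)
            ordered.append(normalized)
    return ordered


# Made-up sample data standing in for the scraped href values.
sample_hrefs = ["https://ask.census.gov", "https://ask.census.gov/", None, "/data.html"]
links = unique_links("https://www.census.gov/programs-surveys/popest.html", sample_hrefs)
print(links)
```

With this sample input, the two ask.census.gov variants should collapse into a single entry, and the relative link is resolved against the base URL.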