Python: How do I get URLs from a parent page based on a date?

python html web-scraping


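I'm new to Python. I have a web page that contains a list of links to other pages. I need to get each href based on the date in the span style tag, then open that URL so I can pull data from it. I already have a scraper for the child pages.

How do you read the page, find the dates, and store the links in a dictionary? I can get the date on one line and the list of links on another.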

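import re
from urllib.request import urlopen

from bs4 import BeautifulSoup

url = "https://www.visitmonmouth.com/page.aspx?Id=5017"
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')

tags = soup('a')
title = soup.title
print(title)
# Get all HTML links.
for daily in tags:
    print(daily.get('href', None))
# Find the text nodes containing the date (this does not depend on the loop above).
c_date = soup.find_all(string=re.compile('7/18/20:'))
print(c_date)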

Try this:

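from pprint import pprint

import bs4
import requests

resp = requests.get("https://www.visitmonmouth.com/page.aspx?Id=5017")
assert resp.status_code == 200
soup = bs4.BeautifulSoup(resp.content, "html.parser")
dates = []
links = []
for tag in soup.find('div', {'id': 'content'}).find_all('li'):
    # The first child of each <li> is expected to start with the date, e.g. "7/18/20:".
    date = str(tag.contents[0]).strip().replace(':', '').split(' ')[0]
    if date.count('/') == 2:  # A regexp should be used here.
        a = tag.find('a')
        if a is not None:
            href = a.attrs['href'].strip()
            if href.startswith('http'):  # A regexp should be used here too.
                print(date, href)
                dates.append(date)
                links.append(href)
my_dict = dict(zip(dates, links))
pprint(my_dict, indent=2)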

The output will look like this:

7/18/20 https://www.co.monmouth.nj.us/PressDetail.aspx?ID=2962
7/17/20 https://www.co.monmouth.nj.us/PressDetail.aspx?ID=2960
7/16/20 https://www.co.monmouth.nj.us/PressDetail.aspx?ID=2959
7/15/20 https://www.co.monmouth.nj.us/PressDetail.aspx?ID=2958
7/14/20 https://www.co.monmouth.nj.us/PressDetail.aspx?ID=2956
7/13/20 https://www.co.monmouth.nj.us/PressDetail.aspx?ID=2955
7/12/20 https://www.co.monmouth.nj.us/PressDetail.aspx?ID=2954

The data will be available in a dictionary named my_dict.
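Once the dictionary exists, picking the link for a given day is a plain lookup. The sketch below is one possible way to do that with real date objects instead of raw strings. The helper name `url_for` is hypothetical, and the two-entry `my_dict` is only a sample copied from the output above, standing in for the dictionary built by the scraper.

```python
from datetime import date, datetime

# Sample copied from the printed output; in practice use the my_dict built by the scraper.
my_dict = {
    "7/18/20": "https://www.co.monmouth.nj.us/PressDetail.aspx?ID=2962",
    "7/17/20": "https://www.co.monmouth.nj.us/PressDetail.aspx?ID=2960",
}


def url_for(day, mapping):
    """Return the link stored for `day` (a datetime.date), or None if there is none."""
    # Convert keys such as "7/18/20" into date objects so the lookup
    # does not depend on how the string happens to be written.
    by_date = {
        datetime.strptime(key, "%m/%d/%y").date(): link
        for key, link in mapping.items()
    }
    return by_date.get(day)


print(url_for(date(2020, 7, 18), my_dict))
```

The returned URL can then be passed to `requests.get` and on to the existing child-page scraper. A `None` result means no entry was found for that day.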

Thank you very much! This is very helpful. I will look into regular expressions as you suggested.

I've updated my answer with the dictionary. I'm glad you found it useful!
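For the regular expressions mentioned in the code comments, a minimal sketch might look like the following. The names `DATE_RE`, `URL_RE`, `extract_date`, and `is_absolute_url` are illustrative, not part of the original answer, and the date pattern assumes each list item's text begins with a date in M/D/YY form, as in the output shown.

```python
import re

# Assumed format: the text starts with month/day/year, e.g. "7/18/20:".
DATE_RE = re.compile(r"^\s*(\d{1,2}/\d{1,2}/\d{2,4})\b")
# Absolute links only: http:// or https://.
URL_RE = re.compile(r"^https?://", re.IGNORECASE)


def extract_date(text):
    """Return the leading date string from `text`, or None if it has none."""
    match = DATE_RE.match(text)
    return match.group(1) if match else None


def is_absolute_url(href):
    """Return True if `href` starts with http:// or https://."""
    return bool(URL_RE.match(href.strip()))


print(extract_date("7/18/20: Daily update"))
print(is_absolute_url("/page.aspx?Id=5017"))
```

In the loop from the answer, `extract_date` could stand in for the `strip`/`replace`/`split` chain together with the `date.count('/') == 2` check, and `is_absolute_url` for `href.startswith('http')`. Applying it to `tag.get_text()` rather than `tag.contents[0]` may be more robust if the date is wrapped in a span, though that depends on the page's actual markup.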