Python: How do I identify which HTML tag or class to specify when scraping a webpage?

Tags: python, html, web-scraping, beautifulsoup

I want to scrape the news links from a website (highlighted in the screenshot below):

When I inspect the page, I can see that the links I want are contained in the class col-sm-5, under the tag h5. I want to scrape all 4 links (with the tag li) inside that div class, col-sm-5. So I wrote the following code to extract the links:

import requests
from bs4 import BeautifulSoup

page = requests.get("http://www3.asiainsurancereview.com/News","html.parser")
soup = BeautifulSoup(page.text, "html.parser")
li_box = soup.find('h5', attrs={'class': 'col_sm_5'})
print(li_box)
But the output I get is None; I guess it cannot find the tag. So my question is: how do I identify the class, tag, or other information I need to specify in order to find and extract the links?

requests.get() doesn't take "html.parser"; that argument is for BeautifulSoup.

Also, the class name is col-sm-5, not col_sm_5.
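Putting both of those fixes together, a minimal corrected sketch of the question's snippet (the URL and the div class come from this thread) might look like:

import requests
from bs4 import BeautifulSoup

# "html.parser" belongs to BeautifulSoup, not to requests.get()
page = requests.get("http://www3.asiainsurancereview.com/News")
soup = BeautifulSoup(page.text, "html.parser")

# the class uses hyphens, and it sits on a div rather than an h5
li_box = soup.find('div', attrs={'class': 'col-sm-5'})
print(li_box)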

It is better to use the response's content instead of text. (This may not actually be true; see the comments.)

You can use a CSS selector, like this:

import requests
from bs4 import BeautifulSoup

page = requests.get("http://www3.asiainsurancereview.com/News")
soup = BeautifulSoup(page.content, "html.parser")
li_box = soup.select('div.col-sm-5 > ul > li > h5 > a')
for link in li_box:
    print(link['href'])
Output:

/Mock-News-Article/id/42945/Type/eDaily/New-Zealand-Govt-starts-public-consultation-phase-of-review-of-insurance-law
/Mock-News-Article/id/42946/Type/eDaily/India-M-A-deals-brewing-in-insurance-sector
/Mock-News-Article/id/42947/Type/eDaily/China-Online-insurance-premiums-soar-31-in-1Q2018
/Mock-News-Article/id/42948/Type/eDaily/South-Korea-Courts-increasingly-see-65-as-retirement-age
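The hrefs printed above are relative to the site root. If absolute URLs are needed, urllib.parse.urljoin from the standard library can resolve them against the page URL; a small sketch building on the answer's selector:

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base = "http://www3.asiainsurancereview.com/News"
soup = BeautifulSoup(requests.get(base).content, "html.parser")
for link in soup.select('div.col-sm-5 > ul > li > h5 > a'):
    # urljoin resolves the site-relative href against the page URL
    print(urljoin(base, link['href']))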

You are trying to access an element that does not exist in the page's HTML.

li_box = soup.find('h5', attrs={'class': 'col_sm_5'})
In this line you are trying to get an h5 tag with the class 'col_sm_5', which does not exist in the page's HTML. Only a div with the class 'col-sm-5' exists there.

Now for the solution. The easiest way is to use BeautifulSoup's select():

>>> import requests
>>> from bs4 import BeautifulSoup
>>> page = requests.get("http://www3.asiainsurancereview.com/News")
>>> soup = BeautifulSoup(page.content, "html.parser")
>>> aa = soup.select("div.col-sm-5 ul.list-default li h5 a")
>>> for a in aa:
...     print(a.attrs['href'])
...
/Mock-News-Article/id/42945/Type/eDaily/New-Zealand-Govt-starts-public-consultation-phase-of-review-of-insurance-law
/Mock-News-Article/id/42946/Type/eDaily/India-M-A-deals-brewing-in-insurance-sector
/Mock-News-Article/id/42947/Type/eDaily/China-Online-insurance-premiums-soar-31-in-1Q2018
/Mock-News-Article/id/42948/Type/eDaily/South-Korea-Courts-increasingly-see-65-as-retirement-age
>>>
soup.select will find all the a tags inside an h5, inside an li, inside the div with the class col-sm-5.

Then iterate over all the elements and pick out the attribute you need, href in your case.

The h5 tag does not have any class on it.
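One way to check this kind of thing (besides the browser's inspector) is to print the candidate tags and their class attributes straight from the soup; a quick sketch, assuming the same page:

import requests
from bs4 import BeautifulSoup

page = requests.get("http://www3.asiainsurancereview.com/News")
soup = BeautifulSoup(page.text, "html.parser")

# print each h5's class attribute (None here, since these h5 tags carry no class)
for h5 in soup.find_all('h5'):
    print(h5.get('class'), h5.get_text(strip=True))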

Try this:

select_div = soup.findAll('div', {'class': 'col-sm-5'})
result = []
for each_div in select_div:
    links = each_div.findAll('a')
    for each_tag in links:
        link = each_tag.attrs['href']
        result.append(str(link))

print(result)
The output will be a list of URLs:

['/Mock-News-Article/id/42945/Type/eDaily/New-Zealand-Govt-starts-public-consultation-phase-of-review-of-insurance-law', '/Mock-News-Article/id/42946/Type/eDaily/India-M-A-deals-brewing-in-insurance-sector', '/Mock-News-Article/id/42947/Type/eDaily/China-Online-insurance-premiums-soar-31-in-1Q2018', '/Mock-News-Article/id/42948/Type/eDaily/South-Korea-Courts-increasingly-see-65-as-retirement-age']
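For what it's worth, findAll is the older alias of find_all, and the same loop can be written more compactly with a comprehension; a sketch of that variant:

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("http://www3.asiainsurancereview.com/News").text,
                     "html.parser")
# find_all is the PEP 8-style name for findAll; both work in bs4
result = [a['href']
          for div in soup.find_all('div', {'class': 'col-sm-5'})
          for a in div.find_all('a')]
print(result)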

"It is better to use the response's content instead of text" - why? That is not the case; have a look at the question and its answers. Use .text when parsing HTML, and use .content when downloading a file or the like (i.e. when you need binary data). So it is actually better to use text here.