Python: How do I identify which HTML tag or class to specify when scraping a webpage?

Tags: python, html, web-scraping, beautifulsoup

I want to scrape the news links from a website (highlighted in the screenshot below):

When I inspect the page, I can see that the links I want are contained in the class col-sm-5, under the tag h5. I want to scrape all 4 links (with the tag li) inside that div class, col-sm-5. So I wrote the following code to extract the links:

import requests
from bs4 import BeautifulSoup

page = requests.get("http://www3.asiainsurancereview.com/News","html.parser")
soup = BeautifulSoup(page.text, "html.parser")
li_box = soup.find('h5', attrs={'class': 'col_sm_5'})
print(li_box)
But the output I get is None; I guess it cannot find the tag. So my question is: how do I identify the class, tag, or other information I need to specify in order to find and extract the links?

requests.get() doesn't take "html.parser"; that argument is for BeautifulSoup.

Also, the class name is col-sm-5, not col_sm_5.
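Putting both of those fixes together, a minimal corrected sketch of the question's snippet (the URL and the div class come from this thread) might look like:

import requests
from bs4 import BeautifulSoup

# "html.parser" belongs to BeautifulSoup, not to requests.get()
page = requests.get("http://www3.asiainsurancereview.com/News")
soup = BeautifulSoup(page.text, "html.parser")

# the class uses hyphens, and it sits on a div rather than an h5
li_box = soup.find('div', attrs={'class': 'col-sm-5'})
print(li_box)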

It is better to use the response's content instead of text. (This may not actually be true; see the comments.)

You can use a CSS selector, like this:

import requests
from bs4 import BeautifulSoup

page = requests.get("http://www3.asiainsurancereview.com/News")
soup = BeautifulSoup(page.content, "html.parser")
li_box = soup.select('div.col-sm-5 > ul > li > h5 > a')
for link in li_box:
    print(link['href'])
Output:

/Mock-News-Article/id/42945/Type/eDaily/New-Zealand-Govt-starts-public-consultation-phase-of-review-of-insurance-law
/Mock-News-Article/id/42946/Type/eDaily/India-M-A-deals-brewing-in-insurance-sector
/Mock-News-Article/id/42947/Type/eDaily/China-Online-insurance-premiums-soar-31-in-1Q2018
/Mock-News-Article/id/42948/Type/eDaily/South-Korea-Courts-increasingly-see-65-as-retirement-age
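The hrefs printed above are relative to the site root. If absolute URLs are needed, urllib.parse.urljoin from the standard library can resolve them against the page URL; a small sketch building on the answer's selector:

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base = "http://www3.asiainsurancereview.com/News"
soup = BeautifulSoup(requests.get(base).content, "html.parser")
for link in soup.select('div.col-sm-5 > ul > li > h5 > a'):
    # urljoin resolves the site-relative href against the page URL
    print(urljoin(base, link['href']))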

You are trying to access an element that does not exist in the page's HTML.

li_box = soup.find('h5', attrs={'class': 'col_sm_5'})
In this line you are trying to get an h5 tag with the class 'col_sm_5', which does not exist in the page's HTML. Only a div with the class 'col-sm-5' exists there.

Now for the solution. The easiest way is to use BeautifulSoup's select():

>>> import requests
>>> from bs4 import BeautifulSoup
>>> page = requests.get("http://www3.asiainsurancereview.com/News")
>>> soup = BeautifulSoup(page.content, "html.parser")
>>> aa = soup.select("div.col-sm-5 ul.list-default li h5 a")
>>> for a in aa:
...     print(a.attrs['href'])
...
/Mock-News-Article/id/42945/Type/eDaily/New-Zealand-Govt-starts-public-consultation-phase-of-review-of-insurance-law
/Mock-News-Article/id/42946/Type/eDaily/India-M-A-deals-brewing-in-insurance-sector
/Mock-News-Article/id/42947/Type/eDaily/China-Online-insurance-premiums-soar-31-in-1Q2018
/Mock-News-Article/id/42948/Type/eDaily/South-Korea-Courts-increasingly-see-65-as-retirement-age
>>>
soup.select will find all the a tags inside an h5, inside an li, inside the div with the class col-sm-5.

Then iterate over all the elements and pick out the attribute you need, href in your case.

The h5 tag does not have any class on it.
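One way to check this kind of thing (besides the browser's inspector) is to print the candidate tags and their class attributes straight from the soup; a quick sketch, assuming the same page:

import requests
from bs4 import BeautifulSoup

page = requests.get("http://www3.asiainsurancereview.com/News")
soup = BeautifulSoup(page.text, "html.parser")

# print each h5's class attribute (None here, since these h5 tags carry no class)
for h5 in soup.find_all('h5'):
    print(h5.get('class'), h5.get_text(strip=True))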

Try this:

select_div = soup.findAll('div', {'class': 'col-sm-5'})
result = []
for each_div in select_div:
    links = each_div.findAll('a')
    for each_tag in links:
        link = each_tag.attrs['href']
        result.append(str(link))

print(result)
The output will be a list of URLs:

['/Mock-News-Article/id/42945/Type/eDaily/New-Zealand-Govt-starts-public-consultation-phase-of-review-of-insurance-law', '/Mock-News-Article/id/42946/Type/eDaily/India-M-A-deals-brewing-in-insurance-sector', '/Mock-News-Article/id/42947/Type/eDaily/China-Online-insurance-premiums-soar-31-in-1Q2018', '/Mock-News-Article/id/42948/Type/eDaily/South-Korea-Courts-increasingly-see-65-as-retirement-age']
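For what it's worth, findAll is the older alias of find_all, and the same loop can be written more compactly with a comprehension; a sketch of that variant:

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("http://www3.asiainsurancereview.com/News").text,
                     "html.parser")
# find_all is the PEP 8-style name for findAll; both work in bs4
result = [a['href']
          for div in soup.find_all('div', {'class': 'col-sm-5'})
          for a in div.find_all('a')]
print(result)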

"It is better to use the response's content instead of text" - why? That is not the case; have a look at the question and its answers. Use .text when parsing HTML, and use .content when downloading a file or the like (i.e. when you need binary data). So it is actually better to use text here.