Web scraping: unable to get article content with BeautifulSoup in Python 3.7
I am doing web scraping with BeautifulSoup in Python 3.7. The code below successfully scrapes the date, title, and tags, but not the article content. Instead, it returns None.
import time
import requests
from bs4 import BeautifulSoup
from bs4.element import Tag

url = 'https://www.thehindu.com/search/?q=cybersecurity&order=DESC&sort=publishdate&ct=text&page={}'
pages = 32
for page in range(4, pages+1):
    res = requests.get(url.format(page))
    soup = BeautifulSoup(res.text,"lxml")
    for item in soup.find_all("a", {"class": "story-card75x1-text"}, href=True):
        _href = item.get("href")
        try:
            resp = requests.get(_href)
        except Exception as e:
            try:
                resp = requests.get("https://www.thehindu.com"+_href)
            except Exception as e:
                continue
        dateTag = soup.find("span", {"class": "dateline"})
        sauce = BeautifulSoup(resp.text,"lxml")
        tag = sauce.find("a", {"class": "section-name"})
        titleTag = sauce.find("h1", {"class": "title"})
        contentTag = sauce.find("div", {"class": "_yeti_done"})
        date = None
        tagName = None
        title = None
        content = None
        if isinstance(dateTag,Tag):
            date = dateTag.get_text().strip()
        if isinstance(tag,Tag):
            tagName = tag.get_text().strip()
        if isinstance(titleTag,Tag):
            title = titleTag.get_text().strip()
        if isinstance(contentTag,Tag):
            content = contentTag.get_text().strip()
        print(f'{date}\n {tagName}\n {title}\n {content}\n')
        time.sleep(3)
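I can't see where the problem is, since I believe I'm using the correct class in contentTag. Thanks.
I guess the links you want to follow from the search page to its inner pages are the ones ending in .ece. I've applied that logic in the script to traverse those target pages and fetch data from them. I've also defined the content selector slightly differently. It seems to work correctly now. The script below only scrapes data from page 1. Feel free to change it to suit your requirements.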
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'https://www.thehindu.com/search/?q=cybersecurity&order=DESC&sort=publishdate&ct=text&page=1'
base = "https://www.thehindu.com"
res = requests.get(url)
soup = BeautifulSoup(res.text,"lxml")
for item in soup.select(".story-card-news a[href$='.ece']"):
    resp = requests.get(urljoin(base,item.get("href")))
    sauce = BeautifulSoup(resp.text,"lxml")
    title = item.get_text(strip=True)
    content = ' '.join([item.get_text(strip=True) for item in sauce.select("[id^='content-body-'] p")])
    print(f'{title}\n {content}\n')
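The two ideas in the script above (attribute selectors such as a[href$='.ece'] and [id^='content-body-'], plus urljoin for relative links) can be tried offline on a small hand-written fragment. This is only a sketch: the markup, the id value, and the helper names article_links and article_text are made up for illustration and are not copied from thehindu.com. It uses the built-in "html.parser" so that lxml is not required.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real pages may differ.
SEARCH_HTML = """
<div class="story-card-news">
  <a href="/sci-tech/technology/sample-story/article123.ece">Sample story</a>
  <a href="https://www.thehindu.com/news/other-story/article456.ece">Other story</a>
  <a href="/topic/cybersecurity/">Topic page</a>
</div>
"""
ARTICLE_HTML = """
<div id="content-body-14269002-123">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
<div id="sidebar"><p>Unrelated text.</p></div>
"""

BASE = "https://www.thehindu.com"


def article_links(html, base=BASE):
    """Collect absolute URLs of links whose href ends with '.ece'."""
    soup = BeautifulSoup(html, "html.parser")
    anchors = soup.select(".story-card-news a[href$='.ece']")
    # urljoin leaves absolute hrefs untouched and resolves relative ones.
    return [urljoin(base, a["href"]) for a in anchors]


def article_text(html):
    """Join the paragraphs inside any element whose id starts with 'content-body-'."""
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = soup.select("[id^='content-body-'] p")
    return " ".join(p.get_text(strip=True) for p in paragraphs)


links = article_links(SEARCH_HTML)
text = article_text(ARTICLE_HTML)
print(links)
print(text)
```

Because urljoin handles both relative and absolute hrefs, the nested try/except around requests.get in the question is no longer needed.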
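Recheck the page source that your BeautifulSoup is actually reading. The page source may not even contain the tag you are looking for.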
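One way to do that check is to ask which of the classes you rely on are absent from the HTML that requests actually returned. The sketch below is only an illustration: find_missing_classes is a hypothetical helper, and FETCHED_HTML is made-up markup standing in for resp.text. A class that shows up in the browser inspector but not here may have been added by JavaScript after the page loaded; whether that applies to _yeti_done is an assumption, not something confirmed above.

```python
from bs4 import BeautifulSoup


def find_missing_classes(html, class_names):
    """Return the class names that no element in the given HTML carries."""
    soup = BeautifulSoup(html, "html.parser")
    return [name for name in class_names if soup.find(class_=name) is None]


# Hypothetical HTML standing in for resp.text from requests.get(...).
FETCHED_HTML = """
<h1 class="title">Sample headline</h1>
<a class="section-name" href="/sci-tech/">Sci-Tech</a>
<div id="content-body-1"><p>Body text.</p></div>
"""

missing = find_missing_classes(
    FETCHED_HTML, ["title", "section-name", "_yeti_done"]
)
print(missing)
```

In real use you would pass resp.text in place of FETCHED_HTML; any name reported as missing is one that a selector based on that class cannot match.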