Python: scraping a web page with BeautifulSoup4

python web-scraping

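I am trying to print the content of a news article using BeautifulSoup4.

The URL is: http://www.thehindu.com/news/national/People-showing-monumental-patience-queuing-up-for-a-better-India-says-Venkaiah/article16447029.ece

My current code is shown below, and it gives the desired output: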

import requests
from bs4 import BeautifulSoup

page = requests.get('http://www.thehindu.com/news/national/People-showing-monumental-patience-queuing-up-for-a-better-India-says-Venkaiah/article16447029.ece')
soup = BeautifulSoup(page.content, 'html.parser')

article_text = ""
# The id below is specific to this one article
table = soup.find_all("div", {"id": "content-body-14266949-16447029"})

for element in table:
    # Join every text node found inside the matched div
    article_text += ''.join(element.find_all(text=True)) + "\n\n"

print(article_text)

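However, I want to scrape multiple pages, and each page has a different content-body number of the form xxxxxxxx-xxxxxxxx (two blocks of 8 digits).

I tried replacing the soup.find_all call with a regex, as follows:

table = soup.find_all(text=re.compile("content-body-..."))

But this produces an error:

AttributeError: 'NavigableString' object has no attribute 'find_all'

Can someone tell me what I need to do?

Thanks.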
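For context on the error above: the text argument makes find_all search the text nodes of the document rather than tag attributes, so the results are NavigableString objects, which have no find_all method. A minimal sketch of the difference, assuming a made-up HTML snippet (the markup and the id value are illustrative, not taken from the real page):

```python
import re
from bs4 import BeautifulSoup, NavigableString, Tag

# Hypothetical markup standing in for an article page.
html = """
<div id="content-body-12345678-87654321"><p>First paragraph.</p></div>
<p>See content-body-12345678-87654321 for details.</p>
"""
soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"content-body-")

# string= is the current name of the text= argument; it matches text nodes only.
text_hits = soup.find_all(string=pattern)

# Matching on the id attribute returns Tag objects, which do support find_all.
tag_hits = soup.find_all("div", id=pattern)

print([type(hit).__name__ for hit in text_hits])
print([type(hit).__name__ for hit in tag_hits])
```

Filtering on the id attribute instead of on text is what lets the loop in the question keep calling find_all on each result.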

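You can use lxml to extract the content. The lxml library lets you use XPath to pull content out of HTML.

from lxml import etree

# pageText is the HTML source of the page, e.g. page.content from requests
selector = etree.HTML(pageText)
article_text = selector.xpath('//div[@class="article-block-multiple live-snippet"]/div[1]')[0].text

I don't use BeautifulSoup myself, but I think you can do something like this with it:

table = soup.find_all("div", {"class": "article-block-multiple live-snippet"})

Then look through the child elements of the result and take the first div.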
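One thing worth knowing about the lxml approach: an element's .text attribute holds only the text that comes before its first child tag, so nested markup can cut the result short. A small sketch of the difference, assuming a made-up snippet that reuses the class name from the answer (the real page structure may differ):

```python
from lxml import etree

# Hypothetical markup that reuses the class name from the answer above.
page_text = """
<html><body>
<div class="article-block-multiple live-snippet">
  <div>Intro with <b>bold</b> text.</div>
  <div>Second block.</div>
</div>
</body></html>
"""
selector = etree.HTML(page_text)
nodes = selector.xpath('//div[@class="article-block-multiple live-snippet"]/div[1]')

if nodes:
    first_div = nodes[0]
    # Only the text before the first child tag.
    direct_text = first_div.text
    # Text of the div together with all of its descendants.
    full_text = "".join(first_div.itertext())
else:
    # xpath() returns an empty list when nothing matches.
    direct_text = full_text = ""

print(repr(direct_text))
print(repr(full_text))
```

Checking that the list is non-empty before indexing avoids an IndexError on pages that lack the expected div.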

A regular expression should work. Try this:

import re

table = soup.find_all("div", {"id": re.compile(r"^content-body-")})
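If the ids really do follow the two-blocks-of-8-digits format described in the question, the pattern can be made stricter. The following is only a sketch under that assumption; the HTML is a made-up stand-in for a real article page, and the second div exists just to show what the pattern rejects:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup; real pages will differ.
html = """
<div id="content-body-14266949-16447029">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
<div id="content-body-extra"><p>Unrelated block.</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Two blocks of exactly eight digits, anchored at both ends of the id.
id_pattern = re.compile(r"^content-body-\d{8}-\d{8}$")

body = soup.find("div", id=id_pattern)

# find() returns None when nothing matches, so guard before chaining.
paragraphs = [p.get_text(strip=True) for p in body.find_all("p")] if body else []
article_text = "\n\n".join(paragraphs)
print(article_text)
```

Collecting the p elements separately keeps paragraph breaks in the output, which joining every text node in the div does not guarantee.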

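Another approach is to use a CSS selector. Selectors are clean and to the point, so you might give this a try. Just set url to the link you are interested in.

import requests
from bs4 import BeautifulSoup

# url must be set to the article link before running this
res = requests.get(url).text
soup = BeautifulSoup(res, "html.parser")

# [id^="..."] matches any id that starts with the given prefix
for item in soup.select('div[id^="content-body-"] p'):
    print(item.text)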
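Since the question is about scraping several pages, the selector can be wrapped in a small helper and applied to each URL in turn. This is a sketch rather than a tested scraper: the function names extract_article_text and scrape_articles and the timeout value are assumptions made for illustration.

```python
import requests
from bs4 import BeautifulSoup


def extract_article_text(html):
    """Return the paragraph text inside any div whose id starts with content-body-."""
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = soup.select('div[id^="content-body-"] p')
    return "\n\n".join(p.get_text(strip=True) for p in paragraphs)


def scrape_articles(urls, timeout=10):
    """Fetch each URL and map it to its extracted article text."""
    results = {}
    for url in urls:
        response = requests.get(url, timeout=timeout)
        # Raise on HTTP errors instead of parsing an error page.
        response.raise_for_status()
        results[url] = extract_article_text(response.text)
    return results
```

Calling scrape_articles with a list of article links returns a dictionary keyed by URL. Keeping the parsing in its own function also makes it easy to try on saved HTML without touching the network.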

Do you want all the a href links printed as output?

No, I am trying to print the text of the article. soup.find_text() gives me the entire text, whereas the content I need is embedded in multiple <p> elements inside a <div> whose id is content-body-xxxxxxxx-xxxxxxxx.

Hey, thanks. soup.find('div', attrs={"id": re.compile("content-body-...-...")}).find_all("p") worked.