Python: problem getting the article content when scraping a news website with Beautiful Soup

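I am trying to scrape news articles from an RSS feed, along with details such as the title, description, URL, and date. I am not getting the full article content in the description column as I expected. Here is my code: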

import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

url = 'https://www.business-standard.com/rss/economy-policy-102.rss'
resp = requests.get(url)
soup = bs(resp.content, features='xml')
items = soup.find_all('item')
news_items = []

# Collect the fields of every <item> element in the feed.
for item in items:
    news_item = {}
    news_item['title'] = item.title.text
    news_item['description'] = item.description.text
    news_item['link'] = item.link.text
    news_item['pubDate'] = item.pubDate.text
    news_items.append(news_item)

df = pd.DataFrame(news_items, columns=['title', 'description', 'link', 'pubDate'])
# Show the description of the first item.
df.loc[0, 'description']

Output obtained - 'The re-import in the extended period would be without payment of basic customs duty and integrated goods and services tax'


As shown above, I am not getting the full article content. What changes should be made?

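The RSS feed does not contain the full text of the articles; you have to open each link and fetch the article from there.

For example: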

import requests
import pandas as pd
from bs4 import BeautifulSoup


url = 'https://www.business-standard.com/rss/economy-policy-102.rss'
soup = BeautifulSoup(requests.get(url).content, 'xml')

news_items = []
for item in soup.find_all('item'):
    link = item.link.text
    print(link)

    # The feed only carries an excerpt, so fetch the article page itself.
    page = BeautifulSoup(requests.get(link).content, 'html.parser')
    # '.p-content' is specific to this site's article markup.
    body = page.select_one('.p-content')

    news_items.append({
        'title': item.title.text,
        'excerpt': item.description.text,
        # select_one() returns None when nothing matches the selector.
        'text': body.get_text(strip=True, separator=' ') if body else '',
        'link': link,
        'pubDate': item.pubDate.text,
    })

df = pd.DataFrame(news_items)
df.to_csv('data.csv')
This creates data.csv, which can be opened in a spreadsheet program such as LibreOffice.
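
The .p-content selector in the code above matches this particular site's article markup, so other feeds need their own selector. One possible way to organise that is a lookup table keyed by hostname. This is only a sketch: ARTICLE_SELECTORS and selector_for are hypothetical names, and the news.example.com entry is made up for illustration. Only the business-standard.com selector comes from the answer.

```python
from urllib.parse import urlparse

# Hypothetical table: hostname -> CSS selector of the article body.
# Only the first entry comes from the answer; the second is invented.
ARTICLE_SELECTORS = {
    "www.business-standard.com": ".p-content",
    "news.example.com": "div.article-body",
}


def selector_for(url, default=None):
    """Return the article selector registered for the URL's hostname."""
    host = urlparse(url).hostname  # lowercased, without port; may be None
    return ARTICLE_SELECTORS.get(host, default)
```

Inside the loop, the hard-coded selector could then be replaced by selector_for(link), skipping the article when it returns None.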


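This link is not working. The earlier code works for all types of URLs. Should the code be changed for each website?

@VarunS Yes, of course. Every website has a different structure, so the code has to be changed accordingly.

@Andrej Kesely In the text column I am also getting strings such as span.p-content div[id^="div-gpt"]{line-height:0;font-size:0}. How can I remove these?

@VarunS You can use str.replace(). For example: my_string = my_string.replace("{line-height:0;font-size:0}", "")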
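
The str.replace() suggestion can be wrapped in a small helper that removes several known fragments and tidies the leftover whitespace. This is a minimal sketch: remove_fragments is a hypothetical name, and the fragment list has to be filled in by hand with whatever strings actually show up in the text column.

```python
def remove_fragments(text, fragments):
    """Delete each known unwanted fragment, then collapse extra whitespace."""
    for fragment in fragments:
        text = text.replace(fragment, "")
    # split() with no arguments drops runs of whitespace left by the removals.
    return " ".join(text.split())


# Fragment taken from the comment above; extend the list as needed.
UNWANTED = ['span.p-content div[id^="div-gpt"]{line-height:0;font-size:0}']

cleaned = remove_fragments(
    'First paragraph. span.p-content div[id^="div-gpt"]{line-height:0;font-size:0} Second paragraph.',
    UNWANTED,
)
```

If these strings come from style elements nested inside the article container (an assumption, since the page markup is not shown here), another option is to call decompose() on those tags before get_text(), so the CSS never reaches the text in the first place.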