Python: Scraping a website with BeautifulSoup produces a lot of blank space because of the ads on the page


python web-scraping


This is the link I am trying to scrape, as an example:

Here is the function that attempts to do it:

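    # The original "def" line was not included in the post; the function name
    # is a placeholder, and bs is assumed to be the parsed BeautifulSoup object.
    def get_paragraphs(bs):
        t = []
        try:
            temp = []
            # All elements carrying the class "contentSec"
            data = bs.find_all(class_=['contentSec'])
            # logging.info(data)
            for i in data:
                temp = temp + i.find_all('p')
            for i in temp:
                t.append(i.get_text())
        except Exception as e:
            print(e)
        return t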

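What happens is that if I include text=True in the find_all arguments, it ignores the paragraphs that contain links (an a tag with an href attribute). Otherwise, it leaves huge blank gaps in the content field, probably because the ads on the site are also inside p tags. I have attached sample output.

What am I missing?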

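The data you are looking for (the article content) is available directly in the page source, under the div with class mainArea. All you have to do is get the text of that div and clean it up. For the data you need, I don't think it is necessary to find the script tag and use the json module at all. But if you need data such as datePublished, @chitown88's answer is more comprehensive.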

import json

import bs4
import pandas as pd
import requests

list_of_urls = ['https://www.livemint.com/Companies/Ot1UTmQ8EMe0DTWSiJCgfJ/Google-teams-with-HDFC-Bank-ICICI-others-for-instant-loans.html']

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'}

rows = []  # one [date_published, data, url] row per article

for url in list_of_urls:
    response = requests.get(url, headers=headers)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')

    # The article metadata is embedded as JSON-LD in a <script> tag
    scripts = soup.find_all('script', {'type': 'application/ld+json'})
    for script in scripts:
        if '"headline"' in script.text:
            jsonStr = script.text.strip()
            jsonObj = json.loads(jsonStr)

            date_pub = jsonObj['datePublished']
            data = jsonObj['articleBody']
            article_url = jsonObj['url']

            rows.append([date_pub, data, article_url])

# DataFrame.append was removed in pandas 2.0, so build the frame once from the rows
results_df = pd.DataFrame(rows, columns=['date_published', 'data', 'url'])
results_df.to_csv('path/to/file.csv', index=False)
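A minimal offline sketch of the JSON-LD approach used above, run against an inline HTML string rather than the live page. The sample markup, the example.com URL and the helper name extract_article are assumptions made for illustration. Real pages may nest the article object differently (for instance inside a list), which this sketch does not handle.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical markup standing in for a real article page
sample_html = """
<html><head>
<script type="application/ld+json">
{"@type": "NewsArticle", "headline": "Sample headline",
 "datePublished": "2018-08-28T00:00:00Z",
 "articleBody": "First sentence. Second sentence.",
 "url": "https://example.com/article"}
</script>
<script type="application/ld+json">{"@type": "Organization", "name": "Example"}</script>
</head><body></body></html>
"""


def extract_article(html):
    """Return date, body and URL from the first JSON-LD object with a headline."""
    soup = BeautifulSoup(html, "html.parser")
    for script in soup.find_all("script", {"type": "application/ld+json"}):
        raw = script.string
        if not raw:
            continue  # empty script tag
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed JSON instead of aborting the whole page
        if isinstance(obj, dict) and "headline" in obj:
            return {
                "date_published": obj.get("datePublished"),
                "data": obj.get("articleBody"),
                "url": obj.get("url"),
            }
    return None  # no article object found


article = extract_article(sample_html)
print(article)
```

Using obj.get instead of direct indexing means a missing key yields None in that column rather than raising KeyError partway through a list of URLs.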

from bs4 import BeautifulSoup
import requests

url = 'https://www.livemint.com/Companies/Ot1UTmQ8EMe0DTWSiJCgfJ/Google-teams-with-HDFC-Bank-ICICI-others-for-instant-loans.html'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
# The article body sits in the div with class "mainArea"
data = soup.find('div', class_='mainArea')
# Let's just clean up the data: keep the text before the first blank line
cleaned_data = data.text.split('\n\n')[0].strip()
print(cleaned_data)

Output

New Delhi: Google India on Tuesday said it has rebranded its Indian payments app Tez as Google Pay and is partnering four banks to provide instant loans for the app’s users. In the coming weeks, Google Pay users will be able to access customised loans from HDFC Bank Ltd, ICICI Bank Ltd, Federal Bank and Kotak Mahindra Bank Ltd with minimal paperwork, said Caesar Sengupta, vice-president of Google’s Next Billion Users Initiative and Payments, at the Google for India event in New Delhi. Once users holding accounts with these banks accept the bank’s terms, the money will be transferred to their accounts.“We have learnt that when we build for India, we build for the world, and we believe that many of the innovations and features we have pioneered with Tez will work globally," Caesar Sengupta said.Google Tez, which was launched in September, will also expand services for merchants and retailers. About 15,000 retail stores in India will have Google Pay enabled by Diwali 2018, Caesar Sengupta said.Google claims that over 1.2 million small businesses in India are already using Google Pay. In a bid to help their business grow further, Google is building a dedicated merchant experience where they will be discovered through Google Search and Maps, and communicate with their customers through messages and offers.“We are testing these features with merchants in Bangalore and Delhi, and on-boarding more neighbourhoods in the following months," said Sengupta.Google Pay has rivals in Paytm and Facebook Inc.’s WhatsApp targeting the Indian payments market. On Tuesday, Mint reported that Warren Buffett’s Berkshire Hathaway Inc. has sealed a deal with Paytm, marking the legendary investor’s first investment in the country. A string of other big-name players are also expanding in India’s digital payments market including its banks, India Post Payments Bank, and Mukesh Ambani’s Reliance Jio.“The real competition is actually user habits and cash," said Sengupta. 
“So, all of us (referring to the other players) are all in many ways brothers-in-arms who are trying to move people’s habits away from cash to digital so that we can move India to a digital economy. At Google, we focus on the users so we don’t think so much of the competition."Since its launch, over 55 million people have downloaded Google Tez and more than 22 million people and businesses actively use the app for digital transactions every month, according to company data and figures quoted by Sengupta. Collectively, they have made more than 750 million transactions, with an annual run rate of over $30 billion.The search giant also announced other initiatives including expanding its Google Station internet access programme to 12,000 villages and cities across Andhra Pradesh, potentially reaching 10 million people; the launch of Project Navlekha, where Google will work with Indian publishers to bring more relevant content online; and a new feature in Google Go app that can pull up any webpage and let users listen to it with each word lighting up as it is read.
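On the text=True behaviour described in the question: when a tag name is combined with string=True (the current name of the older text argument), Beautiful Soup only matches tags whose .string is not None, and .string is None for any tag with more than one child. A paragraph containing a link is such a tag, so it gets skipped. The sketch below shows this on a small inline sample and then one possible alternative: take every p and drop those whose text is empty after stripping. The sample HTML and the name non_empty_paragraphs are assumptions for illustration, not taken from the site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two real paragraphs plus two empty ones such as ad slots
sample_html = """
<div class="contentSec">
  <p>Plain paragraph.</p>
  <p>Paragraph with a <a href="https://example.com">link</a> inside.</p>
  <p>   </p>
  <p><span class="ad"></span></p>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
section = soup.find(class_="contentSec")

# string=True keeps only <p> tags whose .string is not None,
# so the paragraph that contains the <a> tag is dropped.
filtered = [p.get_text() for p in section.find_all("p", string=True)]


def non_empty_paragraphs(container):
    """Return the stripped text of every <p> that has visible text."""
    result = []
    for p in container.find_all("p"):
        # strip=True trims each text fragment; " " rejoins the fragments
        text = p.get_text(" ", strip=True)
        if text:  # skip whitespace-only and ad-only paragraphs
            result.append(text)
    return result


paragraphs = non_empty_paragraphs(section)
print(paragraphs)
```

Filtering on the stripped text keeps paragraphs with links while removing the blank entries, without depending on how the ads are marked up.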

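What are you trying to do? What is your expected output?

I have a list of URLs from LiveMint and I am trying to get the content from them. I want the CSV file to contain date, data, url. @chitown88

OK. So you want 1) the date, 2) the data (what data? the headline?) and 3) the URL of the article?

Data: the content of the whole article.

I am looking at the specific site you posted. There is 1 main article (with the full content), and the other articles are just headlines. So I just want to know exactly what needs to be pulled. Only the first article?

Wow! This is so comprehensive. Thank you very much.

How did you know the class was mainArea? Sometimes, for some pages, it is so confusing. Is there any rule of thumb? @bittobennichan

I mean, in this link, for example: if I use arti flow as my div class to scrape the content, it gives me nothing. @BittoBennichan

@Nymeria123 I don't think there is a rule of thumb. Just inspect the page source and find the class.

@Nymeria123 I was able to get the data for the new link. Consider asking a question with your code. Something may be missing.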
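As the comments say, there is no rule of thumb for finding the right class, and inspecting the page source remains the reliable way. As a rough aid only, the sketch below totals the paragraph text under each immediate parent element and ranks the parents, which can hint at where the article body sits. The helper name rank_paragraph_containers and the sample markup are hypothetical, and the heuristic can easily be fooled by long sidebars or comment sections.

```python
from collections import Counter

from bs4 import BeautifulSoup

# Hypothetical page: a short ad block and a longer article body
sample_html = """
<div class="sidebar"><p>Ad</p></div>
<div class="mainArea">
  <p>First paragraph of the article body.</p>
  <p>Second paragraph with a <a href="#">link</a> in it.</p>
</div>
"""


def rank_paragraph_containers(html):
    """Rank (tag name, class string) pairs by total length of their direct <p> text."""
    soup = BeautifulSoup(html, "html.parser")
    totals = Counter()
    for p in soup.find_all("p"):
        parent = p.parent
        # class is multi-valued in Beautiful Soup, so join it back into one string
        classes = " ".join(parent.get("class", []))
        totals[(parent.name, classes)] += len(p.get_text(strip=True))
    return totals.most_common()


ranking = rank_paragraph_containers(sample_html)
for (tag_name, classes), length in ranking:
    print(tag_name, repr(classes), length)
```

The top entry is only a candidate to check in the browser's inspector, not a guaranteed answer.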