
Python: scraping by date with BS4

Tags: python, web-scraping, beautifulsoup

I'm relatively new to Python and BS4, and I want to scrape some news articles from a specific website.

My goal is to get the news from the parent URL for today's date, but when I try this it returns a blank CSV file. Please suggest how I can fix or improve it! Thanks in advance.

Here is my code:

from bs4 import BeautifulSoup
import requests, re, pprint
from datetime import date
import csv

today = date.today()
d2 = today.strftime("%B %d, %Y")

result = requests.get('https://www.spglobal.com/marketintelligence/en/news-insights/latest-news-headlines/')

soup = BeautifulSoup(result.content, "lxml")

urls =[]
titles = []
contents = []

#collect all links from 'latest news' into a list
for item in soup.find_all("a"):
    url = item.get("href")
    market_intelligence_pattern = re.compile("^/marketintelligence/en/news-insights/latest-news-headlines/.*")
    if re.findall(market_intelligence_pattern, url):
        if re.findall(market_intelligence_pattern, url)[0] == "/marketintelligence/en/news-insights/latest-news-headlines/index":
            continue
        else:
            news = "https://www.spglobal.com/"+re.findall(market_intelligence_pattern, url)[0]
            urls.append(news)
    else:
        continue

newfile = open('output.csv','w',newline='')
outputWriter = csv.writer(newfile)

#extract today's articles = format: date,title,content
for each in urls:
    individual = requests.get(each)
    soup2 = BeautifulSoup(individual.content, "lxml")
    date = soup2.find("ul",class_="meta-data").text.strip() #getting the date
    #print(date)
    if d2 != date: #today's articles only
        continue
    else:
        title = soup2.find("h2", class_="article__title").text.strip() #getting the title
        titles.append(title)
        #print(title)
        precontent = soup2.find("div", class_="wysiwyg-content") #getting content
        content = precontent.findAll("p")
        indi_content = []
        for i in content:
            indi_content.append(i.text)
            #contents.append(content)
    outputWriter.writerow([date,title,indi_content])
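One likely reason the CSV comes out empty is a date-format mismatch: `d2` is built with `"%B %d, %Y"` (e.g. `November 12, 2020`), while the site appears to render dates in a `"%d %b, %Y"` style (e.g. `12 Nov, 2020`), so the `d2 != date` check skips every article. A minimal sketch of the mismatch, using an arbitrary sample date:

```python
from datetime import date

# Stand-in date just to compare the two strftime patterns.
sample = date(2020, 11, 12)

question_format = sample.strftime("%B %d, %Y")  # format used for d2 in the question
site_format = sample.strftime("%d %b, %Y")      # format the site seems to use

print(question_format)  # November 12, 2020
print(site_format)      # 12 Nov, 2020
print(question_format == site_format)  # False -> every article is skipped
```

Note also that `soup2.find("ul", class_="meta-data").text` grabs the whole `<ul>` text, not just the date, which makes the comparison fail even if the formats agreed.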

Maybe this will push you in the right direction:

from datetime import date

import requests
from bs4 import BeautifulSoup


result = requests.get('https://www.spglobal.com/marketintelligence/en/news-insights/latest-news-headlines/')
soup = BeautifulSoup(result.content, "lxml").find_all("a")


for item in soup:
    if item['href'].startswith("/marketintelligence/en/news-insights/latest") and not item['href'].endswith("index"):
        article_soup = BeautifulSoup(requests.get(f"https://spglobal.com{item['href']}").content, "lxml")
        article_date = article_soup.find("li", {"class": "meta-data__date"})
        if article_date.getText(strip=True) == str(date.today().strftime("%d %b, %Y")):
            print(article_soup.find("h2", {"class": "article__title"}).getText(strip=True))
        else:
            continue
This prints the article title whenever its date matches today's date.

Output:

Houston, America's fossil fuel capital, braces for the energy transition
Blackstone to sell BioMed for $14.6B; Simon JV deal talks for J.C. Penney stall
Next mega-turbine is coming but 'the sky has a limit,' says MHI Vestas CEO
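The question also wanted each matching article written to a CSV as date, title, content. A hedged sketch of how the answer's approach could be extended to do that, reusing the class names from the two snippets above (`meta-data__date`, `article__title`, `wysiwyg-content`); the sample HTML below is a stand-in for an article page, not the site's real markup:

```python
import csv
from bs4 import BeautifulSoup

def extract_row(html):
    """Pull (date, title, paragraph text) out of one article page."""
    soup = BeautifulSoup(html, "html.parser")
    article_date = soup.find("li", {"class": "meta-data__date"}).getText(strip=True)
    title = soup.find("h2", {"class": "article__title"}).getText(strip=True)
    body = soup.find("div", {"class": "wysiwyg-content"})
    paragraphs = " ".join(p.getText(strip=True) for p in body.find_all("p"))
    return [article_date, title, paragraphs]

# Stand-in article markup mirroring the classes used above.
sample_html = """
<html><body>
  <ul class="meta-data"><li class="meta-data__date">12 Nov, 2020</li></ul>
  <h2 class="article__title">Example headline</h2>
  <div class="wysiwyg-content"><p>First paragraph.</p><p>Second.</p></div>
</body></html>
"""

with open("output.csv", "w", newline="") as f:
    csv.writer(f).writerow(extract_row(sample_html))
```

Inside the answer's loop you would call `extract_row` on each fetched page and write the row instead of printing the title.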
