Python BeautifulSoup only finds the first element of a class
I am trying to scrape the following website. In particular, I am trying to get all of the information under "Key Facts", "Portfolio Characteristics", etc. However, when I run my code, it only returns the first item in each section, even though each section contains 8-10 items. I feel like the loop ends as soon as it finds the first match. How can I get around that? My code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
URL = 'https://www.ishares.com/us/products/239726/ishares-core-sp-500-etf'
page = requests.get(URL)
soup = BeautifulSoup(page.text, 'html.parser')
for item in soup.find_all('div', {'class': 'product-data-list data-points-en_US'}):
    label = item.find(class_='caption').text
    print(label)
You can use this example as a guide on how to get the data from the table and create a DataFrame:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = "https://www.ishares.com/us/products/239726/ishares-core-sp-500-etf"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
d = {}
for c in soup.select(".product-data-list .caption"):
    cp = c.contents[0].strip()
    dt = c.find_next(class_="data").get_text(strip=True)
    d[cp] = dt
df = pd.DataFrame([d])
print(df)
df.to_csv("data.csv", index=False)
Prints:
(a one-row DataFrame whose columns are the captions: Net Assets, Inception Date, Exchange, Asset Class, Benchmark Index, ..., Expense Ratio; the single row holds the matching values, e.g. $273,575,842,264, May 15, 2000, NYSE Arca, ...)
and creates data.csv.
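As a variant, if you prefer a long (two-column) table over the single wide row, the same caption/data pairs can be collected as rows. A self-contained sketch, run against a small hypothetical snippet of the page's markup instead of the live site:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the live page
html = """
<div class="product-data-list">
  <span class="caption">Net Assets</span><span class="data">$273,575,842,264</span>
  <span class="caption">Inception Date</span><span class="data">May 15, 2000</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# One (caption, value) row per data point instead of one wide column per caption
rows = [(c.get_text(strip=True), c.find_next(class_="data").get_text(strip=True))
        for c in soup.select(".product-data-list .caption")]
df = pd.DataFrame(rows, columns=["caption", "value"])
print(df)
```

The long format is often easier to filter and join on than a single row with several dozen columns.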
That's because you only find one caption element:
label = item.find(class_='caption').text
Use find_all on the caption class and loop over the results instead; then you will get all of them.
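A minimal sketch of that fix, run against a small inline HTML snippet (hypothetical markup, simplified from the real page) so it is self-contained:

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the page's structure
html = """
<div class="product-data-list data-points-en_US">
  <span class="caption">Net Assets</span>
  <span class="caption">Inception Date</span>
  <span class="caption">Exchange</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

labels = []
for item in soup.find_all("div", {"class": "product-data-list data-points-en_US"}):
    # find() stops at the first match; find_all() returns every caption in the div
    for caption in item.find_all(class_="caption"):
        labels.append(caption.text)

print(labels)  # ['Net Assets', 'Inception Date', 'Exchange']
```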
Each section on the page has a specific div id, which you can use to target a particular section:
import requests
from bs4 import BeautifulSoup as soup
d = soup(requests.get('https://www.ishares.com/us/products/239726/ishares-core-sp-500-etf').text, 'html.parser')
def get_section(block):
    return [{i.select_one('span.caption').get_text(strip=True): i.select_one('span.data').get_text(strip=True),
             'As Of': i.select_one('span.as-of-date').get_text(strip=True)}
            for i in block.select('div.product-data-list div.float-left')]

results = {'Key Facts': get_section(d.select_one('#w1467271812603')),
           'Portfolio Characteristics': get_section(d.select_one('#w1467271812604'))}
Output:
{'Key Facts': [{'Net Assetsas of Apr 09, 2021': '$273,575,842,264', 'As Of': 'as of Apr 09, 2021'}, {'Inception Date': 'May 15, 2000', 'As Of': ''}, {'Exchange': 'NYSE Arca', 'As Of': ''}, {'Asset Class': 'Equity', 'As Of': ''}, {'Benchmark Index': 'S&P 500 Index', 'As Of': ''}, {'Bloomberg Index Ticker': 'SPTR', 'As Of': ''}, {'Shares Outstandingas of Apr 09, 2021': '662,200,000', 'As Of': 'as of Apr 09, 2021'}, {'Premium/Discountas of Apr 09, 2021': '-0.04%', 'As Of': 'as of Apr 09, 2021'}, {'CUSIP': '464287200', 'As Of': ''}, {'Closing Priceas of Apr 09, 2021': '412.98', 'As Of': 'as of Apr 09, 2021'}, {'Options Available': 'Yes', 'As Of': ''}, {'30 Day Avg. Volumeas of Apr 09, 2021': '4,404,180.00', 'As Of': 'as of Apr 09, 2021'}, {'30 Day Median Bid/Ask Spreadas of Apr 09, 2021': '0.01%', 'As Of': 'as of Apr 09, 2021'}, {'Daily Volumeas of Apr 09, 2021': '3,052,119.00', 'As Of': 'as of Apr 09, 2021'}],
'Portfolio Characteristics': [{'Number of Holdingsas of Apr 09, 2021': '505', 'As Of': 'as of Apr 09, 2021'}, {'P/E Ratioas of Apr 09, 2021': '31.39', 'As Of': 'as of Apr 09, 2021'}, {'P/B Ratioas of Apr 09, 2021': '4.40', 'As Of': 'as of Apr 09, 2021'}, {'Equity Beta (3y)as of Mar 31, 2021': '1.00', 'As Of': 'as of Mar 31, 2021'}, {'30 Day SEC Yieldas of Mar 31, 2021': '1.39%', 'As Of': 'as of Mar 31, 2021'}, {'Standard Deviation (3y)as of Mar 31, 2021': '18.40%', 'As Of': 'as of Mar 31, 2021'}, {'12m Trailing Yieldas of Mar 31, 2021': '1.43%', 'As Of': 'as of Mar 31, 2021'}]}
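If you then want those sections in tabular form, each list of dicts can be normalised into rows. A sketch using a hand-copied sample of the output above (the variable names are mine):

```python
import pandas as pd

# Hand-copied sample rows from the 'Portfolio Characteristics' output above
portfolio = [
    {'Number of Holdingsas of Apr 09, 2021': '505', 'As Of': 'as of Apr 09, 2021'},
    {'P/E Ratioas of Apr 09, 2021': '31.39', 'As Of': 'as of Apr 09, 2021'},
]

# Each dict holds one metric key plus 'As Of'; flatten to metric/value/as_of rows
rows = []
for item in portfolio:
    (name, value), = [(k, v) for k, v in item.items() if k != 'As Of']
    rows.append({'metric': name, 'value': value, 'as_of': item['As Of']})

df = pd.DataFrame(rows)
print(df)
```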