Python: How do I get specific data from Yahoo Finance?


python web-scraping


I'm new to web scraping, and I'm trying to scrape the Yahoo Finance "Statistics" page for AAPL (the URL is in the code below).

Here is my code so far:

from bs4 import BeautifulSoup
from requests import get


url = 'https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL'
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')

stock_data = soup.find_all("table")

for stock in stock_data:
    print(stock.text)

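When I run it, I get back all of the table data on the page. However, I only need specific data from each table (e.g. "Market Cap", "Revenue", "Beta").

I tried changing the code to

print(stock[1].text)

to see whether I could limit the output to the second value in each table, but that returns an error message. Am I right to use BeautifulSoup, or do I need a completely different library? What do I have to do to return only specific data instead of all the table data on the page?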

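Inspecting the HTML gives you the best idea of how BeautifulSoup will handle what it sees.

The page appears to contain multiple tables, which in turn contain the information you are looking for. The tables follow a consistent logic.

First scrape all the tables on the page, then find all the table rows (<tr>) and table data cells (<td>) that those tables contain.

Below is one way to do this. I even threw in a function that prints only a specific measurement.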

from bs4 import BeautifulSoup
from requests import get


url = 'https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL'
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')

stock_data = soup.find_all("table")
# stock_data will contain multiple tables, next we examine each table one by one

for table in stock_data:
    # Scrape all table rows into variable trs
    trs = table.find_all('tr')
    for tr in trs:
        # Scrape all table data tags into variable tds
        tds = tr.find_all('td')
        # Index 0 of tds will contain the measurement
        print("Measure: {}".format(tds[0].get_text()))
        # Index 1 of tds will contain the value
        print("Value: {}".format(tds[1].get_text()))
        print("")


def get_measurement(table_array, measurement):
    for table in table_array:
        trs = table.find_all('tr')
        for tr in trs:
            tds = tr.find_all('td')
            if measurement.lower() in tds[0].get_text().lower():
                return(tds[1].get_text())


# print only one measurement, e.g. operating cash flow
print(get_measurement(stock_data, "operating cash flow"))
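To try the table, row, cell logic without touching the network, here is a minimal sketch that runs against a small inline HTML string. The markup, the labels, and the values in SAMPLE_HTML are made-up placeholders (the real page may be structured differently), and tables_to_dict and pick are hypothetical helper names, not part of BeautifulSoup. It collects every two-cell row into a dictionary and then picks out only the measures asked for in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; values are placeholders.
SAMPLE_HTML = """
<table>
  <tr><th>Valuation Measures</th></tr>
  <tr><td>Market Cap (intraday)</td><td>1.00T</td></tr>
  <tr><td>Enterprise Value</td><td>1.10T</td></tr>
</table>
<table>
  <tr><td>Beta (5Y Monthly)</td><td>1.20</td></tr>
  <tr><td>Revenue (ttm)</td><td>250.00B</td></tr>
</table>
"""


def tables_to_dict(html):
    """Map each row's first cell (the measure) to its second cell (the value)."""
    soup = BeautifulSoup(html, "html.parser")
    measures = {}
    for table in soup.find_all("table"):
        for tr in table.find_all("tr"):
            tds = tr.find_all("td")
            # Skip rows without a measure/value pair (e.g. header rows using <th>).
            if len(tds) < 2:
                continue
            measures[tds[0].get_text(strip=True)] = tds[1].get_text(strip=True)
    return measures


def pick(measures, wanted):
    """Return {wanted label: value}, using a case-insensitive substring match."""
    picked = {}
    for label in wanted:
        for name, value in measures.items():
            if label.lower() in name.lower():
                picked[label] = value
                break  # keep only the first matching row
    return picked


all_measures = tables_to_dict(SAMPLE_HTML)
selected = pick(all_measures, ["Market Cap", "Revenue", "Beta"])
for label, value in selected.items():
    print(f"{label}: {value}")
```

Because the match is a substring match and only the first hit is kept, a short label such as "Revenue" can match more than one row on a real page; passing a more specific label narrows it down.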

This isn't Yahoo Finance, but you can do something very similar with Finviz:

import pandas
import requests
from bs4 import BeautifulSoup

base_url = 'https://finviz.com/screener.ashx?v=152&o=price&t=MSFT,AAPL,SBUX,S,GOOG&o=price&c=0,1,2,3,4,5,6,7,8,9,25,63,64,65,66,67'
html = requests.get(base_url)
soup = BeautifulSoup(html.content, "html.parser")
main_div = soup.find('div', attrs={'id': 'screener-content'})

light_rows = main_div.find_all('tr', class_="table-light-row-cp")
dark_rows = main_div.find_all('tr', class_="table-dark-row-cp")

data = []
for rows_set in (light_rows, dark_rows):
    for row in rows_set:
        row_data = []
        for cell in row.find_all('td'):
            val = cell.a.get_text()
            row_data.append(val)
        data.append(row_data)

# sort rows by the first column (parsed as an int) to maintain the original order
data.sort(key=lambda x: int(x[0]))

pandas.DataFrame(data).to_csv("C:\\your_path\\AAA.csv", header=False)

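This is a good alternative in case Yahoo decides to deprecate more of its API functionality. I know they cut a lot a few years ago (mostly historical quotes). I was sad to see that go.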

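Thank you so much! The code works for one URL, but when I try to add a for loop to iterate over more URLs, I get an error on this line:

if measurement.lower() in tds[0].get_text().lower():

It says the list index is out of range. Is this because the UDF get_measurement is a tuple and can't iterate over more than one URL?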

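get_measurement is a function that takes two values. The error most likely occurs because the tds list is empty, so there is no item at index 0 to fetch. For some reason it isn't being populated. Use print statements to find out why. If you found my input useful, please consider upvoting my answer or marking it as accepted.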
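One common way tds ends up empty is a header row built from <th> cells, which contains no <td> at all. A minimal sketch of a more defensive get_measurement follows. As an assumption made for illustration, this variant takes the page's HTML text rather than a list of tables, and the pages dictionary holds made-up HTML strings standing in for what get(url).text would return for each URL.

```python
from bs4 import BeautifulSoup


def get_measurement(html, measurement):
    """Return the value next to the first matching measure, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find_all("tr"):
        tds = tr.find_all("td")
        # Header rows (<th> only) and empty rows have fewer than two <td> cells.
        if len(tds) < 2:
            continue
        if measurement.lower() in tds[0].get_text().lower():
            return tds[1].get_text(strip=True)
    return None


# Hypothetical per-ticker HTML; in practice each string would come from get(url).text.
pages = {
    "AAPL": "<table><tr><th>Cash Flow</th></tr>"
            "<tr><td>Operating Cash Flow (ttm)</td><td>100.00B</td></tr></table>",
    "MSFT": "<table><tr><td>Operating Cash Flow (ttm)</td><td>80.00B</td></tr></table>",
    "EMPTY": "<html><body>No tables here</body></html>",
}

results = {ticker: get_measurement(html, "operating cash flow")
           for ticker, html in pages.items()}
print(results)
```

Returning None for a page with no matching row lets the loop continue past a bad page, and the None entries show which URLs need a closer look.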