Scraping Yahoo Finance income statements with Python

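I'm trying to scrape data from income statements on Yahoo Finance using Python. Specifically, let's say I want the Net Income figure for Apple (AAPL).

The data is structured in a set of nested HTML tables. I'm using the requests module to access the page and retrieve the HTML.

I'm using BeautifulSoup 4 to sift through the HTML structure, but I can't figure out how to get to the figure.

[Screenshot of the page analysed in Firefox]

My code so far: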

from bs4 import BeautifulSoup
import requests

myurl = "https://finance.yahoo.com/q/is?s=AAPL&annual"
html = requests.get(myurl).content
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly

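I tried using

all_strong = soup.find_all("strong")

and then taking the 17th element, which happens to be the one containing the figure I want, but this seems far from elegant. Something like this: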

all_strong[16].parent.next_sibling
...

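Of course, the goal is to search for the name of the figure I need (in this case "Net Income") and then grab the figures themselves from the same row of the HTML table.

I would really appreciate any ideas on how to solve this, keeping in mind that I would like to apply the solution to retrieve a bunch of other data from other Yahoo Finance pages.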

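Solution/extension:

The solution by @wilbur below worked, and I extended it to get the value of any figure available on any of the financials pages of any listed company. My function is as follows: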

import re
import sys

def periodic_figure_values(soup, yahoo_figure):
    values = []
    pattern = re.compile(yahoo_figure)

    title = soup.find("strong", text=pattern)    # works for the figures printed in bold
    if title:
        row = title.parent.parent
    else:
        title = soup.find("td", text=pattern)    # works for any other available figure
        if title:
            row = title.parent
        else:
            sys.exit("Invalid figure '" + yahoo_figure + "' passed.")

    cells = row.find_all("td")[1:]    # exclude the <td> with figure name
    for cell in cells:
        if cell.text.strip() != yahoo_figure:    # needed because some figures are indented
            str_value = cell.text.strip().replace(",", "").replace("(", "-").replace(")", "")
            if str_value == "-":
                str_value = 0
            value = int(str_value) * 1000
            values.append(value)

    return values
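
The string-to-number conversion inside the loop is the part most likely to trip over odd cell contents, so it can help to check it on its own. Below is a minimal sketch of that step pulled out into a helper; the name `parse_cell` is my own and not part of the function above, and it assumes, as that function does, that the page reports figures in thousands, writes negatives in parentheses, and shows empty figures as a lone dash.

```python
def parse_cell(text):
    """Convert one table cell such as '(1,234)' into an integer value."""
    # Drop thousands separators and turn accounting-style parentheses into a minus sign.
    cleaned = text.strip().replace(",", "").replace("(", "-").replace(")", "")
    if cleaned == "-":  # a lone dash is treated as zero, as in the function above
        return 0
    return int(cleaned) * 1000  # assumes the cell is expressed in thousands


print(parse_cell("19,121,000"))
print(parse_cell("(1,234)"))
print(parse_cell(" - "))
```

Any cell that is not a plain integer after cleaning (a decimal, a footnote marker) would still raise a `ValueError` here, which is arguably better than silently storing a wrong number.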
Example usage: I want to get Apple's Income Tax Expense from the most recent available income statements:

print(periodic_figure_values(financials_soup("AAPL", "is"), "Income Tax Expense"))

Output:

[19121000000, 13973000000, 13118000000]
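
The `financials_soup` helper called in the example is not shown in this post. The following is only a guess at what it might look like, assuming it builds the URL from the ticker and the page code using the same pattern as the URLs elsewhere in this thread; `financials_url`, the parameter names, and the timeout are my own additions, and `"is"` is the only page code that actually appears here.

```python
import requests
from bs4 import BeautifulSoup


def financials_url(ticker_symbol, statement="is"):
    # Same URL pattern as the examples in this thread; "is" is the income
    # statement page. No other page codes are assumed.
    return "https://finance.yahoo.com/q/{}?s={}&annual".format(statement, ticker_symbol)


def financials_soup(ticker_symbol, statement="is"):
    # Hypothetical reconstruction of the helper used in the example call.
    response = requests.get(financials_url(ticker_symbol, statement), timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error page
    return BeautifulSoup(response.text, "html.parser")


print(financials_url("AAPL", "is"))
```

Keeping the URL construction in its own function makes it possible to check the address without touching the network.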
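You could also get the period-end dates from the soup and build a dictionary with the dates as keys and the figures as values, but that would make this post too long.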
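
For what it's worth, the dictionary step itself is short once the dates are in a list. A rough sketch, assuming the period-end labels have already been extracted in the same order as the figures; how to pull them out of the soup is left open because the header row's markup isn't shown in this thread, and `figures_by_period` and the placeholder labels are made up for illustration.

```python
def figures_by_period(period_ends, values):
    """Pair each period-end label with the figure reported for that period."""
    if len(period_ends) != len(values):
        raise ValueError("expected exactly one value per period")
    return dict(zip(period_ends, values))


# Placeholder labels, not real dates; the values are the ones from the output above.
periods = ["period 1", "period 2", "period 3"]
values = [19121000000, 13973000000, 13118000000]
print(figures_by_period(periods, values))
```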

So far this seems to work for me, but I'm always grateful for constructive criticism.

This is made a little more difficult because "Net Income" is enclosed in a <strong> tag, so bear with me, but I think this works:

import re, requests
from bs4 import BeautifulSoup

url = 'https://finance.yahoo.com/q/is?s=AAPL&annual'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
pattern = re.compile('Net Income')

title = soup.find('strong', text=pattern)
row = title.parent.parent # yes, yes, I know it's not the prettiest
cells = row.find_all('td')[1:] #exclude the <td> with 'Net Income'

values = [ c.text.strip() for c in cells ]

When I tested it on Alphabet (Google) it didn't work, I believe because they don't display an income statement, but when I checked Facebook (FB) the values returned were correct.

If you want a more dynamic script, you can use string formatting to build the URL for whatever stock symbol you want, like this:

ticker_symbol = 'AAPL' # or 'FB' or any other ticker symbol
url = 'https://finance.yahoo.com/q/is?s={}&annual'.format(ticker_symbol)

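Thanks. It works great so far. Now I just need to make it a bit more dynamic, not only across stocks but also across other financial data for the same stock, checking for the latest data, and so on. But it's a great start.

AttributeError: 'NoneType' object has no attribute 'parent'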
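
The comment doesn't say which page triggered that error, but one likely cause is that `soup.find(...)` returns `None` when no tag matches (for example, when the page has no "Net Income" row), so `.parent` fails. Here is a minimal sketch of a guard against that, run on a small hand-written table so it works offline; the markup only mimics the row layout the answer relies on, and `row_values` is a hypothetical name.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the row layout the answer relies on.
SAMPLE_HTML = """
<table>
  <tr><td><strong>Net Income</strong></td><td>1,000</td><td>(2,000)</td></tr>
</table>
"""


def row_values(soup, figure):
    """Return the cell texts in the row labelled `figure`, or [] if it is absent."""
    title = soup.find("strong", string=re.compile(re.escape(figure)))
    if title is None:  # find() returns None when nothing matches
        return []
    row = title.find_parent("tr")  # climb to the enclosing row instead of chaining .parent
    return [cell.get_text(strip=True) for cell in row.find_all("td")[1:]]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(row_values(soup, "Net Income"))
print(row_values(soup, "No Such Figure"))
```

Returning an empty list is just one choice; raising an exception with the figure name in the message may be more useful when a missing row should stop the script.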