Python parsing of a breakdown list


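I need to get data/strings from Yahoo Finance. However, the relevant information is "hidden" under a breakdown list.

As you can see, I can access other data, such as Total Revenue and Cost of Revenue. The problem appears when I try to access data hidden under the breakdown list: Current Assets and Inventory (located under the Total Assets and Current Assets sections).

Python raises AttributeError: 'NoneType' object has no attribute 'find_next', which I don't find very informative.

P.S. By commenting out each line in turn, I found that the problem lies with these elements.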

import urllib.request as url
from bs4 import BeautifulSoup

company = input('enter companies abbreviation')
income_page = 'https://finance.yahoo.com/quote/' + company + '/financials/'
balance_page = 'https://finance.yahoo.com/quote/' + company + '/balance-sheet/'

# download the raw HTML of both pages
set_income_page = url.urlopen(income_page).read()
set_balance_page = url.urlopen(balance_page).read()
soup_income = BeautifulSoup(set_income_page, 'html.parser')
soup_balance = BeautifulSoup(set_balance_page, 'html.parser')

# these four lookups work
revenue_element = soup_income.find('span', string='Total Revenue').find_next('span').text
cogs_element = soup_income.find('span', string='Cost of Revenue').find_next('span').text
ebit_element = soup_income.find('span', string='Operating Income').find_next('span').text
net_element = soup_income.find('span', string='Pretax Income').find_next('span').text

# these two raise the AttributeError
short_assets_element = soup_balance.find('span', string='Current Assets').find_next('span').text
inventory_element = soup_balance.find('span', string='Inventory').find_next('span').text
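
The failure mode can be reproduced without any network access. The following is a minimal sketch, not the real page: the inline HTML is made up for illustration, and `value_after_label` is a hypothetical helper name. It shows that `find` returns None when no matching span exists in the fetched markup, and that chaining `.find_next` onto that None is what raises the AttributeError.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration: only "Total Revenue" is in the static HTML,
# mimicking rows that exist before any breakdown section is expanded.
STATIC_HTML = """
<div>
  <div><span>Total Revenue</span><span>100</span></div>
</div>
"""


def value_after_label(soup, label):
    """Return the text of the span following the label span, or None if absent."""
    label_span = soup.find('span', string=label)
    if label_span is None:
        # find() found no match; calling .find_next on None raises AttributeError
        return None
    value_span = label_span.find_next('span')
    return value_span.text if value_span is not None else None


soup = BeautifulSoup(STATIC_HTML, 'html.parser')
print(value_after_label(soup, 'Total Revenue'))  # row present in the static HTML
print(value_after_label(soup, 'Inventory'))      # row absent, so None is returned
```

A guard like this does not make the missing rows appear; it only turns the crash into an explicit None so it is clear which labels are absent from the downloaded HTML.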

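Here is an example of parsing this web page with selenium. It lets you simulate user behavior: wait for the page to load, close the popup, expand a tree node by clicking it, and extract some information from it.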

from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup

company = input('enter companies abbreviation: ')

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
wd = webdriver.Chrome('<<PATH_TO_CHROMEDRIVER>>', options=chrome_options)

# delay (how long selenium waits for element to be loaded)
DELAY = 30

# maximize browser window
wd.maximize_window()

# load page via selenium
wd.get('https://finance.yahoo.com/quote/' + company + '/financials/')

# check for popup, close it
try:
    btn = WebDriverWait(wd, DELAY).until(EC.presence_of_element_located((By.XPATH, '//button[text()="I agree"]')))
    wd.execute_script("arguments[0].scrollIntoView();", btn)
    wd.execute_script("arguments[0].click();", btn)
except Exception:
    # no popup appeared within DELAY seconds; continue
    pass

# wait for page to load
results = WebDriverWait(wd, DELAY).until(EC.presence_of_element_located((By.ID, 'Col1-1-Financials-Proxy')))

# parse content
soup_income = BeautifulSoup(results.get_attribute('innerHTML'), 'html.parser')

# extract values
revenue_element = soup_income.find('span', string='Total Revenue').find_next('span').text
cogs_element = soup_income.find('span', string='Cost of Revenue').find_next('span').text
ebit_element = soup_income.find('span', string='Operating Income').find_next('span').text
net_element = soup_income.find('span', string='Pretax Income').find_next('span').text

# load page via selenium
wd.get('https://finance.yahoo.com/quote/' + company + '/balance-sheet/')

# wait for page to load
results = WebDriverWait(wd, DELAY).until(EC.presence_of_element_located((By.ID, 'Col1-1-Financials-Proxy')))

# expand total assets
btn = WebDriverWait(wd, DELAY).until(EC.element_to_be_clickable((By.XPATH, '//span[text()="Total Assets"]/preceding-sibling::button')))
wd.execute_script("arguments[0].scrollIntoView();", btn)
wd.execute_script("arguments[0].click();", btn)
    
# expand current assets (reveals the Inventory row)
btn = WebDriverWait(wd, DELAY).until(EC.element_to_be_clickable((By.XPATH, '//span[text()="Current Assets"]/preceding-sibling::button')))
wd.execute_script("arguments[0].scrollIntoView();", btn)
wd.execute_script("arguments[0].click();", btn)

# parse content
soup_balance = BeautifulSoup(results.get_attribute('innerHTML'), 'html.parser')

# extract values
short_assets_element = soup_balance.find('span', string='Current Assets').find_next('span').text
inventory_element = soup_balance.find('span', string='Inventory').find_next('span').text

# close webdriver
wd.quit()

print(revenue_element)
print(cogs_element)
print(ebit_element)
print(net_element)
print(short_assets_element)
print(inventory_element)


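The error you're seeing means BeautifulSoup could not find the element you are calling find_next on (so find returned None). Almost certainly the tag is not present in the page as fetched, and is only generated when the section header is clicked.

That's right. But how can I access the strings that are generated when I click the header?

I would start by determining how the content is generated; that will determine which extraction method to use. For example, the answers to a related question have some links to resources that may help.

Thanks for the reply and the code. However, I'm stuck on this line: "btn = WebDriverWait(wd, DELAY).until(EC.element_to_be_clickable(...))". Selenium raises a selenium.common.exceptions.TimeoutException: Message: error. I tried changing the delay and other approaches. Could you elaborate on this error (what it means, etc.) and on techniques for handling it? P.S. I searched open sources but couldn't find any "digestible" information.

The error message means that the button did not become clickable within the configured delay (most likely it could not be found with the provided XPath). In that case, rerun the session without the --headless flag and manually check whether the button exists, and how this case differs from a successful one.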