Extracting table contents from a website with expandable tables using Python and Selenium

python html selenium

I want to extract the following numbers from this website:

I tried using Selenium and managed to extract the numbers row by row:

4 806   1 709   486 
4 025   2 120   435 
526       15    2   
-38       12    2   
-48       7     2

But then I realized that these are only the three most recent years: 2017, 2016 and 2015.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Selenium 4 takes the driver path through a Service object.
driver = webdriver.Chrome(service=Service("/Users/gabriele/Downloads/chromedriver"))
driver.get("https://www.allabolag.se/5569640369/bokslut")

income_statement_raw = driver.find_element(By.ID, "bokslut")

income_statement_raw_box = income_statement_raw.find_elements(By.CLASS_NAME, "box")

# expected 4806  1709   486  177

# Note: an XPath starting with // searches the whole document, not only box[0].
year_count_of_financial_data_raw = income_statement_raw_box[0].find_elements(By.XPATH, '//div[@class="table__container table__container--padding-bleed-x box__bleed-x--up-to-small"]//table[@class="table--background-separator company-table"]/tbody')

print(year_count_of_financial_data_raw[0].text)

driver.close()

I expected to get four numbers, since I can see them in the HTML (see image):
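One possible reason for the missing fourth value, offered as an assumption rather than something verified against the page: Selenium's `.text` returns only text that is visible in the rendered page, so a column hidden by CSS would be skipped even though its `td` is present in the HTML. Below is a minimal sketch of reading the `textContent` DOM property instead, which also includes hidden text. `cell_texts` is a hypothetical helper name, and `row` is assumed to be a Selenium `WebElement` for a single `tr`.

```python
def cell_texts(row):
    """Return the text of every <td> under `row`, hidden cells included.

    `row` is assumed to be a Selenium WebElement for a single <tr>.
    """
    # "tag name" is the string value of Selenium's By.TAG_NAME locator.
    cells = row.find_elements("tag name", "td")
    # WebElement.text skips invisible text; the textContent DOM property does not.
    return [(cell.get_property("textContent") or "").strip() for cell in cells]
```

If the fourth column really is hidden, calling this on each `tr` of the table should return four values per row instead of three.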

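I parsed the web page for you using BeautifulSoup.

I am not 100% sure which data you want to extract, so I focused on the expected data shown in your post. The data variable contains every row of the extracted table.

Remember to put the chromedriver for your platform in the script folder, and uncomment the headless line to make the browser invisible.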

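import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = "https://www.allabolag.se/5569640369/bokslut"
options = Options()
# Uncomment the next line to run Chrome without a visible window.
#options.add_argument('--headless')
options.add_argument('--disable-gpu')
# Pass the options via options=; the older chrome_options= keyword is deprecated.
driver = webdriver.Chrome(options=options)
driver.get(url)
time.sleep(3)  # give the page time to finish rendering
page = driver.page_source
driver.quit()
soup = BeautifulSoup(page, 'html.parser')
# select_one returns the first <table> that is the first table among its siblings.
first_table = soup.select_one("table:nth-of-type(1)")

data = []
rows = first_table.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    # Strip surrounding whitespace and remove the space used as thousands separator.
    cols = [ele.text.strip().replace(" ", "") for ele in cols]
    data.append([ele for ele in cols if ele])  # drop empty cells

print(data[1])
#>>> ['4806', '1709', '486', '177']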
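The cleaned values are still strings. Here is a small sketch of turning them into integers, assuming the cells use a space (regular, non-breaking or narrow no-break) as thousands separator and may contain a Unicode minus sign. `to_int` is a hypothetical helper and not part of the answer above.

```python
def to_int(cell):
    """Convert a cell such as '4 806' or '-38' to an int."""
    # Normalise a Unicode minus sign (U+2212) to an ASCII hyphen-minus.
    cleaned = cell.replace("\u2212", "-")
    # str.split() with no argument splits on any whitespace, including
    # non-breaking (U+00A0) and narrow no-break (U+202F) spaces.
    cleaned = "".join(cleaned.split())
    return int(cleaned)


row = ["4 806", "1\u00a0709", "486", "-38"]
print([to_int(cell) for cell in row])
```

Splitting on any whitespace rather than replacing only the ASCII space keeps the conversion working if the page uses non-breaking spaces between digit groups.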

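The table data for all previous years seems to be loaded in the HTML. If you want to extract all of it, I suggest scraping the site with whichever module you know: scrapy, bs4, requests, HTMLParser, etc.

I asked a different question: why can't I extract all the tds from the HTML?

Oh wow, this is great. Thank you very much. I had not seen table:nth-of-type used before; I will look into it further.

Glad it was useful. It is a simple CSS selector; you can read more about it here: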
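As a sketch of the HTMLParser option mentioned in the comments, the same row extraction can be done with only the standard library. The class name `TableRows` and the sample markup are made up for illustration; in practice the string fed to the parser would be `driver.page_source`.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # Drop all whitespace so "4 806" becomes "4806".
            self._row.append("".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Made-up markup standing in for driver.page_source.
sample = """
<table>
  <tr><th>Year</th><th>A</th><th>B</th></tr>
  <tr><td>4 806</td><td>1 709</td><td>486</td><td>177</td></tr>
  <tr><td>-38</td><td>12</td><td>2</td></tr>
</table>
"""

parser = TableRows()
parser.feed(sample)
print(parser.rows[1])
```

As in the BeautifulSoup version, the header row contains no `td` cells and therefore yields an empty list at index 0.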
