Python: after creating a BeautifulSoup4 object, how do I filter it further?
I'm trying to learn how to parse HTML data, so I picked this site (), which has real-time electricity price data:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import pandas as pd
url = "http://ets.aeso.ca/ets_web/ip/Market/Reports/CSMPriceReportServlet"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")
print(soup.get_text())
My question is: once the BS4 object has been created, how do I parse/filter it further?
I only want the most recent data, so that I can put it into a DataFrame
like this:
Date (HE) | Time | Price ($) | Volume (MW)
01/15/2021 | 14 | 13:10 | 40.16 | 80
01/15/2021 | 14 | 13:05 | 40.18 | 100
01/15/2021 | 14 | 13:00 | 40.16 | 80
As goalie1998 stated, the most efficient approach is to use pandas; it takes just two lines of code. Example:
Note that pandas.read_html() stores all the tables it finds in a list of DataFrames. Since the table you want is the third one in the page source, you can select it with [2]:
import pandas as pd
pd.read_html('http://ets.aeso.ca/ets_web/ip/Market/Reports/CSMPriceReportServlet')[2]
Output
Date (HE) Time Price ($) Volume (MW)
0 01/15/2021 14 13:10 40.16 80
1 01/15/2021 14 13:05 40.18 100
2 01/15/2021 14 13:00 40.16 80
3 01/15/2021 13 12:40 40.16 80
4 01/15/2021 13 12:00 40.01 100
5 01/15/2021 12 11:54 40.01 100
6 01/15/2021 12 11:00 40.18 100
7 01/15/2021 11 10:54 40.18 100
8 01/15/2021 11 10:24 40.16 80
9 01/15/2021 11 10:00 40.18 100
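read_html may leave the scraped columns as text, so a small follow-up conversion helps before doing any arithmetic on them. A minimal sketch, with the column names taken from the output above; the sample frame here is hard-coded rather than fetched, so it runs offline:

```python
import pandas as pd

# hard-coded sample mirroring two rows of the scraped table (illustration only)
df = pd.DataFrame(
    {
        "Date (HE)": ["01/15/2021", "01/15/2021"],
        "Time": ["13:10", "13:05"],
        "Price ($)": ["40.16", "40.18"],
        "Volume (MW)": ["80", "100"],
    }
)

# convert the string columns to numeric so you can aggregate them
df["Price ($)"] = pd.to_numeric(df["Price ($)"])
df["Volume (MW)"] = pd.to_numeric(df["Volume (MW)"])

print(df["Price ($)"].mean())   # → 40.17, the average over the sampled rows
```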
Just in case you really have to use beautifulsoup: select the table (or all the tr elements inside the table), loop over the rows, append the text of each td to a data list, and then build the DataFrame with pandas:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import pandas as pd

url = "http://ets.aeso.ca/ets_web/ip/Market/Reports/CSMPriceReportServlet"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")

data = []
# select the rows of the table that has no class attribute
rows = soup.select('table:not([class]) tr')
for i, row in enumerate(rows):
    if i == 0:
        cols = row.find_all('th')  # header row
    else:
        cols = row.find_all('td')  # data rows
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])

pd.DataFrame(data[1:], columns=data[0])
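The CSS selector is what does the actual filtering here: table:not([class]) matches only tables that carry no class attribute, which skips the site's styled layout tables. A self-contained sketch on a made-up snippet (the HTML below is invented for illustration, not taken from the AESO page):

```python
from bs4 import BeautifulSoup

# invented two-table snippet: one styled table, one plain data table
html = """
<table class="nav"><tr><td>menu</td></tr></table>
<table>
  <tr><th>Price ($)</th><th>Volume (MW)</th></tr>
  <tr><td>40.16</td><td>80</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# only rows of the class-less table are selected
rows = soup.select('table:not([class]) tr')
print(len(rows))            # → 2: the header row and one data row
print(rows[1].get_text())   # text of the data row
```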
Depending on how the page is formatted, you can try pandas.read_html() first; if that does not work well on a given site, fall back to beautifulsoup's find/find_all.
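As a concrete illustration of the find/find_all route: grab the table, then iterate over its rows and cells. This is a hedged sketch on an invented snippet; the real page nests several tables, so you would still need to pick the right one:

```python
from bs4 import BeautifulSoup

# made-up single-table snippet for illustration
html = """
<table>
  <tr><th>Time</th><th>Price ($)</th></tr>
  <tr><td>13:10</td><td>40.16</td></tr>
  <tr><td>13:05</td><td>40.18</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

table = soup.find("table")          # first <table> in the document
rows = table.find_all("tr")         # every row inside it
cells = [
    [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    for row in rows
]
print(cells[0])   # header row
print(cells[1])   # first data row
```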