Python: after creating a BeautifulSoup4 object, how do I filter it further?
I'm trying to learn how to parse HTML data, so I picked this site (), which has real-time electricity price data:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import pandas as pd
url = "http://ets.aeso.ca/ets_web/ip/Market/Reports/CSMPriceReportServlet"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")
print(soup.get_text())
My question is: once the BS4 object has been created, how do I parse/filter it further?
I only want the most recent data, so that I can put it into a DataFrame
like this:
Date (HE) | Time | Price ($) | Volume (MW)
01/15/2021 | 14 | 13:10 | 40.16 | 80
01/15/2021 | 14 | 13:05 | 40.18 | 100
01/15/2021 | 14 | 13:00 | 40.16 | 80
As goalie1998 stated, the most efficient approach is to use pandas; it takes just two lines of code. Example:
Note that pandas.read_html() stores all the tables it finds in a list of DataFrames. Since the table you want is the third one in the page source, you can select it with [2]:
import pandas as pd
pd.read_html('http://ets.aeso.ca/ets_web/ip/Market/Reports/CSMPriceReportServlet')[2]
Output
Date (HE) Time Price ($) Volume (MW)
0 01/15/2021 14 13:10 40.16 80
1 01/15/2021 14 13:05 40.18 100
2 01/15/2021 14 13:00 40.16 80
3 01/15/2021 13 12:40 40.16 80
4 01/15/2021 13 12:00 40.01 100
5 01/15/2021 12 11:54 40.01 100
6 01/15/2021 12 11:00 40.18 100
7 01/15/2021 11 10:54 40.18 100
8 01/15/2021 11 10:24 40.16 80
9 01/15/2021 11 10:00 40.18 100
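read_html may leave the scraped columns as text, so a small follow-up conversion helps before doing any arithmetic on them. A minimal sketch, with the column names taken from the output above; the sample frame here is hard-coded rather than fetched, so it runs offline:

```python
import pandas as pd

# hard-coded sample mirroring two rows of the scraped table (illustration only)
df = pd.DataFrame(
    {
        "Date (HE)": ["01/15/2021", "01/15/2021"],
        "Time": ["13:10", "13:05"],
        "Price ($)": ["40.16", "40.18"],
        "Volume (MW)": ["80", "100"],
    }
)

# convert the string columns to numeric so you can aggregate them
df["Price ($)"] = pd.to_numeric(df["Price ($)"])
df["Volume (MW)"] = pd.to_numeric(df["Volume (MW)"])

print(df["Price ($)"].mean())   # → 40.17, the average over the sampled rows
```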
Just in case you really have to use beautifulsoup: select the table (or all the tr elements inside the table), loop over the rows, append the text of each td to a data list, and then build the DataFrame with pandas:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import pandas as pd

url = "http://ets.aeso.ca/ets_web/ip/Market/Reports/CSMPriceReportServlet"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")

data = []
# select the rows of the table that has no class attribute
rows = soup.select('table:not([class]) tr')
for i, row in enumerate(rows):
    if i == 0:
        cols = row.find_all('th')  # header row
    else:
        cols = row.find_all('td')  # data rows
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])

pd.DataFrame(data[1:], columns=data[0])
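The CSS selector is what does the actual filtering here: table:not([class]) matches only tables that carry no class attribute, which skips the site's styled layout tables. A self-contained sketch on a made-up snippet (the HTML below is invented for illustration, not taken from the AESO page):

```python
from bs4 import BeautifulSoup

# invented two-table snippet: one styled table, one plain data table
html = """
<table class="nav"><tr><td>menu</td></tr></table>
<table>
  <tr><th>Price ($)</th><th>Volume (MW)</th></tr>
  <tr><td>40.16</td><td>80</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# only rows of the class-less table are selected
rows = soup.select('table:not([class]) tr')
print(len(rows))            # → 2: the header row and one data row
print(rows[1].get_text())   # text of the data row
```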
Depending on how the page is formatted, you can try pandas.read_html() first; if that does not work well on a given site, fall back to beautifulsoup's find/find_all.
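As a concrete illustration of the find/find_all route: grab the table, then iterate over its rows and cells. This is a hedged sketch on an invented snippet; the real page nests several tables, so you would still need to pick the right one:

```python
from bs4 import BeautifulSoup

# made-up single-table snippet for illustration
html = """
<table>
  <tr><th>Time</th><th>Price ($)</th></tr>
  <tr><td>13:10</td><td>40.16</td></tr>
  <tr><td>13:05</td><td>40.18</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

table = soup.find("table")          # first <table> in the document
rows = table.find_all("tr")         # every row inside it
cells = [
    [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    for row in rows
]
print(cells[0])   # header row
print(cells[1])   # first data row
```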