Python: convert a website table to a pandas df (BeautifulSoup does not recognize the table)
I want to convert a website table into a pandas df, but BeautifulSoup does not recognize the table (screenshot below). Below is the code I tried with no luck.

I also tried the code below, but without success:
df = pd.read_html('https://www.ndbc.noaa.gov/ship_obs.php')
print(df)
Your table is not inside a <table> tag; the data is in multiple <span> tags within a <pre> block.
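As a minimal illustration (using a toy HTML snippet, not the live page), a <table> lookup comes back empty while the <pre>/<span> lookup finds the data rows:

```python
import bs4

# Toy snippet mimicking the page layout: data rows are <span>s inside a <pre>
html = "<pre><span>SHIP 19 46.5 -72.3</span><span>SHIP 19 46.8 -71.2</span></pre>"
soup = bs4.BeautifulSoup(html, "html.parser")

print(soup.find("table"))                      # None: nothing for read_html to see
print(len(soup.find("pre").find_all("span")))  # 2 data rows
```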
You can parse it into a DataFrame like this:
import pandas as pd
import requests
import bs4

url = "https://www.ndbc.noaa.gov/ship_obs.php"
# The data lives in <span> runs inside a <pre> block, not in a <table>
spans = bs4.BeautifulSoup(requests.get(url).text, 'html.parser').find('pre').find_all("span")
# Each <span> holds one whitespace-separated row
print(pd.DataFrame([r.getText().split() for r in spans]))
Output:
0 1 2 3 4 5 ... 40 41 42 43 44 45
0 SHIP HOUR LAT LON WDIR WSPD ... °T ft sec °T Acc Ice
1 SHIP 19 46.5 -72.3 260 5.1 ... None None None None None None
2 SHIP 19 46.8 -71.2 110 2.9 ... None None None None None None
3 SHIP 19 47.4 -61.8 40 18.1 ... None None None None None None
4 SHIP 19 47.7 -53.2 40 8.0 ... None None None None None None
.. ... ... ... ... ... ... ... ... ... ... ... ... ...
170 SHIP 19 17.6 -62.4 100 20.0 ... None None None None None None
171 SHIP 19 25.8 -78.0 40 24.1 ... None None None None None None
172 SHIP 19 1.5 104.8 20 22.0 ... None None None None None None
173 SHIP 19 57.9 1.2 180 - ... None None None None None None
174 SHIP 19 35.1 -10.0 310 24.1 ... None None None None None None
[175 rows x 46 columns]
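Since row 0 of this frame carries the column names, one follow-up step (sketched here on a hypothetical miniature of the frame, not the full 46 columns) is to promote that row to the header:

```python
import pandas as pd

# Hypothetical miniature of the scraped frame: row 0 holds the header labels
df = pd.DataFrame([
    ["SHIP", "HOUR", "LAT", "LON"],
    ["SHIP", "19", "46.5", "-72.3"],
    ["SHIP", "19", "46.8", "-71.2"],
])

# Promote the first row to column labels, then drop it from the data
df.columns = df.iloc[0]
df = df.drop(index=0).reset_index(drop=True)
print(df.columns.tolist())  # ['SHIP', 'HOUR', 'LAT', 'LON']
print(len(df))              # 2
```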
A slightly different approach; you can also check the column count. I skipped the header rows at the top, so you will have to build the column headers yourself and clean up the last row.
import io
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.ndbc.noaa.gov/ship_obs.php'
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")
# All of the data sits inside a single <pre> block
tablecontent = soup.find('pre')
# Parse its text as a whitespace-delimited table, skipping the 3 header lines
s = io.StringIO(tablecontent.text)
df = pd.read_csv(s, sep=r'\s+', engine='python', skiprows=3, header=None)
Output (apologies, it did not copy over cleanly from Jupyter)
OK. The data is not stored in a table; it sits inside a <pre> tag as a bunch of text. I think this will help you.
import io
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.ndbc.noaa.gov/ship_obs.php'
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")
# All of the data sits inside a single <pre> block
tablecontent = soup.find('pre')
# Parse its text as a whitespace-delimited table, skipping the 3 header lines
s = io.StringIO(tablecontent.text)
df = pd.read_csv(s, sep=r'\s+', engine='python', skiprows=3, header=None)
0 1 2 3 4 5 6 7 8 9 ... 14 15 16 17 18 19 20 21 22 23
0 SHIP 19 47.4 -61.8 40 18.1 - - - 29.82 ... - - - - - - - - ---- -----
1 SHIP 19 47.7 -53.2 40 8.0 - - - 29.76 ... - - - - - - - - ---- -----
2 SHIP 19 47.8 -54.1 50 13.0 - - - 29.75 ... - - - - - - - - ---- -----
3 SHIP 19 48.2 -53.4 50 13.0 - - - 29.78 ... - - - - - - - - ---- -----
4 SHIP 19 46.8 -71.2 110 2.9 - - - 30.03 ... - - - - - - - - ---- -----
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
178 SHIP 19 25.8 -78.0 40 24.1 - 4.9 4.0 30.08 ... 11 5 - - - - - - ---- -----
179 SHIP 19 1.5 104.8 20 22.0 - - - 29.87 ... 11 5 - - - - - - ---- -----
180 SHIP 19 57.9 1.2 180 - - - - 29.35 ... 5 - - - - - - - ---- -----
181 SHIP 19 35.1 -10.0 310 24.1 - 6.6 6.0 29.68 ... 5 8 14.8 10.0 310 - - - ---- -----
182 182 ship observations reported for 1900 GMT None None None ... None None None None None None None None None None
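Note that the last row here is a summary line ("182 ship observations reported for 1900 GMT"), not data. A sketch of the cleanup on a hypothetical miniature of the frame, with illustrative column names in place of the header rows dropped by skiprows=3:

```python
import pandas as pd

# Hypothetical miniature: two data rows plus the trailing summary line
df = pd.DataFrame([
    ["SHIP", "19", "47.4", "-61.8"],
    ["SHIP", "19", "47.7", "-53.2"],
    ["182", "182", "ship", "observations"],
])

# Drop the trailing summary row and attach headers skipped during parsing
df = df.iloc[:-1].copy()
df.columns = ["SHIP", "HOUR", "LAT", "LON"]  # illustrative names from the site's header row
print(len(df))             # 2
print(df["LAT"].tolist())  # ['47.4', '47.7']
```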