Parsing a table from a website with Python does not work


python pandas web-scraping


I want to extract a table from a website. To do that, I wrote the following code for a German weather site:

import pandas as pd

df, = pd.read_html("https://www.dwd.de/DE/leistungen/beobachtung/beobachtung.html")

print(df)

Since I was happy with the result, I tried the same code on a table from the following Russian site:

import pandas as pd

df= pd.read_html("https://www.ifm.com/ru/ru/category/010/010_010/010_010_020#!/S/DD/DM/1/D/0/F/0/T/24")[0]

print(df)

But now the output looks a bit strange:

        {{'LABEL_PRODUCTS' | translate }}  \
        0  {{product.product.name}}  {{product.description}}        1           
    {{'ORDER_DETAIL_SUBTOTAL' | translate}}:          {{'SHOPPING_CART_QUANTITY' | 
    translate}}  \               0                     {{product.quantity}}      
                1       {{subTotal | showPrices : "true"}}                       
                        {{'LABEL_SUM' | translate}}        0
  {{product.totalPrice.formattedValue | showPric...        1   
     NaN        [Program finished]

Now I don't understand why it does not parse the table contents correctly. Is the table too complex, or is the markup incorrect?
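A note on the output above: the {{ ... }} strings look like template placeholders that a client-side framework has not filled in yet. A quick way to confirm this is to search the raw HTML for such placeholders before handing it to pd.read_html. The sketch below is only an illustration; the helper name find_unrendered_placeholders and the sample markup are hypothetical and not taken from either site.

```python
import re

# Matches Angular/Mustache-style placeholders such as {{product.quantity}}.
PLACEHOLDER = re.compile(r"\{\{.*?\}\}", re.DOTALL)


def find_unrendered_placeholders(html):
    """Return every {{ ... }} placeholder found in a raw HTML string."""
    return PLACEHOLDER.findall(html)


# Hypothetical markup resembling a table that has not been rendered yet.
sample = (
    "<table><tr>"
    "<td>{{product.product.name}}</td>"
    "<td>{{product.quantity}}</td>"
    "</tr></table>"
)

print(find_unrendered_placeholders(sample))
```

If the returned list is not empty, the table was most likely fetched before the page's JavaScript ran, and parsing the static HTML will not yield the real data.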

The site is dynamic, so you have to use a browser automation tool such as selenium:

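from selenium import webdriver
import pandas as pd
from bs4 import BeautifulSoup as soup

# Load the page in a real browser so that JavaScript-generated content is present.
d = webdriver.Chrome()
d.get('https://www.dwd.de/DE/leistungen/beobachtung/beobachtung.html')
table = soup(d.page_source, 'lxml')
d.quit()  # the HTML has been captured, so the browser can be closed

# Column names come from the header cells; data comes from all td cells.
headers = [i.text for i in table.find_all('th', {'scope': 'col'})]
cells = [i.text for i in table.find_all('td')]
# Split the flat list of cells into rows of len(headers) cells each.
full_table = [cells[i:i + len(headers)] for i in range(0, len(cells), len(headers))]
frame = pd.DataFrame([dict(zip(headers, i)) for i in full_table])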

Output:

        Böen   DD   FF   FX  HÖHE  LUFTD.  RR30               Station  TEMP.  \
0        ---  --   ---  ---     0  ------  ----            UFS TW Ems    ---   
1        ---  --   ---  ---     0  ------  ----    UFS Deutsche Bucht    ---   
2        ---   W    13   18     4  1014.2   0.0             Helgoland   
15.1   
3        ---  NW    28   41    26  1012.6   0.0             List/Sylt   
17.1   
4        ---   W    21   32    43  1012.4   0.0             Schleswig   
19.1   
5        ---  --   ---  ---     5  ------  ----       Leuchtturm Kiel    ---   
6        ---   W    23   35    27  1011.8   0.0                  Kiel   
20.9   
7        ---   W    21   24     3  1011.4   0.0               Fehmarn   
19.0   
8   Windböen   W    44   58    42  1010.0   0.0                Arkona   
18.9   
9        ---  NW    15   20    11  1014.9   0.0             Norderney   
16.4   
10       ---  --   ---  ---    32  ------  ----   Leuchtt. Alte Weser    ---   
11       ---  NW    19   30     5  1014.1   0.0              Cuxhaven   
16.8   
12       ---   W    18   28    11  1013.3   0.0          Hamburg-Flh.   
17.4   
13       ---   W    14   27    59  1012.3   0.0              Schwerin   
17.9   
14       ---   W    20   30     4  1011.4   0.0               Rostock   
19.7   
15       ---   W    21   33     2  1010.8   0.0            Greifswald
....
[79 rows x 11 columns]
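The row-building step in the answer (cutting a flat list of td texts into groups as wide as the header list) can be looked at in isolation. The following is a minimal sketch with a hypothetical helper name, rows_from_cells. Unlike the answer's code, it raises a clear error when no headers were found and drops a trailing incomplete row; both are assumptions about the desired behaviour, not something the answer specifies.

```python
def rows_from_cells(headers, cells):
    """Group a flat list of cell texts into one dict per table row."""
    width = len(headers)
    if width == 0:
        # An empty header list usually means the table was not rendered yet.
        raise ValueError("no headers found; the table may not be rendered yet")
    # Ignore a trailing partial row instead of building a row with missing columns.
    usable = len(cells) - len(cells) % width
    return [dict(zip(headers, cells[i:i + width])) for i in range(0, usable, width)]


# Two columns and four cells taken from the output above.
headers = ["Station", "TEMP."]
cells = ["Helgoland", "15.1", "List/Sylt", "17.1"]

for row in rows_from_cells(headers, cells):
    print(row)
```

The resulting list of dicts can be passed straight to pd.DataFrame, as in the answer.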

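The second link uses Angular, meaning that JavaScript assembles the table, adds and rearranges elements in the DOM, and fills the page with data. I think pandas will struggle with that. What you are seeing is the content of the table before Angular has processed it. You can see this in the page source.

OK, thanks! Do I have to use a parser other than pandas?

You will have a hard time parsing a page that uses Angular, because you cannot just read a document: you need to actually render the page and then take it apart, and the time it takes Angular to finish its work is arbitrary. I'd suggest using the JSON that feeds the data to the Angular directive. You can see it in the Network tab of the inspector tools. Please be mindful of copyright and of the legal issues around the data if you plan to use this feed.

OK, thank you very much!

OK, sorry for the delay, I have now had time to test it. It does work for the German weather site, where all the "th" elements have a proper header. But if I use it on the Russian site, it produces only a few correct words and a lot of \n headers. So I don't know what is wrong with this site. The header output is: ['ППСччччча', 'Саа', '\n', '\n', '\n'] Why doesn't it work with this site?

I found that "table = soup(d.page_source, 'lxml')" does not return the "th" entries with the complete information. Maybe soup behaves differently on this site...

I think I found the error. The browser simulation does not find all the headers because the browser needs about 3 seconds to load the page completely. Inserting a time.sleep(5) before building the soup for the table solved the problem. Thank you very much!

@Martin Sorry for the late reply! I'm glad you found a solution. Thanks for letting me know.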
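A fixed pause works when the load time is known, but since the rendering time is described above as arbitrary, polling for a condition may be more robust. The sketch below shows the idea in plain Python; wait_until and headers_loaded are hypothetical names, and the fake page state stands in for a real browser. Selenium ships its own version of this pattern as WebDriverWait in selenium.webdriver.support.ui, which is probably the better choice in real code.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.5):
    """Call predicate() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(interval)


# Stand-in for a page whose headers only appear on the third check.
attempts = {"count": 0}


def headers_loaded():
    attempts["count"] += 1
    return ["Station", "TEMP."] if attempts["count"] >= 3 else []


print(wait_until(headers_loaded, timeout=5.0, interval=0.01))
```

With a real driver, the predicate would re-read the page and return the list of non-empty header texts, so the script continues as soon as the table is complete instead of always waiting the full five seconds.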