Python Beautiful Soup: <div> and <p> into a dictionary


python html parsing


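I'm working with a messy chunk of HTML that holds store location data, and I'm having trouble parsing it cleanly. I've read several other posts here, but none of them got me to a working result.

Here is part of the HTML, taken from a txt file: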

"
                    <div class=""location"">
                        <h2>
                            <a href=""/Locations/AL/5-Points-In-Line"">5 Points In-Line</a>
                        </h2>

                        <p>
                            2000 Highland Ave S
                            <br/>
                            Birmingham, AL 35205
                            <br/>
                            (205) 930-8000                        
                        </p>
                    </div>
                    <div class=""location"">
                        <h2>

                            <a href=""/Locations/AL/Airport-Blvd-AL"">Airport Blvd (AL)</a>
                        </h2>

                        <p>
                            4707 Airport Blvd
                            <br/>Mobile, AL 36608
                                <br/>
(251) 461-9933                        </p>
                    </div>
                    <div class=""location"">
                        <h2>

                            <a href=""/Locations/AL/Alabama-Power"">Alabama Power</a>
                        </h2>

                        <p>
                            600 18th St N
                            <br/>Birmingham, AL 35203
                                <br/>
(205) 257-1688                        </p>
                    </div>

I get a KeyError: '5 Points In-Line'

I've looked at similar posts but couldn't get a working result. I think I have to parse these files.

You can use find_next() and add the values to a dict:

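from bs4 import BeautifulSoup

# `html` is the HTML text read from the txt file shown in the question
soup = BeautifulSoup(html, 'html.parser')

output = {}
for tag in soup.select('h2 a'):
    # Key: the location name in the link; value: the text of the next <p> after it
    address = tag.find_next('p').get_text(strip=True, separator=' ')
    output.setdefault(tag.get_text(), []).append(address)

print(output)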

Output:

{'5 Points In-Line': ['2000 Highland Ave S Birmingham, AL 35205 (205) 930-8000'], 'Airport Blvd (AL)': ['4707 Airport Blvd Mobile, AL 36608 (251) 461-9933'], 'Alabama Power': ['600 18th St N Birmingham, AL 35203 (205) 257-1688']}
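
The question does not show the code that raised the KeyError, so the cause can only be guessed at. One common way to get this exact error is appending to a dictionary key that has not been created yet, which is what setdefault() in the answer avoids. A minimal sketch of the difference, using hypothetical (name, address) pairs in place of the scraped values:

```python
# Hypothetical (name, address) pairs standing in for values scraped from the page.
rows = [
    ("5 Points In-Line", "2000 Highland Ave S Birmingham, AL 35205 (205) 930-8000"),
    ("Alabama Power", "600 18th St N Birmingham, AL 35203 (205) 257-1688"),
]

# Appending to a key that has not been created yet raises KeyError.
broken = {}
try:
    broken["5 Points In-Line"].append(rows[0][1])
except KeyError as exc:
    error_message = f"KeyError: {exc}"
    print(error_message)

# setdefault() stores an empty list the first time a key is seen and returns it,
# so the append always has a list to work on.
output = {}
for name, address in rows:
    output.setdefault(name, []).append(address)

print(output)
```

collections.defaultdict(list) would work the same way here; setdefault() just keeps the result a plain dict.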

I reformatted the HTML properly; that is what my file looks like. This saved me. Is there a way to add a separator between "S" and "Birmingham"?

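
On the separator question: the separator argument of get_text() is the string placed between the text fragments, so changing it from a space to something else should be enough. A minimal sketch, assuming a single <p> shaped like the ones in the question (the " | " delimiter is an arbitrary choice for illustration):

```python
from bs4 import BeautifulSoup

# One address paragraph modelled on the question's markup.
html = """<p>
    2000 Highland Ave S
    <br/>
    Birmingham, AL 35205
    <br/>
    (205) 930-8000
</p>"""

p = BeautifulSoup(html, "html.parser").p

# Option 1: join the fragments with a visible delimiter instead of a space.
joined = p.get_text(separator=" | ", strip=True)
print(joined)

# Option 2: keep the fragments separate as a list (street, city line, phone).
parts = list(p.stripped_strings)
print(parts)
```

Keeping the fragments as a list may be the more convenient form if the street, city, and phone number need to go into separate fields later.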