Python: Parsing an unbalanced HTML file with Beautiful Soup 4


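I am parsing a partial HTML file whose tags are not balanced.

Suppose the first line of this partial HTML file is missing. Can Beautiful Soup still parse the rest of the file, and can I still extract the information inside the various tags?

Thank you very much for your help.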

Example Domain</title>   <!-- <====missing tag in this line -->

<meta charset="utf-8" />
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<style type="text/css">
body {
    background-color: #f0f0f2;
    margin: 0;
    padding: 0;
    font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;

}
div {
    width: 600px;
    margin: 5em auto;
    padding: 50px;
    background-color: #fff;
    border-radius: 1em;
}
a:link, a:visited {
    color: #38488f;
    text-decoration: none;
}
@media (max-width: 700px) {
    body {
        background-color: #fff;
    }
    div {
        width: auto;
        margin: 0 auto;
        border-radius: 0;
        padding: 1em;
    }
}
</style>


Use any of the more advanced parsers (html5lib is more robust, but slower). The results will vary:

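from bs4 import BeautifulSoup

# lxml wraps the stray leading text in a <p> inside <html><body>
soup = BeautifulSoup(open('foo.html'), 'lxml')
#<html><body><p>Example Domain   <!-- <====missing tag in this line -->
#<meta charset="utf-8"/>

# html5lib builds a full document skeleton, including an empty <head>
soup = BeautifulSoup(open('foo.html'), 'html5lib')
#<html><head></head><body>Example Domain   <!-- <====missing tag in this line -->
#
#<meta charset="utf-8"/>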
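
To show that tag contents remain reachable even when the opening tag is missing, here is a minimal sketch. It assumes only that bs4 is installed and uses the built-in 'html.parser' so no extra parser library is needed. The fragment string and the variable names (fragment, meta_attrs, css_text) are illustrative and not taken from the question's file.

```python
from bs4 import BeautifulSoup

# Fragment modelled on the question: the opening <title> tag is missing.
fragment = """Example Domain</title>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<style type="text/css">
body { margin: 0; }
</style>
"""

# 'html.parser' ships with Python, so nothing else has to be installed.
soup = BeautifulSoup(fragment, "html.parser")

# Collect the attribute dictionaries of every <meta> tag in the fragment.
meta_attrs = [tag.attrs for tag in soup.find_all("meta")]

# The stylesheet text is the single string child of the <style> tag.
style_tag = soup.find("style")
css_text = str(style_tag.string) if style_tag is not None else ""

print(meta_attrs)
print(css_text.strip())
```

The same find_all and find calls should work unchanged with 'lxml' or 'html5lib' as the second argument. Only the surrounding tree (for example the added html, head and body wrappers) is expected to differ.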


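You need to specify a parser other than the default one. You can try lxml or html5lib. I have no experience with either.

This is the error message I get when I try lxml: "bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?" I get a similar message when I switch to the html5lib parser: "bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: html5lib. Do you need to install a parser library?" I tried to pip install both libraries, but both installs failed. I am using OS X 10.9.5 and Python 3.4.4. Any ideas are appreciated!

Did you get an error message from pip? I installed html5lib with pip, and the following code works for me:

from bs4 import BeautifulSoup; soup = BeautifulSoup("asdf", "html5lib"); print(soup)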
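
One possible way to avoid the FeatureNotFound error is to check which parser libraries can be imported before asking Beautiful Soup for them. The sketch below uses only the standard library. The helper name pick_parser and its preference order are hypothetical, not part of bs4.

```python
import importlib.util


def pick_parser(preferred=("lxml", "html5lib")):
    """Return the first importable parser name, else the built-in 'html.parser'."""
    for name in preferred:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is not None:
            return name
    return "html.parser"


parser_name = pick_parser()
print(parser_name)
```

The returned name can then be passed as the second argument to BeautifulSoup. If a library was installed but is still not found, it may have gone into a different Python installation. Running pip as python -m pip with the same interpreter that runs the script rules this out.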