
Python BeautifulSoup - resolving lxml and html5lib parser differences


I am using BeautifulSoup 4 with Python 2.7. I want to extract certain elements from a website (quantities, see the example below). For some reason the lxml parser does not let me pull out all of the elements I need from the page; it only prints the first three of them. I am trying the html5lib parser to see whether I can extract all of them.

The page contains multiple items with their prices and quantities. The part of the markup that holds the information I need for each item looks like this:

<td class="size-price last first" colspan="4">
                    <span>453 grams </span>
            <span> <span class="strike">$619.06</span> <span class="price">$523.91</span>
                    </span>
                </td>
Case 1:

#! /usr/bin/python
from bs4 import BeautifulSoup
data = """
<td class="size-price last first" colspan="4">
                    <span>453 grams </span>
            <span> <span class="strike">$619.06</span> <span class="price">$523.91</span>
                    </span>
                </td>"""
soup = BeautifulSoup(data)
print soup.td.span.text

Prints:

453 grams 

Case 2 - LXML:

#! /usr/bin/python
from bs4 import BeautifulSoup
from urllib import urlopen
webpage = urlopen('The URL goes here')
soup = BeautifulSoup(webpage, "lxml")
print soup.find('td', {'class': 'size-price'}).span.text

Prints:

453 grams
Case 3 - HTML5LIB:

#! /usr/bin/python
from bs4 import BeautifulSoup
from urllib import urlopen
webpage = urlopen('The URL goes here')
soup = BeautifulSoup(webpage, "html5lib")
print soup.find('td', {'class': 'size-price'}).span.text
I get the following error:

Traceback (most recent call last):
  File "C:\Users\Dom\Python-Code\src\Testing-Code.py", line 6, in <module>
    print soup.find('td', {'class': 'size-price'}).span.text
AttributeError: 'NoneType' object has no attribute 'span'
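For reference, the AttributeError above simply means that find() returned None for this parser. A minimal sketch under the same assumptions as the question (Python 2.7, and 'The URL goes here' is still a placeholder, not a real address) that guards against a missing cell and collects every matching quantity with find_all() might look like this:

#! /usr/bin/python
# Sketch only: guard against find()/find_all() coming back empty and
# walk every matching cell instead of only the first one.
from bs4 import BeautifulSoup
from urllib import urlopen

webpage = urlopen('The URL goes here')  # placeholder URL from the question
soup = BeautifulSoup(webpage, "html5lib")

cells = soup.find_all('td', {'class': 'size-price'})
if not cells:
    print "no size-price cells found with this parser"
for cell in cells:
    quantity = cell.find('span')                    # first span holds the quantity
    price = cell.find('span', {'class': 'price'})   # discounted price
    if quantity is not None:
        print quantity.text.strip()
    if price is not None:
        print price.text.strip()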
Try the following:

    from bs4 import BeautifulSoup
    data = """
    <td class="size-price last first" colspan="4">
                <span>453 grams </span>
        <span> <span class="strike">$619.06</span> <span class="price">$523.91</span>
                </span>
            </td>"""
    soup = BeautifulSoup(data)
    text = soup.get_text(strip=True)
    print text
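As a usage note (my own addition, not part of the original answer): get_text(strip=True) glues all of the strings together, so passing a separator keeps the quantity and the two prices readable, and selecting the first span isolates just the quantity. A small sketch, assuming the built-in html.parser backend:

# Sketch: same fragment as above, with a separator so the fields stay apart,
# and with the first span selected to get only the quantity.
from bs4 import BeautifulSoup

data = """
<td class="size-price last first" colspan="4">
            <span>453 grams </span>
    <span> <span class="strike">$619.06</span> <span class="price">$523.91</span>
            </span>
        </td>"""

soup = BeautifulSoup(data, "html.parser")
print soup.get_text(" ", strip=True)        # e.g. 453 grams $619.06 $523.91
print soup.td.span.get_text(strip=True)     # just the quantity: 453 grams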

html5lib omits the td tag and puts everything into the html body. That happens because there is no table markup around the td, and html5lib enforces that.

Interesting. So how should I extract the elements I want when using html5lib?

Well, why use html5lib at all? FYI, you can also use html.parser, e.g. BeautifulSoup(webpage, 'html.parser').

When I try that I get a runtime warning: "Python's built-in HTMLParser cannot parse the given document. This is not a bug in Beautiful Soup. The best solution is to install an external parser (lxml or html5lib), and use Beautiful Soup with that parser. See ... for help."

Gotcha :) Can I see the whole html document (or a link to it) that you are trying to parse?