Problems with BeautifulSoup parsing in Python

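I am trying to parse an HTML page with BeautifulSoup, but BeautifulSoup does not seem to like the HTML of this page at all. When I run the code below, prettify() returns only the script block of the page (see below). Does anyone know why this happens?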

import urllib2
from BeautifulSoup import BeautifulSoup

url = "http://www.futureshop.ca/catalog/subclass.asp?catid=10607&mfr=&logon=&langid=FR&sort=0&page=1"
html = "".join(urllib2.urlopen(url).readlines())
print "-- HTML ------------------------------------------"
print html
print "-- BeautifulSoup ---------------------------------"
print BeautifulSoup(html).prettify()

This produces the following output:

-- BeautifulSoup ---------------------------------
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<script language="JavaScript">
 <!--
     function highlight(img) {
       document[img].src = "/marketing/sony/images/en/" + img + "_on.gif";
     }

     function unhighlight(img) {
       document[img].src = "/marketing/sony/images/en/" + img + "_off.gif";
     }
//-->
</script>

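BeautifulSoup isn't magic: if the incoming HTML is too horrible, it isn't going to work.

In this case, the incoming HTML is exactly that: too broken for BeautifulSoup to figure out what to do. For example, it contains markup like this: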

<SCRIPT type=""javascript"">

(Note the doubled quotation marks.)

The BeautifulSoup documentation contains a section on what you can do if BeautifulSoup cannot parse your markup. You will need to investigate those alternatives.
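
One such alternative is to pre-clean the markup before handing it to the parser. The following is only a minimal sketch, assuming the doubled quotes noted above are the main problem; the helper name fix_doubled_quotes and the regular expression are illustrative and are not taken from the BeautifulSoup documentation. It uses only the standard library and is written for Python 3.

```python
import re

# Hypothetical helper (not part of BeautifulSoup): collapses attribute values
# wrapped in doubled quotes, e.g. type=""javascript"" -> type="javascript".
# Assumption: the value contains no whitespace, so legitimately empty
# attributes such as alt="" are left untouched.
_DOUBLED_QUOTES = re.compile(r'=""([^"\s<>=]+)""')


def fix_doubled_quotes(html):
    """Return html with doubled attribute quotes reduced to single ones."""
    return _DOUBLED_QUOTES.sub(r'="\1"', html)


broken = '<SCRIPT type=""javascript""></SCRIPT><img alt="" title="">'
print(fix_doubled_quotes(broken))
# prints: <SCRIPT type="javascript"></SCRIPT><img alt="" title="">
```

The cleaned string would then be passed to the parser in place of the raw page. Attribute values that contain spaces are deliberately not handled here, because distinguishing them from adjacent empty attributes needs more context than a single regular expression has.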

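I tested this script with BeautifulSoup version "3.0.7a" and it returned what appears to be correct output. I don't know what changed between "3.0.7a" and "3.1.0.1", but give it a try.

Use version 3.0.7a, as suggested. BeautifulSoup 3.1 was designed to be compatible with Python 3.0, so the parser had to be changed from SGMLParser to HTMLParser, which seems to be more vulnerable to bad HTML.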

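From the BeautifulSoup changelog:

"Beautiful Soup is now based on HTMLParser rather than SGMLParser, which is gone in Python 3. There's some bad HTML that SGMLParser handled but HTMLParser doesn't."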

In my case, executing the statements above returns the entire HTML page.

I also ran into problems parsing the following code:

<script>
        function show_ads() {
          document.write("<div><sc"+"ript type='text/javascript'src='http://pagead2.googlesyndication.com/pagead/show_ads.js'></scr"+"ipt></div>");
        }
</script>


HTMLParseError: bad end tag: u"", at line 26, column 127

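Try lxml. Despite its name, it is also meant for parsing and scraping HTML. It is much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup does, so it may work better for you. It also provides a compatibility API for BeautifulSoup if you don't want to learn the lxml API.

There is no reason to use BeautifulSoup anymore, unless you are on Google App Engine or some other environment where anything that is not pure Python is not allowed.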

Samj: If I get something like

HTMLParser.HTMLParseError: bad end tag: u""

I just remove the culprit from the markup before serving it to BeautifulSoup, and all is dandy:

html = urllib2.urlopen(url).read()
html = html.replace("</scr' + 'ipt>","")
soup = BeautifulSoup(html)

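Please give a proper reason before downvoting anyone; that would be the decent thing to do. And if you did not understand my answer, then God help you. Here is more information: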

<script>
        function show_ads() {
          document.write("<div><sc"+"ript type='text/javascript'src='http://pagead2.googlesyndication.com/pagead/show_ads.js'></scr"+"ipt></div>");
        }
</script>

html = urllib2.urlopen(url).read()
html = html.replace("</scr' + 'ipt>","")
soup = BeautifulSoup(html)
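
Note that the replace() call above matches only the exact text </scr' + 'ipt>, whereas the show_ads() snippet splits the tag as </scr"+"ipt>, so that particular page would pass through unchanged. A slightly more general variant of the same idea can be sketched with a regular expression. This is an illustration only: the helper name remove_split_script_end_tags is hypothetical, the pattern covers just the two quoting styles that appear in this thread, and the code is written for Python 3 using only the standard library.

```python
import re

# Matches a closing script tag that was split by string concatenation inside
# document.write(), e.g. </scr"+"ipt> or </scr' + 'ipt>.
# Assumption: the tag is always split exactly between "scr" and "ipt".
SPLIT_SCRIPT_END = re.compile(r"""</scr["']\s*\+\s*["']ipt>""", re.IGNORECASE)


def remove_split_script_end_tags(html):
    """Drop split </script> end tags so an older strict parser does not choke."""
    return SPLIT_SCRIPT_END.sub("", html)


sample = 'document.write("<div><sc"+"ript src=\'a.js\'></scr"+"ipt></div>");'
print(remove_split_script_end_tags(sample))
# prints: document.write("<div><sc"+"ript src='a.js'></div>");
```

As with the original replace() approach, this changes the JavaScript inside the page, so it is only suitable when the goal is to scrape the surrounding HTML rather than to preserve the script.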