Python BeautifulSoup error on meta tag
I have this function to read a saved HTML file stored on my computer:
def get_doc_ondrive(self,mypath):
    the_file = open(mypath,"r")
    line = the_file.readline()
    if(line != "")and (line!=None):
        self.soup = BeautifulSoup(line)
    else:
        print "Something is wrong with line:\n\n%r\n\n" % line
        quit()
    print "\t\t------------ line: %r ---------------\n" % line
    while line != "":
        line = the_file.readline()
        print "\t\t------------ line: %r ---------------\n" % line
        if(line != "")and (line!=None):
            print "\t\t\tinner if executes: line: %r\n" % line
            self.soup.feed(line)
    self.get_word_vector()
    self.has_doc = True
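Doing self.soup = BeautifulSoup(open(mypath, "r")) returns None, but feeding the file in line by line at least crashes and gives me something to look at.
I edited the functions listed in the traceback in BeautifulSoup.py and sgmllib.py.
When I try to run this function, I get: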
me@GIGABYTE-SERVER:code$ python test_docs.py
in sgml.finish_endtag
in _feed: inDocumentEncoding: None, fromEncoding: None, smartQuotesTo: 'html'
in UnicodeDammit.__init__: markup: '<!DOCTYPE html>\n'
in UnicodeDammit._detectEncoding: xml_data: '<!DOCTYPE html>\n'
in sgmlparser.feed: rawdata: '', data: u'<!DOCTYPE html>\n' self.goahead(0)
------------ line: '<!DOCTYPE html>\n' ---------------
------------ line: '<html dir="ltr" class="client-js ve-not-available" lang="en"><head>\n' ---------------
inner if executes: line: '<html dir="ltr" class="client-js ve-not-available" lang="en"><head>\n'
in sgmlparser.feed: rawdata: u'', data: '<html dir="ltr" class="client-js ve-not-available" lang="en"><head>\n' self.goahead(0)
in sgmlparser.goahead: end: 0,rawdata[i]: u'<', i: 0,literal:0
in sgmlparser.parse_starttag: i: 0, __starttag_text: None, start_pos: 0, rawdata: u'<html dir="ltr" class="client-js ve-not-available" lang="en"><head>\n'
in sgmlparser.goahead: end: 0,rawdata[i]: u'<', i: 61,literal:0
in sgmlparser.parse_starttag: i: 61, __starttag_text: None, start_pos: 61, rawdata: u'<html dir="ltr" class="client-js ve-not-available" lang="en"><head>\n'
------------ line: '<meta http-equiv="content-type" content="text/html; charset=UTF-8">\n' ---------------
inner if executes: line: '<meta http-equiv="content-type" content="text/html; charset=UTF-8">\n'
in sgmlparser.feed: rawdata: u'', data: '<meta http-equiv="content-type" content="text/html; charset=UTF-8">\n' self.goahead(0)
in sgmlparser.goahead: end: 0,rawdata[i]: u'<', i: 0,literal:0
in sgmlparser.parse_starttag: i: 0, __starttag_text: None, start_pos: 0, rawdata: u'<meta http-equiv="content-type" content="text/html; charset=UTF-8">\n'
in sgml.finish_starttag: tag: u'meta', attrs: [(u'http-equiv', u'content-type'), (u'content', u'text/html; charset=UTF-8')]
in start_meta: attrs: [(u'http-equiv', u'content-type'), (u'content', u'text/html; charset=UTF-8')] declaredHTMLEncoding: u'UTF-8'
in _feed: inDocumentEncoding: u'UTF-8', fromEncoding: None, smartQuotesTo: 'html'
in UnicodeDammit.__init__: markup: None
in UnicodeDammit._detectEncoding: xml_data: None
I just read through the source, and I think I understand the problem. Essentially, here is how BeautifulSoup thinks things should go:
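1. You call BeautifulSoup with the entire markup.
2. It sets self.markup to that markup.
3. It calls _feed on itself, which resets the document and parses it in the initially detected encoding.
4. While feeding itself, it finds a meta tag that declares a different encoding.
5. To use this new encoding, it calls _feed on itself again, which reparses self.markup.
6. After the first _feed and the _feed it recursed into have finished, it sets self.markup to None. (After all, we have parsed everything now; who could still need the original markup?)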
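But the way you are using it:
1. You call BeautifulSoup with the first line of the markup.
2. It sets self.markup to the first line of the markup and calls _feed.
3. _feed sees no interesting meta tag on the first line, so it finishes successfully.
4. The constructor thinks we are done parsing, so it sets self.markup back to None and returns.
5. You call feed on the BeautifulSoup object, which goes straight to the SGMLParser.feed implementation, which BeautifulSoup does not override.
6. It sees an interesting meta tag and calls _feed to parse the document in this new encoding.
7. _feed tries to construct a UnicodeDammit object from self.markup.
8. It explodes, because self.markup is None: _feed assumes it will only ever be called during that small window inside BeautifulSoup's constructor.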
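The sequence above can be reproduced with a toy model. This is a hypothetical sketch, not BeautifulSoup's actual code: ToySoup and META_CHARSET are names made up for illustration, the charset detection is heavily simplified, and only the XML_DECL pattern is copied from the _detectEncoding frame of the traceback. It is written so that it also runs on Python 3, where the TypeError message is worded differently.

```python
import re

# Pattern copied from the _detectEncoding frame in the traceback.
XML_DECL = re.compile(r'^<\?.*encoding=[\'"](.*?)[\'"].*\?>')
# Hypothetical, much-simplified detector for a meta charset declaration.
META_CHARSET = re.compile(r'charset=([\w-]+)')


class ToySoup(object):
    """Toy stand-in for the constructor / _feed / feed interplay."""

    def __init__(self, markup):
        self.markup = markup
        self.encoding = None
        self._feed()
        # The constructor assumes parsing is over and drops the markup.
        self.markup = None

    def _feed(self, encoding=None):
        # Restarting the parse needs the original markup; matching
        # against None raises TypeError, as in the real traceback.
        XML_DECL.match(self.markup)
        self.encoding = encoding
        self.feed(self.markup)

    def feed(self, data):
        # Stand-in for SGMLParser.feed plus the start_meta handler:
        # a newly declared charset triggers a full re-parse.
        found = META_CHARSET.search(data)
        if found and found.group(1) != self.encoding:
            self._feed(found.group(1))


META = '<meta http-equiv="content-type" content="text/html; charset=UTF-8">\n'

# Whole document at once: the re-parse happens while markup is still set.
whole = ToySoup('<!DOCTYPE html>\n' + META)
print(whole.encoding)

# First line only, then feed(): the re-parse finds self.markup is None.
piecewise = ToySoup('<!DOCTYPE html>\n')
try:
    piecewise.feed(META)
except TypeError as error:
    print("feed() after construction failed: %s" % error)
```

The only difference between the two runs is when the meta tag arrives relative to the moment the constructor clears self.markup.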
The moral of the story is that feed is an unsupported way of sending input to BeautifulSoup. You have to pass it all of the input at once.
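A minimal sketch of passing everything in one call might look like the following. It rests on several assumptions: it uses the bs4 package (BeautifulSoup 4) with the standard-library "html.parser" backend, because BeautifulSoup 3 does not run on Python 3; it assumes the file is UTF-8, as its meta tag declares; and read_markup and load_soup are hypothetical helper names, not part of any library.

```python
import io


def read_markup(path, encoding="utf-8"):
    """Return the whole file as one string (UTF-8 is an assumption)."""
    with io.open(path, "r", encoding=encoding) as handle:
        return handle.read()


def load_soup(path):
    """Parse the file in a single constructor call; feed() is never used."""
    # Imported here so read_markup stays usable without bs4 installed.
    from bs4 import BeautifulSoup
    return BeautifulSoup(read_markup(path), "html.parser")
```

Inside get_doc_ondrive this would replace the readline loop with something like self.soup = load_soup(mypath).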
As for why BeautifulSoup(open(mypath, "r")) returns None, I have no idea; I don't see a __new__ defined on BeautifulSoup, so it seems it would have to return a BeautifulSoup object.
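All that said, you might want to consider using BeautifulSoup 4 instead of 3. To support Python 3, it had to drop its dependency on SGMLParser, and I wouldn't be surprised if whatever bug you are hitting was fixed during that rewrite.
Why not use bs4?
I don't know what it was, but switching to bs4 worked. Thank you, sir; I think your answer will help others who Google this problem in the future.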
Traceback (most recent call last):
File "test_docs.py", line 28, in <module>
newdoc.get_doc_ondrive(testeee)
File "/home/jddancks/Capstone/Python/code/pkg/vectors/DOCUMENT.py", line 117, in get_doc_ondrive
self.soup.feed(line)
File "/usr/lib/python2.7/sgmllib.py", line 104, in feed
self.goahead(0)
File "/usr/lib/python2.7/sgmllib.py", line 139, in goahead
k = self.parse_starttag(i)
File "/usr/lib/python2.7/sgmllib.py", line 298, in parse_starttag
self.finish_starttag(tag, attrs)
File "/usr/lib/python2.7/sgmllib.py", line 348, in finish_starttag
self.handle_starttag(tag, method, attrs)
File "/usr/lib/python2.7/sgmllib.py", line 385, in handle_starttag
method(attrs)
File "/usr/lib/python2.7/dist-packages/BeautifulSoup.py", line 1618, in start_meta
self._feed(self.declaredHTMLEncoding)
File "/usr/lib/python2.7/dist-packages/BeautifulSoup.py", line 1172, in _feed
smartQuotesTo=self.smartQuotesTo, isHTML=isHTML)
File "/usr/lib/python2.7/dist-packages/BeautifulSoup.py", line 1776, in __init__
self._detectEncoding(markup, isHTML)
File "/usr/lib/python2.7/dist-packages/BeautifulSoup.py", line 1922, in _detectEncoding
'^<\?.*encoding=[\'"](.*?)[\'"].*\?>').match(xml_data)
TypeError: expected string or buffer
<meta http-equiv="content-type" content="text/html; charset=UTF-8">\n