Python: ignore header text in raw text

Tags: python, html, beautifulsoup

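I am using BeautifulSoup4 to process HTML pages. The HTML files contain header information at the top (WARC record headers followed by HTTP response headers). How can I filter this out?

Below is a snippet of one such HTML file: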

WARC/1.0
WARC-Type: response
WARC-Date: 2012-02-17T03:07:46Z
WARC-TREC-ID: clueweb12-0206wb-51-29582
WARC-Record-ID: <urn:uuid:546b127c-040e-4dee-a565-3a3f6683f898>
Content-Type: application/http; msgtype=response
Content-Length: 29032

HTTP/1.1 200 OK
Cache-Control: private
Connection: close
Date: Fri, 17 Feb 2012 03:07:48 GMT
Content-Length: 28332
Content-Type: text/html; charset=utf-8
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
X-AspNet-Version: 2.0.50727
Set-Cookie:         chkvalues=ClmZLoF4xnHoBwiZnWFzYcCMoYB/fMxYfeeJl/zhlypgwivOzw6qnVBRWzf8f19O; expires=Wed, 15-Aug-2012 02:07:48 GMT; path=/
Set-Cookie: previous-category-id=11; expires=Fri, 17-Feb-2012 03:27:48
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" >
<head id="ctl00_headTag"><title>
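raw_text ends up populated with content like "WARC/1.0\r\nWARC-Type: response\r\nWARC-Date: 2012-02-17T03:07:46Z….", which shows that the headers are being added to raw_text.

How can I remove the headers from raw_text?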

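HTTP headers are separated from the body by a blank line, so you could split the data on \r\n\r\n. However, your file contains more than one header block (the WARC record headers and then the HTTP response headers), so it is easier to use the start of the body as the delimiter: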

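from bs4 import BeautifulSoup

# contents holds the raw file text, headers included; keep only the HTML part.
try:
    contents = contents[contents.index('<!DOCTYPE'):]
except ValueError:
    # No doctype present: fall back to the opening <html> tag.
    contents = contents[contents.index('<html'):]
soup = BeautifulSoup(contents, "lxml")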
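
The blank-line split mentioned above can also work if it is applied once per header block. The following is a minimal sketch rather than a definitive implementation. It assumes that every header block starts with "WARC/" or "HTTP/" and ends with a blank line, which may not hold for every record. The helper name strip_header_blocks is hypothetical and not part of BeautifulSoup.

```python
def strip_header_blocks(raw):
    """Drop leading WARC/HTTP header blocks and return what follows.

    Hypothetical helper: assumes each header block begins with "WARC/" or
    "HTTP/" and is terminated by a blank line.
    """
    # Normalise line endings so one separator covers both \r\n and \n input.
    text = raw.replace("\r\n", "\n")
    while text.startswith(("WARC/", "HTTP/")):
        _head, sep, rest = text.partition("\n\n")
        if not sep:
            break  # no blank line found; leave the remainder untouched
        text = rest.lstrip("\n")
    return text


sample = (
    "WARC/1.0\r\nWARC-Type: response\r\n\r\n"
    "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n"
    "<html><body>hi</body></html>"
)
body = strip_header_blocks(sample)
```

Note that this sketch rewrites \r\n as \n in the body as well, which is usually harmless for HTML parsing. The result can then be passed to BeautifulSoup in the same way as contents above.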

Alternatively, keep only the lines that begin with "<":

raw_text = '\n'.join(e for e in raw_text.split('\n') if e.startswith('<'))
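
One caveat with filtering every line is that it also discards body lines that do not begin with "<", such as indented markup or plain text sitting on its own line. A possible variation, sketched here under the assumption that no header line begins with "<", drops lines only until the first markup line is reached. The function name drop_leading_headers is hypothetical.

```python
from itertools import dropwhile


def drop_leading_headers(raw_text):
    """Skip lines until the first one that starts with "<" (hypothetical helper)."""
    lines = raw_text.splitlines()  # handles both \r\n and \n line endings
    # dropwhile stops dropping at the first markup line and keeps all later lines.
    body = dropwhile(lambda line: not line.lstrip().startswith("<"), lines)
    return "\n".join(body)


sample = (
    "WARC/1.0\r\nWARC-Type: response\r\n\r\n"
    "HTTP/1.1 200 OK\r\n\r\n"
    "<html>\r\n<body>\r\nplain text line\r\n</body>\r\n</html>"
)
cleaned = drop_leading_headers(sample)
```

Here the "plain text line" inside the body survives, whereas the per-line filter would remove it.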