Python BeautifulSoup returns only the contents of the head tag

python url web-crawler


I'm working with BeautifulSoup, and either I've found a bug or I'm making a mistake. In my case, I'm scraping a subsection of the New York Times website:

import urllib2
from bs4 import BeautifulSoup
website = "http://www.nytimes.com/pages/politics/index.html"
data = BeautifulSoup(urllib2.urlopen(website).read())
print data

When I run the code, what comes back is the head tag and its contents, but nothing from the body tag. If I change the website URL to

http://www.nytimes.com

then BS returns the full page source. What is going on here, and why don't I get the body tag when crawling

http://www.nytimes.com/pages/politics/index.html

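This is not a bug in BeautifulSoup. The problem is that bs4, when no other parser is installed or requested, uses Python's built-in HTMLParser, which is not very tolerant of malformed HTML. As it turns out, the HTML of this page is indeed malformed: it contains a few unclosed, stray, and misplaced tags, which cause HTMLParser, and in turn BeautifulSoup, to stop parsing abruptly.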

This problem has been explained in a bug filed against BeautifulSoup.
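
A common way around a strict parser is to name a more lenient one explicitly when constructing the soup. The sketch below is only an illustration, not code from the answer: make_soup and PREFERRED_PARSERS are hypothetical names, html5lib and lxml are optional third-party packages that may not be installed, and the sample markup is a stand-in rather than the actual New York Times page. Whether a given parser recovers the body of that particular page has not been checked here.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Parsers to try, most lenient first. html5lib and lxml are optional
# third-party packages; "html.parser" ships with Python itself.
PREFERRED_PARSERS = ("html5lib", "lxml", "html.parser")


def make_soup(markup, parsers=PREFERRED_PARSERS):
    """Parse markup with the first parser in `parsers` that is installed.

    Hypothetical helper: returns a (soup, parser_name) tuple.
    """
    for name in parsers:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the requested parser is missing.
            continue
    raise RuntimeError("none of the requested parsers is installed")


if __name__ == "__main__":
    # Stand-in for a downloaded page with unclosed and stray tags.
    sample = ("<html><head><title>Politics</title></head>"
              "<body><p>Unclosed paragraph<div>stray</body></html>")
    soup, used = make_soup(sample)
    print("parser used: %s" % used)
    print("body found: %s" % (soup.body is not None))
```

Naming the parser explicitly also keeps the result from depending on which parsers happen to be installed on a given machine, which may explain why the same code behaves differently for different people.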

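Cannot reproduce. When I run this code, it gets the full page, not just the head tag...

The code in the question is word for word what is in my file. I should add that it also grabs the html tag. It is as if the body tag did not exist.

What version of BeautifulSoup are you using? To be clear, if you print data.body after the code above (and only the code above), does it print None? (For me, it prints the contents of the body tag.)

I have BS4, and yes, it prints None.

Interesting... Older versions did not have this problem, so this looks like a bug.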
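
The check discussed in these comments (whether data.body is None) can be repeated for every parser that is installed, which helps tell a parser-specific problem from a problem with the download itself. This is a minimal sketch under assumptions: body_found_by_parser is a hypothetical helper, the parser names are the ones bs4 accepts when the corresponding packages are present, and the markup is a stand-in for the page source.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def body_found_by_parser(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Map each installed parser name to True if it produced a <body> tag."""
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # Parser not installed; leave it out of the report.
            continue
        results[name] = soup.body is not None
    return results


if __name__ == "__main__":
    # Stand-in markup; in the question this would be the downloaded page source.
    markup = "<html><head><title>t</title></head><body><p>text</p></body></html>"
    for name, found in sorted(body_found_by_parser(markup).items()):
        print("%s: body found = %s" % (name, found))
```

If the result differs between parsers for the same markup, the markup is most likely malformed in a way that only some parsers tolerate.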



