Python: trouble parsing HTML with BeautifulSoup


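I'm trying to use BeautifulSoup to parse some HTML in Python. Specifically, I'm trying to build two arrays of soup objects: one for the dates of the posts on a website and one for the posts themselves. However, when I use findAll on the div class that matches the posts, only the opening tag is returned, not the text inside the tag. My code works fine for the dates, on the other hand. What is going on?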

# store all texts of posts
texts = soup.findAll("div", {"class":"quote"})

# store all dates of posts
dates = soup.findAll("div", {"class":"datetab"})

The first line above returns only

<div class="quote">

which is not what I want. The second line returns

<div class="datetab">Feb<span>2</span></div>


which is what I want (before further refining).

I have no idea what I'm doing wrong. Here is the website I'm trying to parse. This is for homework, and I'm really desperate.

That website is powered by Tumblr. Tumblr has an API.


There is a Python library for it that you can use to read the posts:

from tumblr import Api

api = Api('harvardfml.com')
freq = {}
posts = api.read()
for post in posts:
    # do something with each post here
    pass

As for the misbehaving findAll, it is hard to see what is going wrong without the actual source code of your program.
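One way to narrow this kind of problem down is to look at what the parser actually placed inside each matched tag. The following is only a rough sketch, assuming the newer `bs4` package (beautifulsoup4) rather than the BeautifulSoup 3 module used elsewhere in this thread; `describe_matches` is a hypothetical helper name, and the markup in the usage note is made up for illustration.

```python
from bs4 import BeautifulSoup  # assumption: beautifulsoup4 is installed


def describe_matches(html, name, css_class):
    """Return (number of children, visible text) for every matching tag.

    A match with zero children means the parser closed the tag right away,
    so the text you expected ended up somewhere else in the tree.
    """
    soup = BeautifulSoup(html, "html.parser")
    report = []
    for tag in soup.find_all(name, class_=css_class):
        # Join the text pieces with a space so nested tags stay readable.
        report.append((len(tag.contents), tag.get_text(" ", strip=True)))
    return report
```

Calling `describe_matches('<div class="quote"></div>', "div", "quote")` on an empty div reports a child count of 0, which is the same symptom as getting back only the opening tag.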


Which version of BeautifulSoup are you using? Version 3.1.0 handles real-world HTML (read: invalid HTML) considerably worse than 3.0.8. This code works with 3.0.8:

import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("http://harvardfml.com/")
soup = BeautifulSoup(page)
for incident in soup.findAll('span', { "class" : "quote" }):
    print incident.contents
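For readers on Python 3, roughly the same extraction might look like the sketch below. It assumes the `bs4` package (beautifulsoup4) and the standard-library `html.parser` backend, neither of which appears in the original thread, and it runs on a made-up HTML string modeled on the tags quoted in the question instead of fetching the live site. `SAMPLE_HTML` and `extract_posts` are hypothetical names.

```python
from bs4 import BeautifulSoup  # assumption: beautifulsoup4 is installed

# Hypothetical sample markup modeled on the tags quoted in the question.
SAMPLE_HTML = """
<div class="datetab">Feb<span>2</span></div>
<div class="quote">First post text</div>
<div class="datetab">Feb<span>3</span></div>
<div class="quote">Second post text</div>
"""


def extract_posts(html):
    """Return (dates, texts) as two lists of plain strings."""
    soup = BeautifulSoup(html, "html.parser")
    # get_text() concatenates the text of the tag and all of its descendants.
    dates = [tag.get_text() for tag in soup.find_all("div", class_="datetab")]
    texts = [tag.get_text(strip=True) for tag in soup.find_all("div", class_="quote")]
    return dates, texts


dates, texts = extract_posts(SAMPLE_HTML)
print(dates)
print(texts)
```

If the real page is malformed, the result can depend on which parser backend is chosen, so it may be worth trying more than one.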
