Python: extracting the title from HTML is not working


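I'm doing text analysis on a large number of novels downloaded from Project Gutenberg. I want to keep as much metadata as possible, so I download them as HTML and convert them to text afterwards. My problem is extracting metadata from the HTML files, specifically the title of each novel.

Right now I'm using BeautifulSoup to generate the text files and extract the title. For a sample text, Jane Eyre, my code is as follows: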

from bs4 import BeautifulSoup

### Opens html file
html = open("filepath/Jane_Eyre.htm")

### Cleans html file
soup = BeautifulSoup(html, 'lxml')

title_data = soup.title.string
However, when I run this, I get the following error:

AttributeError: 'NoneType' object has no attribute 'string'
The title tag definitely exists in the original HTML; when I open the file, I see this in the first few lines:

<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=US-ASCII" />
<title>Jane Eyre</title>
<style type="text/css">

Does anyone have any suggestions about what I'm doing wrong?
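One way to cross-check whether the file really contains a parseable title, independent of BeautifulSoup and of whichever parser it picks, is the standard library's html.parser. The following is only a minimal sketch: TitleGrabber and extract_title are hypothetical names, and the sample string is a trimmed copy of the header shown above with a made-up body, not the real file.

```python
from html.parser import HTMLParser


class TitleGrabber(HTMLParser):
    """Collects the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag names before calling this method.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so collect and join later.
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        text = "".join(self._parts).strip()
        return text or None


def extract_title(markup):
    """Return the text of the first <title> in markup, or None if absent."""
    grabber = TitleGrabber()
    grabber.feed(markup)
    grabber.close()
    return grabber.title


# Assumed sample: header copied from the question, body invented for the demo.
sample = """<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=US-ASCII" />
<title>Jane Eyre</title>
</head>
<body><p>Sample body text.</p></body>
</html>"""

print(extract_title(sample))
```

If this returns None on the real file as well, the problem is more likely in the file's contents than in BeautifulSoup.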

Try this:

# find() takes a tag name, not an XPath expression
title_data = soup.find("title").text


You can use other BS4 methods, for example:

title_data = soup.find('title').get_text()

Try using html.parser instead of lxml, e.g.:

soup = BeautifulSoup(html, 'html.parser')


Your html tag has a namespace, so if you parse the file with lxml, you need to respect that namespace.
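To illustrate the namespace point, here is a small sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml.etree, which accepts the same {namespace-uri}tag syntax in find(). The sample is a trimmed copy of the header from the question with an invented body, and the approach assumes the file is well-formed XML without undefined entities, which a real Gutenberg file may not be.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the xmlns attribute on the <html> tag.
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Assumed sample: header copied from the question, body invented for the demo.
sample = """<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=US-ASCII" />
<title>Jane Eyre</title>
</head>
<body><p>Sample body text.</p></body>
</html>"""

root = ET.fromstring(sample)

# A bare tag name finds nothing: every element lives in the XHTML namespace.
plain = root.find(".//title")

# Qualifying the tag with the namespace URI in braces does match.
qualified = root.find(".//{%s}title" % XHTML_NS)

print(plain)
print(qualified.text)
```

The bare lookup comes back as None while the qualified one finds the element, which is the behaviour this answer is warning about when the document is treated as XML.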

OP here. For anyone who comes looking for a solution, here is what I did. It's a bit of a hack, but it got me where I needed to go:

from bs4 import BeautifulSoup
import re

### Opens html file
html = open("/filepath/Jane_Eyre.htm")

### Cleans html file
soup = BeautifulSoup(html, 'html.parser')


title = re.findall(r'<title>(.*?)</title>',soup.get_text())

print(title)

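I don't know why the title tag works in the get_text() version but not in the html; I also recognize that regex is suboptimal for HTML parsing. But it solved the problem for me.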
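For comparison, the same regex idea can be applied to the raw file text without going through BeautifulSoup at all. get_text() normally drops tags, so a pattern containing <title> would only match there if the markup had ended up as literal text; whether that is what happened with the OP's file is a guess. The sketch below uses a hypothetical helper name, title_from_raw_html, and an invented sample string.

```python
import re

# Tolerates attributes on the tag, upper-case tags, and titles spanning lines.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def title_from_raw_html(raw):
    """Return the first <title> text found in raw markup, or None."""
    match = TITLE_RE.search(raw)
    if match is None:
        return None
    # Collapse line breaks and repeated spaces inside the title.
    return " ".join(match.group(1).split())


# Invented sample with an upper-case tag and stray whitespace.
raw = "<html><head>\n<TITLE>\n  Jane Eyre\n</TITLE></head><body></body></html>"
print(title_from_raw_html(raw))
```

As the OP says, regex is fragile for HTML in general, so this is better treated as a quick diagnostic than as a replacement for a parser.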

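Why not simply use lxml?

from lxml import html
# source_string is the HTML document as a string
page = html.fromstring(source_string)
# "//title" matches <title> anywhere; "/title" would only match a root element named title
title = page.xpath("//title/text()")[0]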

The following approach extracts titles from a Project Gutenberg ebook listing page:

>>> from urllib.request import Request, urlopen
>>> from bs4 import BeautifulSoup
>>> url = 'http://www.gutenberg.org/ebooks/subject/99'
>>> req = Request(url,headers={'User-Agent': 'Mozilla/5.0'})
>>> webpage = urlopen(req).read()
>>> soup = BeautifulSoup(webpage, "html.parser")
>>> required = soup.find_all("span", {"class": "title"})
>>> x1 = []
>>> for i in required:
...     x1.append(i.get_text())
...
>>> for i in x1:
...     print(i)
...
Sort Alphabetically
Sort by Release Date
Great Expectations
Jane Eyre: An Autobiography
Les Misérables
Oliver Twist
Anne of Green Gables
David Copperfield
The Secret Garden
Anne of the Island
Anne of Avonlea
A Little Princess
Kim
Anne's House of Dreams
Heidi
The Mysteries of Udolpho
Of Human Bondage
The Secret Garden
Daddy-Long-Legs
Les misérables Tome I: Fantine (French)
Jane Eyre
Rose in Bloom
Further Chronicles of Avonlea
The Children of the New Forest
Oliver Twist; or, The Parish Boy's Progress. Illustrated
The Personal History of David Copperfield
Heidi
>>>

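Comments:

Lately I've found that BeautifulSoup has a lot of problems. Maybe it's just me, but I like to do my scraping by hand. I'd suggest building a scraper with Python's re regex library.

Still getting the same error message with either of these methods. Thanks for the reply.

Still no joy, I'm afraid: even with the parser changed, I keep getting "'NoneType' object has no attribute 'string'". Thanks anyway!

Changing the parser doesn't change the output. Besides, lxml is faster.

After initializing BeautifulSoup, try print(soup); you should get the whole HTML document there.

It does, but it's huge: the entire novel is in there. If you want me to check the tags, I can't seem to get to the beginning.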
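If printing the whole document scrolls the header out of reach, one option is to look only at the first few lines of the file on disk. This is a minimal sketch: head_of_file is a hypothetical helper, the encoding is an assumption, and the demo writes a throwaway temporary file rather than touching the real novel.

```python
import os
import tempfile
from itertools import islice


def head_of_file(path, n_lines=15, encoding="utf-8"):
    """Return the first n_lines of a text file as one string."""
    # errors="replace" keeps odd bytes from aborting the preview.
    with open(path, encoding=encoding, errors="replace") as handle:
        return "".join(islice(handle, n_lines))


# Demo on a temporary stand-in file with a long invented body.
with tempfile.NamedTemporaryFile(
    "w", suffix=".htm", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("<html>\n<head>\n<title>Jane Eyre</title>\n</head>\n")
    tmp.write("<p>text</p>\n" * 1000)
    demo_path = tmp.name

preview = head_of_file(demo_path, n_lines=4)
os.remove(demo_path)
print(preview)
```

The same slicing idea works on the parsed tree too, for example by printing only the first few hundred characters of str(soup).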