Python: extracting the title from HTML does not work

python html python-3.x beautifulsoup

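I am doing text analysis on a large number of novels downloaded from Project Gutenberg. I want to keep as much metadata as possible, so I download each book as HTML and convert it to text afterwards. My problem is extracting the metadata from the HTML files, specifically the title of each novel.

Right now I am using BeautifulSoup to generate the text file and extract the title. For a sample text, Jane Eyre, my code is as follows: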

from bs4 import BeautifulSoup

### Opens html file
html = open("filepath/Jane_Eyre.htm")

### Cleans html file
soup = BeautifulSoup(html, 'lxml')

title_data = soup.title.string

However, when I run this, I get the following error:

AttributeError: 'NoneType' object has no attribute 'string'

The title tag is definitely present in the original HTML; when I open the file, I see this in the first few lines:

<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=US-ASCII" />
<title>Jane Eyre</title>
<style type="text/css">


Does anyone have a suggestion about what I am doing wrong?

Try this:

title_data = soup.find(".//title").text

Or you can use another BS4 method, for example:

title_data = soup.find('title').get_text()
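
For context, the AttributeError in the question appears whenever the lookup returns None, so a guarded version of the call above may be easier to debug. The following is only a sketch: the helper name extract_title and the sample markup are made up for illustration, and it assumes the markup is already available as a string.

```python
from bs4 import BeautifulSoup


def extract_title(markup):
    """Return the <title> text, or None if the parser found no <title> tag."""
    soup = BeautifulSoup(markup, "html.parser")
    title_tag = soup.find("title")  # find() returns None when nothing matches
    if title_tag is None:
        return None
    return title_tag.get_text(strip=True)


# Illustrative markup, not the real Gutenberg file.
sample = "<html><head><title>Jane Eyre</title></head><body></body></html>"
print(extract_title(sample))
print(extract_title("<p>no title here</p>"))
```

Returning None instead of raising makes it possible to log which files in a large batch have no parsable title.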

Try using html.parser instead of lxml.
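
A minimal sketch of that suggestion, assuming the book has been saved locally. The function name title_from_file is hypothetical, and the UTF-8 default is an assumption (the file in the question declares US-ASCII, which UTF-8 also decodes).

```python
from bs4 import BeautifulSoup


def title_from_file(path, encoding="utf-8"):
    """Parse a local HTML file with html.parser and return its title, if any."""
    # errors="replace" keeps a stray undecodable byte from aborting the read.
    with open(path, encoding=encoding, errors="replace") as handle:
        soup = BeautifulSoup(handle, "html.parser")
    # soup.title is None when no <title> tag was parsed.
    return soup.title.string if soup.title is not None else None
```

Calling title_from_file("filepath/Jane_Eyre.htm") would then return either the title or None rather than raising AttributeError.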

Your html tag has a namespace, so if you try to parse it with lxml, you should respect the namespace.
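
To illustrate what respecting the namespace means when the file is parsed as XML, here is a small sketch. It uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra packages; lxml.etree accepts the same {namespace}tag syntax in find(). The sample markup is a shortened stand-in for the real file.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Shortened stand-in for the XHTML header shown in the question.
sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Jane Eyre</title></head>"
    "<body><p>text</p></body>"
    "</html>"
)

root = ET.fromstring(sample)

# Without the namespace the lookup finds nothing, because the element's
# real tag is "{http://www.w3.org/1999/xhtml}title".
unqualified = root.find(".//title")

# Qualifying the tag, either inline or through a prefix map, finds it.
qualified = root.find(f".//{{{XHTML_NS}}}title")
via_prefix = root.find(".//x:title", {"x": XHTML_NS})

print(unqualified)
print(qualified.text)
print(via_prefix.text)
```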

OP here. For anyone who comes looking for a solution, here is what I ended up doing. It is a bit clumsy, but it got me where I needed to go:

from bs4 import BeautifulSoup
import re

### Opens html file
html = open("/filepath/Jane_Eyre.htm")

### Cleans html file
soup = BeautifulSoup(html, 'html.parser')


title = re.findall(r'<title>(.*?)</title>',soup.get_text())

print(title)

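I don't know why the title tag can be found in the get_text() output but not in the HTML; I also recognize that regular expressions are suboptimal for parsing HTML. But it solved the problem for me.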
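
For comparison, the same regex idea can be applied to the raw markup without going through BeautifulSoup at all. This is a sketch only: the helper name title_from_markup is hypothetical, and the pattern assumes a single, well-formed title element.

```python
import re

# Case-insensitive, and DOTALL so a title that wraps across lines still matches.
TITLE_RE = re.compile(r"<title\b[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def title_from_markup(markup):
    """Return the first <title> text with whitespace collapsed, or None."""
    match = TITLE_RE.search(markup)
    if match is None:
        return None
    return " ".join(match.group(1).split())


print(title_from_markup("<head>\n<TITLE>Jane\n  Eyre</TITLE>\n</head>"))
```

Using search() instead of findall() returns a single string rather than a list, which is usually what is wanted for one title per file.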

Why not simply use lxml:

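from lxml import html

# source_string is assumed to hold the HTML document as a string
page = html.fromstring(source_string)
# "//title" searches the whole tree; "/title" only matches a root element named title
title = page.xpath("//title/text()")[0]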

The following approach extracts titles from a Project Gutenberg ebook HTML page:

>>> from urllib.request import Request, urlopen
>>> from bs4 import BeautifulSoup
>>> url = 'http://www.gutenberg.org/ebooks/subject/99'
>>> req = Request(url,headers={'User-Agent': 'Mozilla/5.0'})
>>> webpage = urlopen(req).read()
>>> soup = BeautifulSoup(webpage, "html.parser")
>>> required = soup.find_all("span", {"class": "title"})
>>> x1 = []
>>> for i in required:
...     x1.append(i.get_text())
...
>>> for i in x1:
...     print(i)
...
Sort Alphabetically
Sort by Release Date
Great Expectations
Jane Eyre: An Autobiography
Les Misérables
Oliver Twist
Anne of Green Gables
David Copperfield
The Secret Garden
Anne of the Island
Anne of Avonlea
A Little Princess
Kim
Anne's House of Dreams
Heidi
The Mysteries of Udolpho
Of Human Bondage
The Secret Garden
Daddy-Long-Legs
Les misérables Tome I: Fantine (French)
Jane Eyre
Rose in Bloom
Further Chronicles of Avonlea
The Children of the New Forest
Oliver Twist; or, The Parish Boy's Progress. Illustrated
The Personal History of David Copperfield
Heidi
>>>
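
The listing above includes the two sort links and some repeated titles. If only unique book titles are wanted, a small post-processing step could follow. This is a sketch: the function name clean_titles and the NAVIGATION_LABELS set are assumptions based solely on the output shown above.

```python
# Labels that appear in the output above but are page controls, not books.
NAVIGATION_LABELS = {"Sort Alphabetically", "Sort by Release Date"}


def clean_titles(titles):
    """Drop navigation labels and duplicates, keeping first-seen order."""
    kept = [t.strip() for t in titles if t.strip() not in NAVIGATION_LABELS]
    # dict.fromkeys preserves insertion order, so it de-duplicates stably.
    return list(dict.fromkeys(kept))


print(clean_titles(["Sort Alphabetically", "Heidi", "Jane Eyre", "Heidi"]))
```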

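I have run into a lot of problems with BeautifulSoup lately. Maybe it's just me, but I like to do my scraping by hand. I suggest building the scraper with re, the regex library for Python.

Still getting the same error message with either of these methods. Thanks for the reply.

Still no joy, I'm afraid: even with the parser changed, I keep getting "'NoneType' object has no attribute 'string'". Thanks anyway!

Changing the parser won't change the output. Besides, lxml is faster.

After initializing BeautifulSoup, try printing soup; you should get the whole HTML document there.

It does, but it is enormous, since the whole novel is in there. If you want me to check the tags, I can't seem to get to the beginning.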
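
One way to look at just the beginning of a very large parsed document is to slice its string form before printing. A minimal sketch, where preview is a hypothetical helper and the 500-character default is arbitrary:

```python
def preview(document, limit=500):
    """Return at most `limit` leading characters of the document's string form."""
    return str(document)[:limit]


# With a BeautifulSoup object named soup, this would be: print(preview(soup))
print(preview("<html><head><title>Jane Eyre</title></head>", limit=20))
```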
