Python web scraping involving HTML tags with attributes (tags: python, beautifulsoup, lxml, screen-scraping)


I'm trying to write a web scraper that parses a web page of publications and extracts the authors. The skeleton of the page looks like this:

<html>
<body>
<div id="container">
<div id="contents">
<table>
<tbody>
<tr>
<td class="author">####I want whatever is located here ###</td>
</tr>
</tbody>
</table>
</div>
</div>
</body>
</html>

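I realize that many of the import statements are probably redundant; I just copied whatever is currently in my larger source file.

EDIT: I suppose I didn't make this very clear, but the page contains multiple such tags and I want to scrape all of them.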

From your question, it's not clear to me why you need to worry about the div tags at all. How about just doing:

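from bs4 import BeautifulSoup  # Beautiful Soup 4 package

# html is assumed to hold the page source shown in the question
soup = BeautifulSoup(html, "html.parser")
thetd = soup.find('td', attrs={'class': 'author'})
print(thetd.string)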

On the HTML you give, running this prints exactly:

####I want whatever is located here ###

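which seems to be what you want. Maybe you can specify more precisely what you need that this super-simple snippet doesn't do: multiple td tags, all of class author, that you need to consider (all of them? just some? which ones?), the possibility that no such tag is present (what do you want to do in that case?), and the like. It's hard to infer your exact specs from just this simple example and an overabundance of code ;-)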

Edit: if, as per the OP's latest comment, there are multiple such td tags, one per author:

# find_all returns a (possibly empty) list of every matching tag
thetds = soup.find_all('td', attrs={'class': 'author'})
for thetd in thetds:
    print(thetd.string)

...that is, not much harder at all! ;-)
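If installing a third-party parser is not an option, roughly the same extraction can be sketched with html.parser from the standard library. This is a swapped-in technique, not what the answer above uses. The names AuthorCellParser, extract_authors and the sample markup are hypothetical, chosen only for illustration. It also shows one way to treat the "no such tag" case: you simply get an empty list.

```python
from html.parser import HTMLParser


class AuthorCellParser(HTMLParser):
    """Collect the text of every <td class="author"> cell (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities arrive already decoded
        self.authors = []      # finished cell texts, in document order
        self._buffer = None    # text chunks while inside a matching cell, else None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute names are lowercased
        classes = (dict(attrs).get("class") or "").split()
        if tag == "td" and "author" in classes:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._buffer is not None:
            self.authors.append("".join(self._buffer).strip())
            self._buffer = None


def extract_authors(html):
    """Return the text of all author cells; an empty list if there are none."""
    parser = AuthorCellParser()
    parser.feed(html)
    parser.close()
    return parser.authors


# Made-up sample markup with two author cells and one unrelated cell
sample = """<table><tr>
<td class="author">Alice &amp; co.</td>
<td class="title">Some paper</td>
<td class="author">Bob</td>
</tr></table>"""

print(extract_authors(sample))                    # ['Alice & co.', 'Bob']
print(extract_authors("<p>no authors here</p>"))  # []
```

Nested tables inside an author cell would need a depth counter, which this sketch leaves out.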

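BeautifulSoup is certainly the canonical HTML parser/processor. But if you only need to match this kind of snippet, rather than build a complete hierarchical object representing the HTML, pyparsing makes it easy to define opening and closing HTML tags as part of a larger search expression: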

from pyparsing import makeHTMLTags, withAttribute, SkipTo

author_td, end_td = makeHTMLTags("td")

# only interested in <td>'s where class="author"
author_td.setParseAction(withAttribute(("class","author")))

search = author_td + SkipTo(end_td)("body") + end_td

for match in search.searchString(html):
    print(match.body)


Pyparsing's makeHTMLTags function does a lot more than just emit "<tag>" and "</tag>" expressions. It also handles:

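caseless matching of tags
the "<tag/>" syntax
zero or more attributes in the opening tag
attributes defined in arbitrary order
attribute names with namespaces
attribute values in single quotes, double quotes, or no quotes
intervening whitespace between the tag and symbols, or between the attribute name, "=", and the value
attributes accessible after parsing as named results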

These are the common pitfalls when considering a regular expression for HTML scraping.
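To make those pitfalls concrete, here is a small sketch using only the standard re module. The NAIVE pattern and the markup variants are made up for illustration. Each variant is valid HTML for the same cell, yet the pattern only recognizes the one spelling it hard-codes.

```python
import re

# A naive pattern that hard-codes one exact spelling of the opening tag.
NAIVE = re.compile(r'<td class="author">(.*?)</td>', re.DOTALL)

# Equivalent ways a page might write the same cell (illustrative only).
variants = {
    "exact": '<td class="author">Alice</td>',
    "uppercase tag": '<TD class="author">Alice</TD>',
    "single quotes": "<td class='author'>Alice</td>",
    "extra attribute first": '<td id="a1" class="author">Alice</td>',
    "space around =": '<td class = "author">Alice</td>',
}

results = {name: bool(NAIVE.search(markup)) for name, markup in variants.items()}
for name, matched in results.items():
    print(f"{name}: {'matched' if matched else 'missed'}")
```

Only the "exact" variant matches; the other four are missed. You could keep patching the pattern for each case, but that is essentially re-implementing what a tag-aware parser already does.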

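Or you could use pyquery, since BeautifulSoup was no longer actively maintained when this answer was written.

First, install pyquery with

pip install pyquery

Then your script could be as simple as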

from pyquery import PyQuery
d = PyQuery('http://mywebpage/')
# iterating a PyQuery object yields lxml elements, whose .text is an attribute, not a method
allauthors = [td.text for td in d('td.author')]

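pyquery uses the CSS selector syntax familiar from jQuery, which I find more intuitive than BeautifulSoup's API. It uses lxml underneath and is much faster than BeautifulSoup. But BeautifulSoup is pure Python, and thus works on Google's App Engine as well.

The lxml library is now more or less the standard for parsing HTML in Python. The interface can seem awkward at first, but it is very serviceable for what it does.

You should let the library handle the XML specifics, such as those escaped &entities;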

import lxml.html

html = """<html><body><div id="container"><div id="contents"><table><tbody><tr>
          <td class="author">####I want whatever is located here, eh? &iacute; ###</td>
          </tr></tbody></table></div></div></body></html>"""

root = lxml.html.fromstring(html)
tds = root.cssselect("div#contents td.author")

print(tds)           # gives something like [<Element td at 0x84ee2cc>]
print(tds[0].text)   # what you want, including the 'í'
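To see why leaving entities to the library matters, here is a minimal standard-library sketch. The cell content is made up, and html.unescape is used only as a stand-in for the decoding step a parser such as lxml performs for you. Text pulled out by hand still contains the raw entity references.

```python
import html
import re

# Made-up cell whose text contains escaped entities
cell = '<td class="author">Jos&eacute; Garc&iacute;a</td>'

# Hand-rolled extraction returns the raw characters, entities included.
raw_text = re.search(r">(.*?)<", cell).group(1)
print(raw_text)   # Jos&eacute; Garc&iacute;a

# The decoding step that an HTML parser would otherwise do for you.
decoded = html.unescape(raw_text)
print(decoded)    # José García
```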


Thanks, Alex. I have multiple authors on the page, so I'll have multiple td tags. How do I iterate over each of them?
