Extracting information from an HTML page with Python lxml
I'm trying to write a Python script, with my limited knowledge, to grab specific information from a webpage, but I don't think my limited knowledge is enough. I need to extract 7-8 pieces of information. The tags look like this:
You can use the lxml and csv modules to do what you want. lxml supports XPath expressions for selecting the elements you need:
from io import StringIO
from csv import DictWriter
from lxml import etree

f = StringIO('''
<html><body>
<a class="ui-magnifier-glass"
   href="here goes the link that i want to extract"
   data-spm-anchor-id="0.0.0.0"
   style="width: 258px; height: 258px; position: absolute; left: -1px; top: -1px; display: none;"
></a>
<a href="link to extract"
   title="title to extract"
   rel="category tag"
   data-spm-anchor-id="0.0.0.0"
>or maybe this word instead of title</a>
</body></html>
''')

doc = etree.parse(f)

data = []
# Get all links with data-spm-anchor-id="0.0.0.0"
r = doc.xpath('//a[@data-spm-anchor-id="0.0.0.0"]')
# Iterate over each matching <a> element
for elem in r:
    # Attributes are accessible with get()
    link = elem.get('href')
    title = elem.get('title')
    # and the text inside the tag is accessible with .text
    text = elem.text
    data.append({
        'link': link,
        'title': title,
        'text': text,
    })

# newline='' keeps the csv module from writing blank lines on Windows
with open('file.csv', 'w', newline='') as csvfile:
    fieldnames = ['link', 'title', 'text']
    writer = DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for row in data:
        writer.writerow(row)
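One caveat worth checking against your real page: elem.text only returns the text that comes before the tag's first child element. If the link text contains nested tags, itertext() collects everything under the tag. A minimal sketch:

```python
from io import StringIO
from lxml import etree

doc = etree.parse(StringIO(
    '<html><body><a data-spm-anchor-id="0.0.0.0">word <b>bold</b> tail</a></body></html>'
))
elem = doc.xpath('//a[@data-spm-anchor-id="0.0.0.0"]')[0]

# .text stops at the first child element
print(elem.text)                 # -> 'word '
# itertext() walks every text node under the tag
print(''.join(elem.itertext()))  # -> 'word bold tail'
```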
Here is how to extract elements by id using lxml and a bit of curl. extract.py:
from lxml import etree
import sys
# grab all elements with id == 'postingbody'
pb = etree.HTML(sys.stdin.read()).xpath("//*[@id='postingbody']")
print(pb)
some.html:
<html>
<body>
<div id="nope">nope</div>
<div id="postingbody">yep</div>
</body>
</html>
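To try extract.py's XPath without curl, the same lookup can be run on an in-memory string with etree.HTML, which uses lxml's tolerant HTML parser:

```python
from lxml import etree

html = """
<html>
<body>
    <div id="nope">nope</div>
    <div id="postingbody">yep</div>
</body>
</html>
"""

# Same expression as extract.py, applied to a string instead of stdin
pb = etree.HTML(html).xpath("//*[@id='postingbody']")
print(pb[0].text)  # -> yep
```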
Comments:

Thank you very much! Is there a way to collect all the data from the different variables into a dictionary or list, and then append it to the csv?

It already does that. I've added more comments and refactored it for clarity. If you haven't already, you should run it under interactive Python; that lets you see what is happening line by line and inspect the intermediate state.

Yes, I've run the code, but the problem is that it adds 3 identical rows of data to the csv. Maybe that's because data is a list and is being used as a dictionary?

The example finds all a elements with the attribute data-spm-anchor-id="0.0.0.0". Since there are two such elements, there is a corresponding number of data rows. The first row is the header row, which tells you what the columns contain; you can skip it by removing writer.writeheader().

Do you want to parse with BeautifulSoup, since you tagged it here? I think parsing with BeautifulSoup is the simplest so far.
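As the last comment suggests, the same selection can also be written with BeautifulSoup, which the question is tagged with. A sketch, assuming the bs4 package is installed:

```python
from bs4 import BeautifulSoup

html = '''<html><body>
<a class="ui-magnifier-glass" href="here goes the link that i want to extract"
   data-spm-anchor-id="0.0.0.0"></a>
<a href="link to extract" title="title to extract" rel="category tag"
   data-spm-anchor-id="0.0.0.0">or maybe this word instead of title</a>
</body></html>'''

soup = BeautifulSoup(html, 'html.parser')
rows = []
# find_all can filter on arbitrary attributes via the attrs dict
for a in soup.find_all('a', attrs={'data-spm-anchor-id': '0.0.0.0'}):
    rows.append((a.get('href'), a.get('title'), a.get_text()))
    print(rows[-1])
```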
extract.py reads the page from standard input, so you run it by piping HTML into it, e.g. cat some.html | python extract.py for a local file, or by piping the output of curl for a live URL.