Extracting information from an HTML page with Python lxml
I'm trying to write a Python script, with my limited knowledge, to grab specific information from a webpage, but I don't think my limited knowledge is enough. I need to extract 7-8 pieces of information. The tags look like this:
You can use the lxml and csv modules to do what you want. lxml supports XPath expressions for selecting the elements you need:
from io import StringIO
from csv import DictWriter
from lxml import etree

f = StringIO('''
<html><body>
<a class="ui-magnifier-glass"
   href="here goes the link that i want to extract"
   data-spm-anchor-id="0.0.0.0"
   style="width: 258px; height: 258px; position: absolute; left: -1px; top: -1px; display: none;"
></a>
<a href="link to extract"
   title="title to extract"
   rel="category tag"
   data-spm-anchor-id="0.0.0.0"
>or maybe this word instead of title</a>
</body></html>
''')

doc = etree.parse(f)

data = []
# Get all links with data-spm-anchor-id="0.0.0.0"
r = doc.xpath('//a[@data-spm-anchor-id="0.0.0.0"]')
# Iterate over each matching <a> element
for elem in r:
    # Attributes are accessible with get()
    link = elem.get('href')
    title = elem.get('title')
    # and the text inside the tag is accessible with .text
    text = elem.text
    data.append({
        'link': link,
        'title': title,
        'text': text,
    })

# newline='' keeps the csv module from writing blank lines on Windows
with open('file.csv', 'w', newline='') as csvfile:
    fieldnames = ['link', 'title', 'text']
    writer = DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for row in data:
        writer.writerow(row)
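One caveat worth checking against your real page: elem.text only returns the text that comes before the tag's first child element. If the link text contains nested tags, itertext() collects everything under the tag. A minimal sketch:

```python
from io import StringIO
from lxml import etree

doc = etree.parse(StringIO(
    '<html><body><a data-spm-anchor-id="0.0.0.0">word <b>bold</b> tail</a></body></html>'
))
elem = doc.xpath('//a[@data-spm-anchor-id="0.0.0.0"]')[0]

# .text stops at the first child element
print(elem.text)                 # -> 'word '
# itertext() walks every text node under the tag
print(''.join(elem.itertext()))  # -> 'word bold tail'
```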
Here is how to extract elements by id using lxml and a bit of curl. extract.py:
from lxml import etree
import sys
# grab all elements with id == 'postingbody'
pb = etree.HTML(sys.stdin.read()).xpath("//*[@id='postingbody']")
print(pb)
some.html:
<html>
<body>
<div id="nope">nope</div>
<div id="postingbody">yep</div>
</body>
</html>
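To try extract.py's XPath without curl, the same lookup can be run on an in-memory string with etree.HTML, which uses lxml's tolerant HTML parser:

```python
from lxml import etree

html = """
<html>
<body>
    <div id="nope">nope</div>
    <div id="postingbody">yep</div>
</body>
</html>
"""

# Same expression as extract.py, applied to a string instead of stdin
pb = etree.HTML(html).xpath("//*[@id='postingbody']")
print(pb[0].text)  # -> yep
```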
Comments:

Thank you very much! Is there a way to collect all the data from the different variables into a dictionary or list, and then append it to the csv?

It already does that. I've added more comments and refactored it for clarity. If you haven't already, you should run it under interactive Python; that lets you see what is happening line by line and inspect the intermediate state.

Yes, I've run the code, but the problem is that it adds 3 identical rows of data to the csv. Maybe that's because data is a list and is being used as a dictionary?

The example finds all a elements with the attribute data-spm-anchor-id="0.0.0.0". Since there are two such elements, there is a corresponding number of data rows. The first row is the header row, which tells you what the columns contain; you can skip it by removing writer.writeheader().

Do you want to parse with BeautifulSoup, since you tagged it here? I think parsing with BeautifulSoup is the simplest so far.
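As the last comment suggests, the same selection can also be written with BeautifulSoup, which the question is tagged with. A sketch, assuming the bs4 package is installed:

```python
from bs4 import BeautifulSoup

html = '''<html><body>
<a class="ui-magnifier-glass" href="here goes the link that i want to extract"
   data-spm-anchor-id="0.0.0.0"></a>
<a href="link to extract" title="title to extract" rel="category tag"
   data-spm-anchor-id="0.0.0.0">or maybe this word instead of title</a>
</body></html>'''

soup = BeautifulSoup(html, 'html.parser')
rows = []
# find_all can filter on arbitrary attributes via the attrs dict
for a in soup.find_all('a', attrs={'data-spm-anchor-id': '0.0.0.0'}):
    rows.append((a.get('href'), a.get('title'), a.get_text()))
    print(rows[-1])
```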
extract.py reads the page from standard input, so you run it by piping HTML into it, e.g. cat some.html | python extract.py for a local file, or by piping the output of curl for a live URL.