Write a Python script to find the XPath in HTML for any given class

In Python, I want the user to enter a URL at a console prompt (take the input and store it in a variable). For example, say the web page contains the following HTML:

<html>
<head>
</head>
    <body>
        <div>
            <h1 class="class_one">First heading</h1>
                <p>Some text</p>
            <div class="class_two">
                <div class="class_three">
                    <div class="class_one">
                        <center class="class_two">
                            <h3 class="class_three">
                            </h3>
                        </center>
                        <center>
                            <h3 class="find_first_class">
                                Some text
                            </h3>
                        </center>
                    </div>
                </div>
            </div>
            <div class="class_two">
                <div class="class_three">
                    <div class="class_one">
                        <center class="class_two">
                            <h2 class="find_second_class">
                            </h2>
                        </center>
                    </div>
                </div>
            </div>
        </div>
    </body>
</html>

This is the try/except logic Orhan mentioned. lxml parses the document passed to it, lets you address elements via XPath, and extracts their classes. After that, it is a simple check whether they appear among the wanted classes. lxml can also reconstruct the absolute XPath of an element via ElementTree's getpath().

import csv
import requests
from lxml import etree

target_url = input('Which url is to be scraped?')

page = '''
<html>
<head>
</head>
    <body>
        <div>
            <h1 class="class_one">First heading</h1>
                <p>Some text</p>
            <div class="class_two">
                <div class="class_three">
                    <div class="class_one">
                        <center class="class_two">
                            <h3 class="class_three">
                            </h3>
                        </center>
                        <center>
                            <h3 class="find_first_class">
                                Some text
                            </h3>
                        </center>
                    </div>
                </div>
            </div>
            <div class="class_two">
                <div class="class_three">
                    <div class="class_one">
                        <center class="class_two">
                            <h2 class="find_second_class">
                            </h2>
                        </center>
                    </div>
                </div>
            </div>
        </div>
    </body>
</html>
'''

# For a live page, fetch it and parse the response bytes instead of the
# hard-coded sample above. Note that etree.parse() expects a file or
# filename; raw bytes go through etree.fromstring().
#response = requests.get(target_url)
#document = etree.fromstring(response.content)
classes_list = ['find_first_class', 'find_second_class']
expressions = []

document = etree.fromstring(page)

# Walk every element in the document; for each one that carries a class
# attribute, check whether that class is one we are looking for and, if so,
# record its absolute XPath via getpath().
for element in document.xpath('//*'):
    try:
        ele_class = element.xpath("@class")[0]
        print(ele_class)
        if ele_class in classes_list:
            tree = etree.ElementTree(element)
            expressions.append((ele_class, tree.getpath(element)))
    except IndexError:
        # Element has no class attribute at all.
        print("No class in this element.")
        continue

# Dump the (class, xpath) pairs to a CSV file. newline='' avoids the
# blank rows the csv module otherwise produces on Windows.
with open('test.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter=',')
    writer.writerows(expressions)
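
One caveat with the script above: the equality check ele_class in classes_list only matches elements whose class attribute is exactly one of the wanted names, so an element carrying several classes (e.g. class="class_one highlighted" is a hypothetical example) would be skipped. A minimal sketch of a looser check that splits the attribute on whitespace, reusing the names from the script above:

# Sketch: also match elements that carry a wanted class among several.
for element in document.xpath('//*[@class]'):
    for ele_class in element.get('class').split():
        if ele_class in classes_list:
            expressions.append((ele_class, document.getroottree().getpath(element)))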
For anyone looking for something similar, here is how I did it:

import os
import csv
import requests
from lxml import html


csv_file = input("Enter CSV file name\n")
full_csv_file = os.path.abspath(csv_file)
with open(full_csv_file) as f:
    reader = csv.reader(f)
    data = [temp_reader for temp_reader in reader]
    final_result = {}

    # Each CSV row holds a URL followed by the classes to locate on that page.
    for item_list in data:
        class_locations = {}
        url = item_list.pop(0)
        try:
            page = requests.get(url)
            root = html.fromstring(page.text)
            tree = root.getroottree()
            for find_class in item_list:
                find_class_locations = []
                try:
                    # Match elements whose (whitespace-normalized) class
                    # attribute contains find_class as a whole word, then
                    # record each match's absolute XPath.
                    result = root.xpath("//*[contains(concat(' ', normalize-space(@class), ' '), ' " + find_class + " ')]")
                    for r in result:
                        find_class_locations.append(tree.getpath(r))
                    class_locations[find_class] = find_class_locations
                except Exception as e:
                    print(e)
                    continue
            final_result[url] = class_locations
        except Exception as e:
            print(e)
            continue
print(final_result)
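
For reference, a hypothetical input CSV row and the rough shape of the result this produces for the sample HTML shown earlier (the URL is illustrative):

# Hypothetical input CSV, one row per page: the URL first, then the classes.
#   https://example.com/page,find_first_class,find_second_class
#
# For the sample HTML shown earlier, final_result would look roughly like:
# {'https://example.com/page': {
#     'find_first_class': ['/html/body/div/div[1]/div/div/center[2]/h3'],
#     'find_second_class': ['/html/body/div/div[2]/div/div/center/h2']}}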

You can use the XPaths for the first and second class inside a simple try/except block. I don't think you need to waste time checking for anything special in the web page.

@OrhanSolak Thanks for the feedback. The problem is that I have never used Python or done web scraping before. I want to learn it, but right now (due to the urgency of the work) I am looking for a script I can start from, and I will definitely dig deeper later.

This might help. Even if you only read the highlighted part, you can get your job done within minutes.

@OrhanSolak Thanks a lot, I read it, but I could not work out how to find the XPath of a node, e.g. /div[1]/div[2]/center[2]/h3[1]. Can you show me how to find the XPath?

Try: $x("normalize-space(//div[@class='class_two']//div[@class='class_one']//h3[@class='find_first_class'])") If you copy-paste this into the console (press F12 in Chrome), the output will be the "Some text" between the h3 tags. By the way, normalize-space() only renders the output as readable text; leave that part out when you write the code in Python.
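
To make the comment thread concrete, here is a minimal sketch of the plain try/except approach Orhan suggests, assuming the sample page string from the first script (normalize-space() is dropped, as advised above):

from lxml import etree

document = etree.fromstring(page)  # `page` is the sample HTML string above
try:
    # Hard-coded XPath from the comment thread, without normalize-space().
    heading = document.xpath("//div[@class='class_two']"
                             "//div[@class='class_one']"
                             "//h3[@class='find_first_class']")[0]
    print(heading.text.strip())  # prints: Some text
except IndexError:
    print("find_first_class not found in the page")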