Write a Python script to find the XPath in HTML for any given class

In Python, I want the user to enter a URL at a console prompt (take the input and store it in a variable). For example, say the web page contains the following HTML:

<html>
<head>
</head>
    <body>
        <div>
            <h1 class="class_one">First heading</h1>
                <p>Some text</p>
            <div class="class_two">
                <div class="class_three">
                    <div class="class_one">
                        <center class="class_two">
                            <h3 class="class_three">
                            </h3>
                        </center>
                        <center>
                            <h3 class="find_first_class">
                                Some text
                            </h3>
                        </center>
                    </div>
                </div>
            </div>
            <div class="class_two">
                <div class="class_three">
                    <div class="class_one">
                        <center class="class_two">
                            <h2 class="find_second_class">
                            </h2>
                        </center>
                    </div>
                </div>
            </div>
        </div>
    </body>
</html>

This is the try/except logic Orhan mentioned. lxml parses the document passed to it, lets you address elements via XPath, and extracts their classes. After that, it is a simple check whether they appear among the wanted classes. lxml can also reconstruct the absolute XPath of an element via ElementTree's getpath().

import csv
import requests
from lxml import etree

target_url = input('Which url is to be scraped?')

page = '''
<html>
<head>
</head>
    <body>
        <div>
            <h1 class="class_one">First heading</h1>
                <p>Some text</p>
            <div class="class_two">
                <div class="class_three">
                    <div class="class_one">
                        <center class="class_two">
                            <h3 class="class_three">
                            </h3>
                        </center>
                        <center>
                            <h3 class="find_first_class">
                                Some text
                            </h3>
                        </center>
                    </div>
                </div>
            </div>
            <div class="class_two">
                <div class="class_three">
                    <div class="class_one">
                        <center class="class_two">
                            <h2 class="find_second_class">
                            </h2>
                        </center>
                    </div>
                </div>
            </div>
        </div>
    </body>
</html>
'''

# For a live page, fetch it and parse the response bytes instead of the
# hard-coded sample above. Note that etree.parse() expects a file or
# filename; raw bytes go through etree.fromstring().
#response = requests.get(target_url)
#document = etree.fromstring(response.content)
classes_list = ['find_first_class', 'find_second_class']
expressions = []

document = etree.fromstring(page)

# Walk every element in the document; for each one that carries a class
# attribute, check whether that class is one we are looking for and, if so,
# record its absolute XPath via getpath().
for element in document.xpath('//*'):
    try:
        ele_class = element.xpath("@class")[0]
        print(ele_class)
        if ele_class in classes_list:
            tree = etree.ElementTree(element)
            expressions.append((ele_class, tree.getpath(element)))
    except IndexError:
        # Element has no class attribute at all.
        print("No class in this element.")
        continue

# Dump the (class, xpath) pairs to a CSV file. newline='' avoids the
# blank rows the csv module otherwise produces on Windows.
with open('test.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter=',')
    writer.writerows(expressions)
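
One caveat with the script above: the equality check ele_class in classes_list only matches elements whose class attribute is exactly one of the wanted names, so an element carrying several classes (e.g. class="class_one highlighted" is a hypothetical example) would be skipped. A minimal sketch of a looser check that splits the attribute on whitespace, reusing the names from the script above:

# Sketch: also match elements that carry a wanted class among several.
for element in document.xpath('//*[@class]'):
    for ele_class in element.get('class').split():
        if ele_class in classes_list:
            expressions.append((ele_class, document.getroottree().getpath(element)))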
For anyone looking for something similar, here is how I did it:

import os
import csv
import requests
from lxml import html


csv_file = input("Enter CSV file name\n")
full_csv_file = os.path.abspath(csv_file)
with open(full_csv_file) as f:
    reader = csv.reader(f)
    data = [temp_reader for temp_reader in reader]
    final_result = {}

    # Each CSV row holds a URL followed by the classes to locate on that page.
    for item_list in data:
        class_locations = {}
        url = item_list.pop(0)
        try:
            page = requests.get(url)
            root = html.fromstring(page.text)
            tree = root.getroottree()
            for find_class in item_list:
                find_class_locations = []
                try:
                    # Match elements whose (whitespace-normalized) class
                    # attribute contains find_class as a whole word, then
                    # record each match's absolute XPath.
                    result = root.xpath("//*[contains(concat(' ', normalize-space(@class), ' '), ' " + find_class + " ')]")
                    for r in result:
                        find_class_locations.append(tree.getpath(r))
                    class_locations[find_class] = find_class_locations
                except Exception as e:
                    print(e)
                    continue
            final_result[url] = class_locations
        except Exception as e:
            print(e)
            continue
print(final_result)
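
For reference, a hypothetical input CSV row and the rough shape of the result this produces for the sample HTML shown earlier (the URL is illustrative):

# Hypothetical input CSV, one row per page: the URL first, then the classes.
#   https://example.com/page,find_first_class,find_second_class
#
# For the sample HTML shown earlier, final_result would look roughly like:
# {'https://example.com/page': {
#     'find_first_class': ['/html/body/div/div[1]/div/div/center[2]/h3'],
#     'find_second_class': ['/html/body/div/div[2]/div/div/center/h2']}}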

You can use the XPaths for the first and second class inside a simple try/except block. I don't think you need to waste time checking for anything special in the web page.

@OrhanSolak Thanks for the feedback. The problem is that I have never used Python or done web scraping before. I want to learn it, but right now (due to the urgency of the work) I am looking for a script I can start from, and I will definitely dig deeper later.

This might help. Even if you only read the highlighted part, you can get your job done within minutes.

@OrhanSolak Thanks a lot, I read it, but I could not work out how to find the XPath of a node, e.g. /div[1]/div[2]/center[2]/h3[1]. Can you show me how to find the XPath?

Try: $x("normalize-space(//div[@class='class_two']//div[@class='class_one']//h3[@class='find_first_class'])") If you copy-paste this into the console (press F12 in Chrome), the output will be the "Some text" between the h3 tags. By the way, normalize-space() only renders the output as readable text; leave that part out when you write the code in Python.
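
To make the comment thread concrete, here is a minimal sketch of the plain try/except approach Orhan suggests, assuming the sample page string from the first script (normalize-space() is dropped, as advised above):

from lxml import etree

document = etree.fromstring(page)  # `page` is the sample HTML string above
try:
    # Hard-coded XPath from the comment thread, without normalize-space().
    heading = document.xpath("//div[@class='class_two']"
                             "//div[@class='class_one']"
                             "//h3[@class='find_first_class']")[0]
    print(heading.text.strip())  # prints: Some text
except IndexError:
    print("find_first_class not found in the page")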