Write a Python script to find the XPath of any class in an HTML page
In Python, I want the user to enter a URL at a console prompt (read the input and store it in a variable). For example, if the webpage contains the following HTML:
<html>
<head>
</head>
<body>
  <div>
    <h1 class="class_one">First heading</h1>
    <p>Some text</p>
    <div class="class_two">
      <div class="class_three">
        <div class="class_one">
          <center class="class_two">
            <h3 class="class_three">
            </h3>
          </center>
          <center>
            <h3 class="find_first_class">
              Some text
            </h3>
          </center>
        </div>
      </div>
    </div>
    <div class="class_two">
      <div class="class_three">
        <div class="class_one">
          <center class="class_two">
            <h2 class="find_second_class">
            </h2>
          </center>
        </div>
      </div>
    </div>
  </div>
</body>
</html>
This is the try/except logic Orhan mentioned. lxml parses the document passed to it; elements can be addressed via XPath and their class attributes extracted. After that it is a simple check whether the class appears in the list of wanted classes. lxml can also reconstruct an element's XPath from the document tree.
import csv
import requests
from lxml import etree

target_url = input('Which url is to be scraped? ')

page = '''
<html>
<head>
</head>
<body>
  <div>
    <h1 class="class_one">First heading</h1>
    <p>Some text</p>
    <div class="class_two">
      <div class="class_three">
        <div class="class_one">
          <center class="class_two">
            <h3 class="class_three">
            </h3>
          </center>
          <center>
            <h3 class="find_first_class">
              Some text
            </h3>
          </center>
        </div>
      </div>
    </div>
    <div class="class_two">
      <div class="class_three">
        <div class="class_one">
          <center class="class_two">
            <h2 class="find_second_class">
            </h2>
          </center>
        </div>
      </div>
    </div>
  </div>
</body>
</html>
'''

# response = requests.get(target_url)
# document = etree.fromstring(response.content)  # etree.parse expects a file-like object, so fromstring fits here

classes_list = ['find_first_class', 'find_second_class']
expressions = []

document = etree.fromstring(page)
tree = document.getroottree()  # one tree for the whole document, so getpath returns the full path
for element in document.xpath('//*'):
    try:
        ele_class = element.xpath("@class")[0]
        print(ele_class)
        if ele_class in classes_list:
            expressions.append((ele_class, tree.getpath(element)))
    except IndexError:
        print("No class in this element.")
        continue

with open('test.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter=',')
    writer.writerows(expressions)
For anyone looking for something similar, this is how I did it:
import os
import csv
import requests
from lxml import html

csv_file = input("Enter CSV file name\n")
full_csv_file = os.path.abspath(csv_file)

with open(full_csv_file) as f:
    reader = csv.reader(f)
    data = [row for row in reader]

final_result = {}
for item_list in data:
    class_locations = {}
    url = item_list[0]
    item_list.pop(0)
    try:
        page = requests.get(url)
        root = html.fromstring(page.text)
        tree = root.getroottree()
        for find_class in item_list:
            find_class_locations = []
            try:
                # Match find_class as a whole token within the class attribute
                result = root.xpath(
                    "//*[contains(concat(' ', normalize-space(@class), ' '), ' "
                    + find_class + " ')]")
                for r in result:
                    find_class_locations.append(tree.getpath(r))
                class_locations[find_class] = find_class_locations
            except Exception as e:
                print(e)
                continue
        final_result[url] = class_locations
    except Exception as e:
        print(e)
        continue

print(final_result)
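The `contains(concat(' ', normalize-space(@class), ' '), ' cls ')` expression used above is what makes the lookup robust when an element carries several classes; a plain `@class='cls'` comparison only matches the exact attribute value. A small sketch of the difference:

```python
from lxml import html

snippet = '<div><h3 class="intro find_first_class">Some text</h3></div>'
root = html.fromstring(snippet)

# Exact comparison fails: the attribute value is "intro find_first_class".
exact = root.xpath("//*[@class='find_first_class']")

# Token match succeeds: the padded value " intro find_first_class "
# contains " find_first_class " as a whole word.
token = root.xpath(
    "//*[contains(concat(' ', normalize-space(@class), ' '),"
    " ' find_first_class ')]")
```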
You can just use the XPath for the first and second class inside a simple try/except block. I don't think you need to waste time checking for anything special in the webpage.
@OrhanSolak Thanks for the feedback. The problem is that I have never used Python or web scraping before. I want to learn it, but right now (due to work urgency) I'm looking for a script I can start from; I will definitely dig deeper later.
This might help. Even if you only read the highlighted parts, you can get your job done in a few minutes.
@OrhanSolak Thanks a lot, I read it, but I could not find how to get the XPath of a node, like /div[1]/div[2]/center[2]/h3[1]. Can you show me how to find the XPath?
Try: $x("normalize-space(//div[@class='class_two']//div[@class='class_one']//h3[@class='find_first_class'])"). If you copy-paste this into the console (press F12 in Chrome), the output will be the text between the h3 tags. By the way, normalize-space just renders the output as readable text; don't use that part when writing the code in Python.
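The console expression from the last comment can be run from Python as well. As a sketch of why `normalize-space` should be dropped there: wrapping a query in it makes lxml return the collapsed text of the first match as a string, whereas the bare query returns element objects, which is what `getpath()` needs:

```python
from lxml import html

doc = html.fromstring(
    '<div class="class_two"><div class="class_one">'
    '<h3 class="find_first_class">  Some   text  </h3></div></div>')

# normalize-space(...) evaluates to a string: surrounding whitespace is
# trimmed and internal runs of whitespace collapse to single spaces.
text = doc.xpath(
    "normalize-space(//div[@class='class_two']//div[@class='class_one']"
    "//h3[@class='find_first_class'])")

# Without normalize-space the same query returns element objects,
# which is what you need for getpath().
elements = doc.xpath(
    "//div[@class='class_two']//div[@class='class_one']"
    "//h3[@class='find_first_class']")
```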