Writing an XPath to select the description


xpath web-scraping scrapy web-crawler


I want to extract the description from an HTML page. My div contains the following markup:

  <div class="container page_op-detail">
 <form id="j_id0:OpenPositionTemplate:j_id21" enctype="application/x-www-form-urlencoded" action="/careers/OpenPosDetail?id=a0m80000002zvKeAAI" method="post" name="j_id0:OpenPositionTemplate:j_id21">
 <span id="ajax-view-state-page-container" style="display: none">
 <p> Solving the world’s hardest problems is no easy task. Our engineers often find themselves in the midst of combat zones, disaster relief efforts or even worse, boardrooms. AP Specialists ensure that our engineers have every tool they need to crack some of the most challenging and puzzling problems on the planet. We do this by managing numerous relationships with vendors across the country and around the globe that can provide our engineers with everything they need to make the world a safer place. As our company continues to grow, we are constantly thinking about how to improve and automate processes so we can continue providing amazing outcomes in even more places across the world.</p>
 <p>
    <strong>Responsibilities</strong>
  </p>
  <ul>
     <li> Ownership and oversight of full-cycle accounts payable responsibilities including but not limited to, invoice processing, maintaining vendor records, running payment reports according to payment schedules, reconciling vendor statements)</li>
     <li> Identify and implement process improvements and automation in appropriate areas throughout the AP cycle</li>
     <li> Provide excellent customer service to vendors and employees by researching and resolving inquiries in a timely manner</li>
     <li> Assist with month-end activities, accruals, reconciliation, preparing 1099s, and audit support</li>
   <li> Assist with ad-hoc requests</li>
  </ul>
 <p>
    <strong>Qualifications</strong>
 </p>
  <ul>
     <li> AA/AS degree or equivalent experience in accounting</li>
     <li> Three years or more of related experience</li>
     <li> Full cycle accounts payable knowledge</li>
  </ul>
  <p class="type-centered">
       Data is more organised...!!!
   </p>
  <p class="type-centered apply-button">
  </div>

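This does not work. Please help me create an XPath that excludes the items I do not need. How should I write an XPath for this kind of query?

Expected output: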

<div class="container page_op-detail">
 <form id="j_id0:OpenPositionTemplate:j_id21" enctype="application/x-www-form-urlencoded" action="/careers/OpenPosDetail?id=a0m80000002zvKeAAI" method="post" name="j_id0:OpenPositionTemplate:j_id21">
 <span id="ajax-view-state-page-container" style="display: none">
 <p> Solving the world’s hardest problems is no easy task. Our engineers often find themselves in the midst of combat zones, disaster relief efforts or even worse, boardrooms. AP Specialists ensure that our engineers have every tool they need to crack some of the most challenging and puzzling problems on the planet. We do this by managing numerous relationships with vendors across the country and around the globe that can provide our engineers with everything they need to make the world a safer place. As our company continues to grow, we are constantly thinking about how to improve and automate processes so we can continue providing amazing outcomes in even more places across the world.</p>

  <p class="type-centered">
       Data is more organised...!!!
   </p>
  <p class="type-centered apply-button">
  </div>


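Solving the world’s hardest problems is no easy task. Our engineers often find themselves in the midst of combat zones, disaster relief efforts or even worse, boardrooms. AP Specialists ensure that our engineers have every tool they need to crack some of the most challenging and puzzling problems on the planet. We do this by managing numerous relationships with vendors across the country and around the globe that can provide our engineers with everything they need to make the world a safer place. As our company continues to grow, we are constantly thinking about how to improve and automate processes so we can continue providing amazing outcomes in even more places across the world.

Data is more organised...!!!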

Assuming the form and span tags are empty elements, you can try the following XPath:

//div[@class='container page_op-detail']/*[not(self::p[normalize-space(.)='Responsibilities'])
                                        and not(self::ul[preceding-sibling::p[normalize-space(.)='Responsibilities']])
                                        and not(self::ul[preceding-sibling::p[normalize-space(.)='Qualifications']])
                                        and not(self::p[normalize-space(.)='Qualifications'])]
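To see what this expression keeps, here is a minimal sketch that evaluates it with lxml against a shortened copy of the markup. It assumes the form and span are closed (as the answer requires), uses `//div` so the div can sit anywhere in the page, and the names `PAGE`, `XPATH`, `kept_tags` and `kept_text` are only illustrative.

```python
from lxml import html

# Shortened, well-formed copy of the page fragment (assumption: the form and
# span are empty elements, and the list items are abbreviated).
PAGE = """
<div class="container page_op-detail">
<form id="f"></form>
<span id="ajax-view-state-page-container" style="display: none"></span>
<p> Solving the world's hardest problems ... </p>
<p><strong>Responsibilities</strong></p>
<ul><li> Assist with ad-hoc requests</li></ul>
<p><strong>Qualifications</strong></p>
<ul><li> Full cycle accounts payable knowledge</li></ul>
<p class="type-centered">
   Data is more organised...!!!
</p>
<p class="type-centered apply-button"></p>
</div>
"""

# Keep every child of the div except the two heading paragraphs and the
# lists that follow them.
XPATH = """
//div[@class='container page_op-detail']/*[
    not(self::p[normalize-space(.)='Responsibilities'])
    and not(self::ul[preceding-sibling::p[normalize-space(.)='Responsibilities']])
    and not(self::ul[preceding-sibling::p[normalize-space(.)='Qualifications']])
    and not(self::p[normalize-space(.)='Qualifications'])
]
"""

tree = html.fromstring(PAGE)
kept = tree.xpath(XPATH)

# Tag names of the surviving children, in document order.
kept_tags = [el.tag for el in kept]
# Non-empty text of the surviving children.
kept_text = [el.text_content().strip() for el in kept if el.text_content().strip()]

print(kept_tags)
print(kept_text)
```

The empty form, span and apply-button paragraph are still selected, so you may want to filter out elements without text afterwards, as `kept_text` does here.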

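First, your HTML is missing several closing tags, including </form> and </span>, among others. I assume the following HTML is the corrected version: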

<div class="container page_op-detail">
<form id="j_id0:OpenPositionTemplate:j_id21" enctype="application/x-www-form-urlencoded"         action="/careers/OpenPosDetail?id=a0m80000002zvKeAAI" method="post" name="j_id0:OpenPositionTemplate:j_id21"></form>
<span id="ajax-view-state-page-container" style="display: none"></span>
<p> Solving the world’s hardest problems ... </p>
<p>
<strong>Responsibilities</strong>
</p>
<ul>
 <li> Ownership and oversight of full-cycle .....</li>
 <li> Identify and implement process improvements ...</li>
 <li> Provide excellent customer service to vendors ... </li>
 <li> Assist with month-end activities, accruals, ...</li>
<li> Assist with ad-hoc requests</li>
</ul>
<p>
<strong>Qualifications</strong>
</p>
<ul>
 <li> AA/AS degree or equivalent experience in accounting</li>
 <li> Three years or more of related experience</li>
 <li> Full cycle accounts payable knowledge</li>
</ul>
<p class="type-centered">
   Data is more organised...!!!
</p>
<p class="type-centered apply-button"></p>
</div>

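The first paragraph can be extracted with:

//div[@class="container page_op-detail"]/p[1]/text()

The next tag you need can be extracted with:

//div[@class="container page_op-detail"]/p[@class="type-centered"]/text()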
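As a rough check outside Scrapy, the sketch below runs the `p[@class="type-centered"]` expression together with a `p[1]/text()` expression for the opening paragraph, using lxml on a shortened copy of the corrected markup. The `PAGE` string and the variable names are assumptions made for illustration; in a spider the same expressions would go into `response.xpath(...)`.

```python
from lxml import html

# Shortened copy of the corrected markup (assumption: closing tags added,
# list items abbreviated).
PAGE = """
<div class="container page_op-detail">
<form id="f"></form>
<span id="ajax-view-state-page-container" style="display: none"></span>
<p> Solving the world's hardest problems ... </p>
<p><strong>Responsibilities</strong></p>
<ul><li> Assist with ad-hoc requests</li></ul>
<p><strong>Qualifications</strong></p>
<ul><li> Full cycle accounts payable knowledge</li></ul>
<p class="type-centered">
   Data is more organised...!!!
</p>
<p class="type-centered apply-button"></p>
</div>
"""

DIV = '//div[@class="container page_op-detail"]'

tree = html.fromstring(PAGE)

# Text of the first <p> child of the div (the opening paragraph).
intro = tree.xpath(DIV + "/p[1]/text()")
# Text of the <p> whose class attribute is exactly "type-centered"; the
# "type-centered apply-button" paragraph does not match an exact comparison.
closing = tree.xpath(DIV + '/p[@class="type-centered"]/text()')

# Strip surrounding whitespace and join the pieces into one description.
description = "\n".join(part.strip() for part in intro + closing if part.strip())
print(description)
```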

Then you can use an ItemLoader to append both extractions to the same item field, "description", along these lines:

from scrapy.loader import ItemLoader
from myproject.items import Product

def parse(self, response):
    l = ItemLoader(item=Product(), response=response)
    l.add_xpath('name', '//div[@class="product_name"]')
    l.add_xpath('name', '//div[@class="product_title"]')  # note: the 'name' field is used twice
    return l.load_item()

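What is your expected output?

The expected output should be: "Solving the world’s hardest problems is no easy task. Our engineers often find ..." and "Data is more organised", excluding Responsibilities and Qualifications.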


