Writing an XPath to select the description

Tags: xpath, web-scraping, scrapy, web-crawler, screen-scraping

I want to extract the description from an HTML page.

My div contains the following data:

  <div class="container page_op-detail">
 <form id="j_id0:OpenPositionTemplate:j_id21" enctype="application/x-www-form-urlencoded" action="/careers/OpenPosDetail?id=a0m80000002zvKeAAI" method="post" name="j_id0:OpenPositionTemplate:j_id21">
 <span id="ajax-view-state-page-container" style="display: none">
 <p> Solving the world’s hardest problems is no easy task. Our engineers often find themselves in the midst of combat zones, disaster relief efforts or even worse, boardrooms. AP Specialists ensure that our engineers have every tool they need to crack some of the most challenging and puzzling problems on the planet. We do this by managing numerous relationships with vendors across the country and around the globe that can provide our engineers with everything they need to make the world a safer place. As our company continues to grow, we are constantly thinking about how to improve and automate processes so we can continue providing amazing outcomes in even more places across the world.</p>
 <p>
    <strong>Responsibilities</strong>
  </p>
  <ul>
     <li> Ownership and oversight of full-cycle accounts payable responsibilities including but not limited to, invoice processing, maintaining vendor records, running payment reports according to payment schedules, reconciling vendor statements)</li>
     <li> Identify and implement process improvements and automation in appropriate areas throughout the AP cycle</li>
     <li> Provide excellent customer service to vendors and employees by researching and resolving inquiries in a timely manner</li>
     <li> Assist with month-end activities, accruals, reconciliation, preparing 1099s, and audit support</li>
   <li> Assist with ad-hoc requests</li>
  </ul>
 <p>
    <strong>Qualifications</strong>
 </p>
  <ul>
     <li> AA/AS degree or equivalent experience in accounting</li>
     <li> Three years or more of related experience</li>
     <li> Full cycle accounts payable knowledge</li>
  </ul>
  <p class="type-centered">
       Data is more organised...!!!
   </p>
  <p class="type-centered apply-button">
  </div>
This does not work. Please help me write an XPath that excludes those items. How do I write an XPath for this kind of query?

Expected output:

<div class="container page_op-detail">
 <form id="j_id0:OpenPositionTemplate:j_id21" enctype="application/x-www-form-urlencoded" action="/careers/OpenPosDetail?id=a0m80000002zvKeAAI" method="post" name="j_id0:OpenPositionTemplate:j_id21">
 <span id="ajax-view-state-page-container" style="display: none">
 <p> Solving the world’s hardest problems is no easy task. Our engineers often find themselves in the midst of combat zones, disaster relief efforts or even worse, boardrooms. AP Specialists ensure that our engineers have every tool they need to crack some of the most challenging and puzzling problems on the planet. We do this by managing numerous relationships with vendors across the country and around the globe that can provide our engineers with everything they need to make the world a safer place. As our company continues to grow, we are constantly thinking about how to improve and automate processes so we can continue providing amazing outcomes in even more places across the world.</p>

  <p class="type-centered">
       Data is more organised...!!!
   </p>
  <p class="type-centered apply-button">
  </div>

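Solving the world’s hardest problems is no easy task. Our engineers often find themselves in the midst of combat zones, disaster relief efforts or even worse, boardrooms. AP Specialists ensure that our engineers have every tool they need to crack some of the most challenging and puzzling problems on the planet. We do this by managing numerous relationships with vendors across the country and around the globe that can provide our engineers with everything they need to make the world a safer place. As our company continues to grow, we are constantly thinking about how to improve and automate processes so we can continue providing amazing outcomes in even more places across the world.

Data is more organised...!!!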


Assuming the form and span tags are empty elements, you can try the following XPath:

//div[@class='container page_op-detail']/*[not(self::p[normalize-space(.)='Responsibilities'])
                                        and not(self::ul[preceding-sibling::p[normalize-space(.)='Responsibilities']])
                                        and not(self::ul[preceding-sibling::p[normalize-space(.)='Qualifications']])
                                        and not(self::p[normalize-space(.)='Qualifications'])]
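
The predicate logic of the XPath above can be mirrored in plain Python, which may help when working out why a selector returns an empty list. This is only a sketch using the standard library's ElementTree (which does not support `not()` or `normalize-space()` itself); the shortened sample markup and the names `normalized_text` and `kept_children` are assumptions made for illustration, and the sample assumes form and span are properly closed.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed copy of the markup (assumption: form and span are closed).
SAMPLE = """
<div class="container page_op-detail">
  <form id="f"></form>
  <span id="s"></span>
  <p> Solving the world's hardest problems ... </p>
  <p>
    <strong>Responsibilities</strong>
  </p>
  <ul><li> Assist with ad-hoc requests</li></ul>
  <p>
    <strong>Qualifications</strong>
  </p>
  <ul><li> Full cycle accounts payable knowledge</li></ul>
  <p class="type-centered">
    Data is more organised...!!!
  </p>
  <p class="type-centered apply-button"></p>
</div>
"""

HEADINGS = {"Responsibilities", "Qualifications"}


def normalized_text(element):
    """Rough equivalent of XPath normalize-space(.): join text, collapse whitespace."""
    return " ".join("".join(element.itertext()).split())


def kept_children(div):
    """Return the child elements, skipping heading <p>s and any <ul> after a heading."""
    kept = []
    seen_heading = False
    for child in div:
        if child.tag == "p" and normalized_text(child) in HEADINGS:
            seen_heading = True  # mirrors not(self::p[normalize-space(.)='...'])
            continue
        if child.tag == "ul" and seen_heading:
            continue  # mirrors not(self::ul[preceding-sibling::p[...]])
        kept.append(child)
    return kept


root = ET.fromstring(SAMPLE)
texts = [normalized_text(el) for el in kept_children(root)]
description = [t for t in texts if t]  # drop the empty form/span/p elements
print(description)
# -> ["Solving the world's hardest problems ...", 'Data is more organised...!!!']
```

If the real page leaves form and span unclosed, the paragraphs are not direct children of the div, so a child-axis filter like this one (and the XPath it mirrors) would match nothing.
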

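First, your HTML is missing several closing tags, including </form>, </span>, and so on. I will assume the following HTML is the corrected version: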

<div class="container page_op-detail">
<form id="j_id0:OpenPositionTemplate:j_id21" enctype="application/x-www-form-urlencoded"         action="/careers/OpenPosDetail?id=a0m80000002zvKeAAI" method="post" name="j_id0:OpenPositionTemplate:j_id21"></form>
<span id="ajax-view-state-page-container" style="display: none"></span>
<p> Solving the world’s hardest problems ... </p>
<p>
<strong>Responsibilities</strong>
</p>
<ul>
 <li> Ownership and oversight of full-cycle .....</li>
 <li> Identify and implement process improvements ...</li>
 <li> Provide excellent customer service to vendors ... </li>
 <li> Assist with month-end activities, accruals, ...</li>
<li> Assist with ad-hoc requests</li>
</ul>
<p>
<strong>Qualifications</strong>
</p>
<ul>
 <li> AA/AS degree or equivalent experience in accounting</li>
 <li> Three years or more of related experience</li>
 <li> Full cycle accounts payable knowledge</li>
</ul>
<p class="type-centered">
   Data is more organised...!!!
</p>
<p class="type-centered apply-button"></p>
</div>
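
Missing closing tags like the ones fixed above can be spotted mechanically before writing any XPath. The following is a minimal sketch with the standard library's html.parser; the class name `UnclosedTagFinder`, the helper `find_unclosed`, and the shortened sample are hypothetical, and the check is deliberately naive (it only knows a small set of void elements).

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (small illustrative subset).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class UnclosedTagFinder(HTMLParser):
    """Track open tags on a stack and record those closed implicitly by a parent."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.unclosed = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag, ignore it
        # Pop until the matching tag; everything popped on the way was never closed.
        while self.stack:
            top = self.stack.pop()
            if top == tag:
                break
            self.unclosed.append(top)


def find_unclosed(html):
    """Return tag names that were opened but never explicitly closed."""
    finder = UnclosedTagFinder()
    finder.feed(html)
    finder.close()
    # Anything still open at the end of the input is unclosed as well.
    return finder.unclosed + list(reversed(finder.stack))


MALFORMED = """
<div class="container page_op-detail">
<form id="f">
<span id="s">
<p> Solving the world's hardest problems ... </p>
<p class="type-centered apply-button">
</div>
"""

print(find_unclosed(MALFORMED))
# -> ['p', 'span', 'form']
```
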
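The first paragraph can be extracted with:

//div[@class="container page_op-detail"]/p[1]/text()

The next tag you need can be extracted with:

//div[@class="container page_op-detail"]/p[@class="type-centered"]/text()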
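
The idea of selecting the first paragraph and the type-centered paragraph separately, then combining them, can be tried outside Scrapy. This is a rough sketch using the standard library's ElementTree, which supports only a subset of XPath (no `text()`, so `.text` is used instead); the shortened sample markup is an assumption. In a Scrapy spider the full XPath strings would typically be passed to `response.xpath(...).getall()` instead.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed copy of the corrected markup (assumption for illustration).
SAMPLE = """<div class="container page_op-detail">
<form id="f"></form>
<span id="s"></span>
<p> Solving the world's hardest problems ... </p>
<p><strong>Responsibilities</strong></p>
<ul><li> Assist with ad-hoc requests</li></ul>
<p class="type-centered">
   Data is more organised...!!!
</p>
<p class="type-centered apply-button"></p>
</div>"""

root = ET.fromstring(SAMPLE)

# The root is the div itself, so the paths are relative to it.
first = root.find("p[1]")                          # first <p> child
centered = root.find("p[@class='type-centered']")  # exact class match only

# Strip the surrounding whitespace and join both pieces into one description.
parts = [first.text.strip(), centered.text.strip()]
description = "\n".join(parts)
print(description)
```

Because `@class='type-centered'` is an exact string comparison, the paragraph with class "type-centered apply-button" is not selected.
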
Then you can use an ItemLoader to append both extractions to the same item field, "description", as follows:

from scrapy.loader import ItemLoader
from myproject.items import Product

def parse(self, response):
    # Values from several XPaths are collected into the same field.
    l = ItemLoader(item=Product(), response=response)
    l.add_xpath('name', '//div[@class="product_name"]')
    l.add_xpath('name', '//div[@class="product_title"]')  # note: the 'name' field is used twice
    return l.load_item()

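What is your expected output?

The expected output should be: "Solving the world's hardest problems is no easy task. Our engineers often find ... Data is more organised", excluding Responsibilities and Qualifications. So I want to remove those lists.

It would be better to include your expected output in the question rather than in a comment, for readability.

The closing tags for form, span, and the last p are missing. Your input is not well-formed.

Thanks for your answer, but it still does not work. It returns an empty list. Is there another solution?

I assumed the form and span tags are empty elements. Please fix your input. Are they ancestors of the p and ul elements?

This is how I used it:
//div[@class='container page_op-detail']/*[not(self::p[normalize-space(.)='Responsibilities']) and not(self::ul) and not(self::p[normalize-space(.)='Qualifications']) and not(self)

I have tested this with xpathtester.com. Please see this URL: ()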