XPath for a div tag that excludes a span tag and returns the text (XML, XPath, Scrapy)


I need an XPath for the HTML below:

  <div itemtype="http://schema.org/PostalAddress" itemscope="" itemprop="jobLocation">
  <div class="aiDetailJobInfoLabel aiDetailJobInfoLocation">Location: </div>
  <div class="aiDetailJobInfo aiDetailJobInfoLocation">
     <span itemprop="addressLocality">Topeka</span>
      , KS
      <span itemprop="postalCode">66607</span>
  </div>
</div>

If I run the code below, I get every text node, including the postal code:

response.xpath('//div[@itemprop="jobLocation"]/div[@class="aiDetailJobInfo aiDetailJobInfoLocation"]//text()').extract()

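Please help.

For reference:

The XPath should return the div's text without the postal code, that is, the remaining div and span text. Sometimes the postalCode span is not present in this div. So if it is present, skip it; if it is not, return the whole text of the div.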

Here are some snippets; use whichever one fits your needs.

Try this:

response.xpath('//div[@class="aiDetailJobInfo aiDetailJobInfoLocation"]//text()').re(r'[ .a-zA-Z]\w+')



response.xpath('//div[@class="aiDetailJobInfo aiDetailJobInfoLocation"]//text()').re(r'[a-zA-Z]+')


response.xpath('//div[@itemprop="jobLocation"]/div[@class="aiDetailJobInfo aiDetailJobInfoLocation"]//text()').extract()[1:3]
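To see what these three snippets keep, here is a rough stand-in that uses only Python's `re` module. The `text_nodes` list is an assumption: it is what `//text()` would yield for the target div in the question's HTML, whitespace-only nodes included. The helper `apply_regex` is hypothetical and only approximates `.re()`, which applies the pattern to each extracted string and flattens the matches.

```python
import re

# Assumed sample: the five strings that //text() would yield for the
# target div in the question's HTML (whitespace-only nodes included).
text_nodes = ["\n     ", "Topeka", "\n      , KS\n      ", "66607", "\n  "]


def apply_regex(pattern, nodes):
    """Rough stand-in for .re(): run findall on each node and flatten."""
    compiled = re.compile(pattern)
    matches = []
    for node in nodes:
        matches.extend(compiled.findall(node))
    return matches


first = apply_regex(r"[ .a-zA-Z]\w+", text_nodes)
second = apply_regex(r"[a-zA-Z]+", text_nodes)
sliced = text_nodes[1:3]  # positional: second and third text nodes

print(first)   # ['Topeka', ' KS']
print(second)  # ['Topeka', 'KS']
print(sliced)  # ['Topeka', '\n      , KS\n      ']
```

On this sample both regex variants drop the comma, and the slice keeps it but relies on the text nodes always appearing in the same positions.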

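It looks like you basically want to concatenate all text-node descendants of the target div, except those under the postalCode span. The relevant set of text nodes would be selected by an XPath like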

//div[@itemprop="jobLocation"]/div[@class="aiDetailJobInfo aiDetailJobInfoLocation"]
   //text()[not(parent::span[@itemprop="postalCode"])]

If you extract this XPath, you get a list of strings, one per text node, which you can join together at the Python level.
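The joining step could look something like the sketch below. The helper name `join_text_nodes` and the `sample_nodes` list are assumptions for illustration: the list stands in for what the XPath above would return for the HTML in the question, with the postalCode text already excluded.

```python
import re


def join_text_nodes(nodes):
    """Join extracted text nodes and tidy the whitespace between them."""
    joined = "".join(nodes)
    # Collapse runs of whitespace (including newlines) into single spaces.
    collapsed = " ".join(joined.split())
    # Drop the space left in front of a comma, e.g. "Topeka , KS".
    return re.sub(r"\s+,", ",", collapsed)


# Hypothetical sample: strings the XPath would yield for the question's
# HTML once the postalCode span's text has been filtered out.
sample_nodes = ["\n     ", "Topeka", "\n      , KS\n      ", "\n  "]

print(join_text_nodes(sample_nodes))  # Topeka, KS
```

In a Scrapy callback, `nodes` would presumably come from calling `.extract()` on the result of `response.xpath(...)` with the expression above. Because the predicate only filters text whose parent is the postalCode span, the same code should work unchanged when that span is absent.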

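Thanks for looking at this. Both return only Topeka, but I need Topeka, KS. It should not include the postal code; any other text in the div tag should all be returned.

Fine! Can't we handle this with XPath alone instead of a regular expression? I plan to use only XPath, something like extracting the div text without the span postalCode tag.

You can use slicing to get a result like that: response.xpath('//div[@itemprop="jobLocation"]/div[@class="aiDetailJobInfo aiDetailJobInfoLocation"]//text()').extract()[1:3]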
