Need help writing an XPath string to match multiple (but not all) table cells

python xpath web-scraping

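Note: This question has been updated since some of the early answers were given. It is still the same question, just hopefully clearer.

I'm trying to get a site scraper working, but I'm having trouble coming up with a suitable XPath string for some table cells.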

<tbody>
  <tr>
    <td class="Label" width="20%" valign="top">Uninteresting section</td>
    <td class="Data"> I don't care about this</td>
  </tr>
  <tr>
    <td></td>
    <td class="Data"> I don't care about this</td>
  </tr>
  <tr>
    <td class="Label" width="20%" valign="top">Interesting section</td>
    <td class="Data"> I want this-1</td>
  </tr>
  <tr>
    <td></td>
    <td class="Data"> I want this-2</td>
  </tr>
  <tr>
    <td></td>
    <td class="Data"> I want this-n</td>
  </tr>
  <tr>
    <td class="Label" width="20%" valign="top">Uninteresting section</td>
    <td class="Data"> I don't care about this</td>
  </tr>
  <tr>
    <td></td>
    <td class="Data"> I don't care about this</td>
  </tr>
</tbody>



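I want the contents of all the Data cells in the Interesting section. There can be any number of them. I don't care about anything else in the markup, but I need all of those.

In the example above: I want this-1, I want this-2, I want this-n.

In case it's relevant, I'll be using xml.dom.minidom and py-dom-xpath in Python 2.7.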

//tr[@class="Entry"]/td[@class="Data"]/text()

Update. What it does:

Using tr and td, get the tbody that contains "Section title".

From those, get each td with class="Data".

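You can use

//tr[@class="Entry"]//tr/td[contains(text(), "Section title")]/following::td[@class = "Data"]/text()

to get all n Data tds that follow the section title. Then you can get all m tds that come after the next Label, which are the ones you don't want:

//tr[@class="Entry"]//tr/td[contains(text(), "Section title")]/following::td[@class="Label"][1]/following::td[@class = "Data"]/text()

Then, in Python, you can take the first n-m tds.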
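The "first n-m" idea can be sketched in plain Python. This is only an illustration under stated assumptions: it uses xml.dom.minidom alone rather than py-dom-xpath, and it emulates the following:: axis by treating getElementsByTagName("td") as a document-order list, which is only equivalent when td cells are not nested inside each other. The names HTML, cell_text, following_data_cells and section_data are hypothetical helpers, not part of any library.

```python
from xml.dom import minidom

# Sample markup modelled on the snippet in the question.
HTML = """<tbody>
  <tr><td class="Label">Uninteresting section</td><td class="Data"> I don't care about this</td></tr>
  <tr><td></td><td class="Data"> I don't care about this</td></tr>
  <tr><td class="Label">Interesting section</td><td class="Data"> I want this-1</td></tr>
  <tr><td></td><td class="Data"> I want this-2</td></tr>
  <tr><td></td><td class="Data"> I want this-n</td></tr>
  <tr><td class="Label">Uninteresting section</td><td class="Data"> I don't care about this</td></tr>
  <tr><td></td><td class="Data"> I don't care about this</td></tr>
</tbody>"""


def cell_text(td):
    """Concatenate the direct text children of a cell and strip whitespace."""
    return "".join(n.data for n in td.childNodes
                   if n.nodeType == n.TEXT_NODE).strip()


def following_data_cells(cells, start):
    """Data cells after index `start` in document order
    (stands in for following::td[@class="Data"])."""
    return [td for td in cells[start + 1:]
            if td.getAttribute("class") == "Data"]


def section_data(html, title):
    cells = list(minidom.parseString(html).getElementsByTagName("td"))
    labels = [i for i, td in enumerate(cells)
              if td.getAttribute("class") == "Label"]
    # Case-sensitive substring match, like contains(text(), ...).
    start = next(i for i in labels if title in cell_text(cells[i]))
    all_after = following_data_cells(cells, start)        # the n cells
    later = [i for i in labels if i > start]
    # The m cells that follow the next Label; none if this is the last section.
    after_next = following_data_cells(cells, later[0]) if later else []
    keep = all_after[:len(all_after) - len(after_next)]   # first n-m cells
    return [cell_text(td) for td in keep]


print(section_data(HTML, "Interesting section"))
```

With the sample above, n is 5 and m is 2, so the first three cells are kept.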

You can try to do the same thing in XPath itself, using the position and count functions:

  //tr[@class="Entry"]//tr/td[contains(text(), "Section title")]/following::td[@class = "Data"][position() <= (count(//tr[@class="Entry"]//tr/td[contains(text(), "Section title")]/following::td[@class = "Data"]/text())  - count(//tr[@class="Entry"]//tr/td[contains(text(), "Section title")]/following::td[@class="Label"][1]/following::td[@class = "Data"]/text()) )]/text()

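//tr[@class="Entry"]//tr ... are you looking for a tr nested inside another tr?

There is a table that contains more tables; I left out some of the structure because I can already match it. My problem is the part I posted above: I'm not sure how to get the contents of all the Data cells in that specific section without also picking up Data cells from other sections. The content of the Label cell is the only thing that distinguishes the sections for matching purposes; all sections have the same structure.

This is no good, it doesn't restrict the match to just the section I want.

Not quite. This will only match the first cell; the other cells are not in the same tr. I've updated the snippet in the question to make that clearer.

It's not clear what you are trying to do, but as far as I understand, there are many tbody elements, and within a tbody only the first row has a "Label".

I've updated the snippet in the question; it should be clearer now.

I couldn't use the third option, since I don't have XPath 2.0, but the first two do the job :) Thanks.

It looks like it's the (...)[...] construct in the third one. That already requires XPath 2.0. It might work if the position check is moved before /text(). (I'll edit)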

With XPath 2.0, the except operator can express the same selection directly:

//tr[@class="Entry"]//tr/td[contains(text(), "Section title")]/following::td[@class = "Data"]/text() except //tr[@class="Entry"]//tr/td[contains(text(), "Section title")]/following::td[@class="Label"][1]/following::td[@class = "Data"]/text()
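Where XPath 2.0 is not available, the node-set difference that except performs can be approximated in Python. The sketch below is an assumption-laden illustration, not the answerer's code: it uses only xml.dom.minidom, a shortened sample table, and hypothetical helpers (text_of, following, node_except), and it again treats the flat td list as a stand-in for the following:: axis.

```python
from xml.dom import minidom

# Shortened sample with the same shape as the table in the question.
ROWS = """<tbody>
<tr><td class="Label">Uninteresting section</td><td class="Data">skip</td></tr>
<tr><td class="Label">Interesting section</td><td class="Data">I want this-1</td></tr>
<tr><td></td><td class="Data">I want this-2</td></tr>
<tr><td class="Label">Uninteresting section</td><td class="Data">skip</td></tr>
</tbody>"""


def text_of(node):
    """Direct text content of a cell, stripped."""
    return "".join(c.data for c in node.childNodes
                   if c.nodeType == c.TEXT_NODE).strip()


def following(cells, anchor, css_class):
    """Cells after `anchor` in document order that carry the given class."""
    pos = next(i for i, td in enumerate(cells) if td is anchor)
    return [td for td in cells[pos + 1:]
            if td.getAttribute("class") == css_class]


def node_except(left, right):
    """Nodes in `left` that are not in `right`, compared by identity."""
    drop = set(id(n) for n in right)
    return [n for n in left if id(n) not in drop]


cells = list(minidom.parseString(ROWS).getElementsByTagName("td"))
label = next(td for td in cells
             if td.getAttribute("class") == "Label"
             and "Interesting section" in text_of(td))

wanted_and_later = following(cells, label, "Data")   # left operand of except
next_labels = following(cells, label, "Label")
# Right operand of except: Data cells after the next Label, if there is one.
later_only = following(cells, next_labels[0], "Data") if next_labels else []

result = [text_of(td) for td in node_except(wanted_and_later, later_only)]
print(result)
```

Comparing by identity rather than by text matters here, because the unwanted cells may have the same text as each other or as wanted cells.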