How do I search for content with XPath in multi-line text using Python? (Python, XPath, lxml)

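When I use contains() to check whether an element's text() contains some data, it works for plain text, but not when the element's content includes carriage returns, newlines, or nested tags. How can I make

//td[contains(text(), "")]

work in this case? Thank you all!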

XML:

<table>
  <tr>
    <td>
      Hello world <i> how are you? </i>
      Have a wonderful day.
      Good bye!
    </td>
  </tr>
  <tr>
    <td>
      Hello NJ <i>, how are you?
      Have a wonderful day.</i>
    </td>
  </tr>
</table>

Python:

>>> import lxml.html as lh  # assumption: lh is an alias for lxml.html
>>> tdout = open('tdmultiplelines.htm', 'r')
>>> tdouthtml = lh.parse(tdout)
>>> tdout.close()
>>> tdouthtml
<lxml.etree._ElementTree object at 0x2aaae0024368>
>>> tdouthtml.xpath('//td/text()')
['\n      Hello world ', '\n      Have a wonderful day.\n      Good bye!\n    ', '\n      Hello NJ ', '\n    ']
>>> tdouthtml.xpath('//td[contains(text(),"Good bye")]')
[]  ##-> But *Good bye* is already in the `td` contents, though as a list.
>>> tdouthtml.xpath('//td[text() = "\n      Hello world "]')
[<Element td at 0x2aaae005c410>]


Use . instead of text():

//td[contains(., "Good bye")]

Use:

//td[text()[contains(.,'Good bye')]]


Explanation:


The cause of the problem is not that the string value of a text node is a multi-line string. The real cause is that the td element has more than one text-node child.

In the provided expression:

//td[contains(text(),"Good bye")]

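the first argument passed to the function contains() is a node-set containing more than one text node.

According to the XPath 1.0 specification (in XPath 2.0 this simply raises a type error), when a function that requires a string argument is passed a node-set instead, only the string value of the first node in the node-set (in document order) is used.

In this specific case, the first text node of the passed node-set has the string value: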

 "
                 Hello world "


so the comparison fails and the wanted td element is not selected.

XSLT-based verification:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="/">
  <xsl:copy-of select="//td[text()[contains(.,'Good bye')]]"/>
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied to the provided XML document:

<table>
  <tr>
    <td>
      Hello world <i> how are you? </i>
      Have a wonderful day.
      Good bye!
    </td>
  </tr>
  <tr>
    <td>
      Hello NJ <i>, how are you?
      Have a wonderful day.</i>
    </td>
  </tr>
</table>

the XPath expression is evaluated and the selected nodes (only one in this case) are copied to the output:

<td>
      Hello world <i> how are you? </i>
      Have a wonderful day.
      Good bye!
    </td>
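
The same check can also be reproduced in Python instead of XSLT. What follows is only a minimal sketch, assuming lxml is installed. It parses the sample document from an inline string with lxml.etree (the original session read tdmultiplelines.htm through lh.parse instead), and the variable names are illustrative rather than taken from the question.

```python
from lxml import etree

# The sample document from the question, inlined for a self-contained run.
XML = """<table>
  <tr>
    <td>
      Hello world <i> how are you? </i>
      Have a wonderful day.
      Good bye!
    </td>
  </tr>
  <tr>
    <td>
      Hello NJ <i>, how are you?
      Have a wonderful day.</i>
    </td>
  </tr>
</table>"""

root = etree.fromstring(XML)

# contains() needs a string, so XPath 1.0 converts the node-set text()
# using only its first node; "Good bye" lives in the second text node.
first_only = root.xpath('//td[contains(text(), "Good bye")]')

# Testing every text node child individually finds the match.
per_node = root.xpath('//td[text()[contains(., "Good bye")]]')

# The string value that contains() actually saw for the first td.
first_text = root.xpath('string((//td)[1]/text())')

print(len(first_only), len(per_node), repr(first_text))
```

With the sample document, the first query should select nothing and the second should select only the first td, which matches the XSLT result shown above.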

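Thanks for the explanation. //td[text()[contains(., 'Good')]] looks similar to //td[contains(., 'Good')] to me. I am selecting this as the answer for helping me and others understand this.

@ThinkCode: You are welcome. Actually, //td[contains(., "Good bye")] may produce false positives, because . is converted to the string value of the context node. If an element has several text-node descendants, all of them are concatenated to form its string value. If an element has two consecutive text-node children, the first ending with the beginning of the search string and the second starting with the rest of it, you probably do not want that element to be selected.

Hmm, I am a bit confused. Could you give an example of the difference between the two implementations? Thank you very much.

@ThinkCode: Suppose an element's text nodes are string1, string2 and string3, separated by child elements, and you are looking for a text node that contains string1string2.

@ThinkCode: In my solution I check whether any single text node contains the whole search string. So if you search for "core" and there are two consecutive text nodes, "Humphry&co" and "related", the other solution selects the parent of these text nodes, while my solution does not.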
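
The difference described in the last comment can be tried out directly. This is a hedged sketch, assuming lxml is installed; the wrapper element name row and the empty b element that splits the two text nodes are hypothetical, chosen only to reproduce the "Humphry&co" / "related" situation.

```python
from lxml import etree

# Two text nodes, "Humphry&co" and "related", separated by an empty element.
# The element names row and b are made up for this illustration.
doc = etree.fromstring("<row><td>Humphry&amp;co<b/>related</td></row>")

# "." is the string value of the whole td: "Humphry&corelated".
# It contains "core" only because the two text nodes are concatenated.
whole_value = doc.xpath('//td[contains(., "core")]')

# Each text node is tested on its own, and neither contains "core".
single_node = doc.xpath('//td[text()[contains(., "core")]]')

print(len(whole_value), len(single_node))
```

Which behaviour is preferable depends on the goal: the string-value form matches text that spans nested markup, while the per-text-node form avoids matches that exist only because adjacent nodes were joined.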