The most Pythonic way to find an element's sibling in XML

python xml xpath

Question: I have the following XML snippet:

...snip...
<p class="p_cat_heading">DEFINITION</p>
<p class="p_numberedbullet"><span class="calibre10">This</span>, <span class="calibre10">these</span>. </p>
<p class="p_cat_heading">PRONUNCIATION </p>
..snip...

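Things I've tried:

Using lxml's getnext method. This gets the next sibling that has the class "p_cat_heading", which is not what I want.
following_sibling: lxml is supposed to support this, but it throws "following sibling not found in prefix map".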

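My solution:

I haven't finished it yet, but since my XML is short, I was just going to get a list of all elements, iterate to the element whose text is DEFINITION, and then keep iterating until the next element with the p_cat_heading class. This solution is horrible and ugly, but I can't seem to find a clean alternative.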
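
A rough sketch of that brute-force idea, using only the standard library's xml.etree.ElementTree. This is not the asker's actual code: the `<root>` wrapper and the helper name `text_under_heading` are assumptions added here so the fragment parses as a single document.

```python
import xml.etree.ElementTree as ET

# Assumption: the fragment is wrapped in a hypothetical <root> element,
# because ElementTree needs exactly one top-level element.
SNIPPET = """<root>
<p class="p_cat_heading">DEFINITION</p>
<p class="p_numberedbullet"><span class="calibre10">This</span>, <span class="calibre10">these</span>. </p>
<p class="p_cat_heading">PRONUNCIATION </p>
</root>"""


def text_under_heading(root, heading):
    """Collect the text of the elements between `heading` and the next heading."""
    collected = []
    inside = False
    for elem in root:  # direct children, in document order
        if elem.get("class") == "p_cat_heading":
            if inside:
                break  # reached the next heading: stop collecting
            inside = (elem.text or "").strip() == heading
        elif inside:
            # itertext() also picks up text nested in the <span> children
            collected.append("".join(elem.itertext()).strip())
    return collected


root = ET.fromstring(SNIPPET)
print(text_under_heading(root, "DEFINITION"))
```

It works for a flat list of siblings like the one shown, but it has to walk every element by hand, which is the clumsiness the question is trying to avoid.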

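What I'm looking for:

A more Pythonic way of printing the definition, which in our case is "This, these". The solution can use XPath or some alternative. Python-native solutions are preferred, but anything will do.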

You can use BeautifulSoup with CSS selectors for this task. The selector

.p_cat_heading:contains("DEFINITION") ~ .p_cat_heading

selects every element with class p_cat_heading that is preceded by a sibling with class p_cat_heading containing the string "DEFINITION":

data = '''
<p class="p_cat_heading">THIS YOU DONT WANT</p>
<p class="p_numberedbullet"><span class="calibre10">This</span>, <span class="calibre10">these</span>. </p>
<p class="p_cat_heading">DEFINITION</p>
<p class="p_numberedbullet"><span class="calibre10">This</span>, <span class="calibre10">these</span>. </p>
<p class="p_cat_heading">PRONUNCIATION </p>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'lxml')

# "~" is the general sibling combinator: any later sibling that matches.
# Newer soupsieve releases spell the pseudo-class ":-soup-contains" and deprecate ":contains".
for heading in soup.select('.p_cat_heading:contains("DEFINITION") ~ .p_cat_heading'):
    print(heading.text)

EDIT:

To select the direct sibling after DEFINITION:

data = '''
<p class="p_cat_heading">THIS YOU DONT WANT</p>
<p class="p_numberedbullet"><span class="calibre10">This</span>, <span class="calibre10">these</span>. </p>
<p class="p_cat_heading">DEFINITION</p>
<p class="p_numberedbullet"><span class="calibre10">This is after DEFINITION</span>, <span class="calibre10">these</span>. </p>
<p class="p_cat_heading">PRONUNCIATION </p>
<p class="p_numberedbullet"><span class="calibre10">This is after PRONUNCIATION</span>, <span class="calibre10">these</span>. </p>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'lxml')

# "+" is the adjacent sibling combinator: only the element immediately after the heading,
# and :not(.p_cat_heading) rejects it if it is itself another heading.
s = soup.select_one('.p_cat_heading:contains("DEFINITION") + :not(.p_cat_heading)')
print(s.text)
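
If the `:contains` pseudo-class is not available or not wanted, the same lookup can probably be done with BeautifulSoup's navigation methods instead of a selector. The following is only a sketch: the helper name `definition_text` is made up for illustration, and the built-in html.parser is used here instead of lxml so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

data = '''
<p class="p_cat_heading">THIS YOU DONT WANT</p>
<p class="p_numberedbullet"><span class="calibre10">This</span>, <span class="calibre10">these</span>. </p>
<p class="p_cat_heading">DEFINITION</p>
<p class="p_numberedbullet"><span class="calibre10">This is after DEFINITION</span>, <span class="calibre10">these</span>. </p>
<p class="p_cat_heading">PRONUNCIATION </p>'''

soup = BeautifulSoup(data, "html.parser")


def definition_text(soup, heading="DEFINITION"):
    """Return the text of the <p> right after the given heading, or None."""
    for p in soup.find_all("p", class_="p_cat_heading"):
        if p.get_text(strip=True) == heading:
            # find_next_sibling skips the whitespace text nodes between tags
            sibling = p.find_next_sibling("p")
            if sibling is not None and "p_cat_heading" not in sibling.get("class", []):
                return sibling.get_text().strip()
    return None


print(definition_text(soup))
```

Unlike the selector version, this compares the heading text exactly after stripping whitespace, so a heading that merely contains "DEFINITION" would not match.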

There are several ways to do this, but if you let XPath do most of the work, this expression

//*[@class='p_cat_heading'][contains(text(),'DEFINITION')]/following-sibling::*[1]

should work.

Using lxml:

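from lxml import html

# The snippet from the question
data = '''
<p class="p_cat_heading">DEFINITION</p>
<p class="p_numberedbullet"><span class="calibre10">This</span>, <span class="calibre10">these</span>. </p>
<p class="p_cat_heading">PRONUNCIATION </p>'''

# First sibling element after the p_cat_heading element whose text contains DEFINITION
exp = "//*[@class='p_cat_heading'][contains(text(),'DEFINITION')]/following-sibling::*[1]"

tree = html.fromstring(data)
target = tree.xpath(exp)

for i in target:
    print(i.text_content())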

Output:

This, these.
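
Regarding the two attempts mentioned in the question, here is a hedged sketch of how they might be made to work in lxml. It assumes that getnext() is called on the DEFINITION heading element itself, and that the "prefix map" error came from passing the following-sibling axis to find() or findall(), which only accept the limited ElementPath syntax; xpath() accepts full XPath 1.0 axes. The variable names are illustrative only.

```python
from lxml import html

data = '''
<p class="p_cat_heading">DEFINITION</p>
<p class="p_numberedbullet"><span class="calibre10">This</span>, <span class="calibre10">these</span>. </p>
<p class="p_cat_heading">PRONUNCIATION </p>'''

tree = html.fromstring(data)

# Locate the heading element itself first.
heading = tree.xpath("//p[@class='p_cat_heading'][normalize-space()='DEFINITION']")[0]

# getnext() returns the next sibling element, whatever its class is.
via_getnext = heading.getnext().text_content().strip()

# A relative XPath from the heading; the axis goes through xpath(), not find().
via_xpath = heading.xpath("following-sibling::*[1]")[0].text_content().strip()

print(via_getnext)
print(via_xpath)
```

Both lookups start from the heading element, so they return the paragraph directly below it rather than the next heading.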

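Comments:

Can you use BeautifulSoup?

I've used it before with HTML, and if I have to refactor I can, but I was looking for something a bit lighter since this is only a few lines. That said, maybe it is the better solution.

Just to clarify: is the expected output "This, these. PRONUNCIATION"?

It should be "This, these". Sorry again, I should have chosen a less generic Chinese character. But it needs to print the contents of the definition. The original code iterates over the headings successfully; the problem is getting exactly the content under a heading. Edit: to clarify, the definition is the line with numberedbullet in it.

Awesome! Thanks, man. I'll give it two or three days to see if anyone can come up with a way to do this with Python's XML functionality. If not, I'll mark this one as the answer and refactor the rest of the code to use BS. Also, I should have picked a word whose definition is more obvious. I chose the Chinese 這 (this/these), haha. Sorry.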

Output of the second BeautifulSoup example above:

This is after DEFINITION, these.
