Python: Extract XML text, except under specific tags

python xml

I have a sample XML file:

<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
  <author>John Smith</author>
</page>
<page>
 <title>Chapter 2</title>
 <content>Welcome to Chapter 2</content>
 <author>John Doe</author>
</page>

I am using ElementTree to implement this task. Is there an elegant, clean solution?

import bs4

xml = '''<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
  <author>John Smith</author>
</page>
<page>
 <title>Chapter 2</title>
 <content>Welcome to Chapter 2</content>
 <author>John Doe</author>
</page>'''

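# 'lxml' selects lxml's lenient HTML parser, which accepts input without a single root element
soup = bs4.BeautifulSoup(xml, 'lxml')
# Collect a (title, author) pair for every <page>
print([(page.title.text, page.author.text) for page in soup('page')])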

You can use BeautifulSoup as the XML parser.

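"I am using ElementTree to implement this task": that is probably a good starting point. "Is there an elegant, clean solution": quite possibly, but we won't write the solution for you. Show what you have done so far. Please read the site's help pages to learn how to use this site effectively.

What I have so far uses XPath, something like xpath("*/text()"). However, I would like something like a blacklist to filter out the text under unwanted tags. Do you have any suggestions?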
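The blacklist idea from the comment above could be sketched with the standard-library ElementTree alone. This is only an illustration under a few assumptions: the helper name iter_text, the blacklist {'author'}, and the <root> wrapper (added because the sample has two top-level <page> elements and is not well-formed on its own) are not part of the original question.

```python
import xml.etree.ElementTree as ET

# Sample data from the question. The two <page> elements have no common
# parent, so they are wrapped in a hypothetical <root> element below to
# make the document well-formed before parsing.
PAGES = '''<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
  <author>John Smith</author>
</page>
<page>
 <title>Chapter 2</title>
 <content>Welcome to Chapter 2</content>
 <author>John Doe</author>
</page>'''


def iter_text(elem, blacklist):
    """Yield non-blank text under elem, skipping blacklisted subtrees."""
    if elem.tag in blacklist:
        return
    if elem.text and elem.text.strip():
        yield elem.text.strip()
    for child in elem:
        yield from iter_text(child, blacklist)
        # Tail text follows the child's closing tag and belongs to elem,
        # so it is kept even when the child itself is blacklisted.
        if child.tail and child.tail.strip():
            yield child.tail.strip()


root = ET.fromstring('<root>' + PAGES + '</root>')
texts = list(iter_text(root, blacklist={'author'}))
print(texts)
```

Skipping the whole subtree of a blacklisted tag, rather than filtering individual text nodes afterwards, also drops text nested deeper inside that tag. Passing a larger set, for example {'author', 'title'}, filters several tags at once.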

Output of the BeautifulSoup snippet in the answer:

[('Chapter 1', 'John Smith'), ('Chapter 2', 'John Doe')]