Python: parsing an XSD file to get names and descriptions

python regex xsd

I am trying to parse this XSD file, currently in Python, to get the name of each element and the description of its data.

Sample XSD:

You should avoid using regex to parse XML/HTML/JSON, because regex cannot parse nested structures.
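To make the comment above concrete, here is a minimal sketch of how a regex loses track of nesting. The `<item>` markup is a made-up example, not taken from the question's XSD. A lazy pattern pairs the outer opening tag with the first closing tag it meets, while a real parser keeps the levels apart.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical nested markup, used only for illustration.
nested = '<item>outer <item>inner</item> tail</item>'

# The lazy group stops at the FIRST closing tag, which belongs to the inner element.
match = re.search(r'<item>(.*?)</item>', nested)
regex_capture = match.group(1)
print(regex_capture)   # outer <item>inner

# A parser tracks nesting, so the inner element is returned as its own node.
root = ET.fromstring(nested)
inner = root[0]
print(inner.text)      # inner
```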

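Your regex fails to capture every name and description in the text because the character set you chose for the description, [\w\s\.]+, is too narrow. The descriptions contain other characters, such as the parentheses in "(see list)", so the description group stops early and the rest of the expected match fails. Try changing [\w\s\.]+ to .+? and it will work. See the updated regex101 demo link.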
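The difference between the two description groups can be sketched as follows. The shortened `xsd` string and the names `PREFIX`, `narrow` and `lazy` are assumptions made for this illustration. The prefix is the question's pattern with the redundant backslashes dropped.

```python
import re

# Shortened sample in the style of the question's XSD (an assumption for this sketch).
xsd = (
    '<xs:element name="ProductName"><xs:annotation>'
    '<xs:documentation>Provides a name for the product. (see list)</xs:documentation>'
    '</xs:annotation></xs:element>'
    '<xs:element name="ProductSize"><xs:annotation>'
    '<xs:documentation>Describes the size of the product</xs:documentation>'
    '</xs:annotation></xs:element>'
)

# Shared prefix: the name attribute, then everything up to "documentation>".
PREFIX = r'name="(?P<name>\w+)">[\w</.>\d:]+documentation>'

narrow = re.compile(PREFIX + r'(?P<desc>[\w\s.]+)</')  # original description group
lazy = re.compile(PREFIX + r'(?P<desc>.+?)</')         # suggested replacement

narrow_hits = [m.group('name', 'desc') for m in narrow.finditer(xsd)]
lazy_hits = [m.group('name', 'desc') for m in lazy.finditer(xsd)]

print(narrow_hits)  # ProductName is missing: "(" is not in [\w\s.]
print(lazy_hits)    # both elements are captured
```

Note that . does not match a newline by default, so a description that spans several lines would also need the re.DOTALL flag.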

Edit: an example showing how to parse the XML with Beautiful Soup to get the required information

from bs4 import BeautifulSoup

data = '''<xs:element name="ProductDescription"><xs:annotation><xs:documentation>Provides the description of the product</xs:documentation></xs:annotation><xs:complexType><xs:sequence><xs:element name="ProductName"><xs:annotation><xs:documentation>Provides a name for the product. (see list)</xs:documentation></xs:annotation><xs:simpleType><xs:restriction base="xs:token"><xs:enumeration value="Barbie Doll"/><xs:enumeration value="Ken Doll"/></xs:restriction></xs:simpleType></xs:element><xs:element name="ProductSize"><xs:annotation><xs:documentation>Describes the size of the product. (see list)</xs:documentation></xs:annotation><xs:simpleType><xs:restriction base="xs:token"><xs:enumeration value="Small"/><xs:enumeration value="Medium"/><xs:enumeration value="Large"/><xs:enumeration value="Dayum"/></xs:restriction></xs:simpleType></xs:element></xs:sequence></xs:complexType></xs:element>'''

soup = BeautifulSoup(data, 'html.parser')  # name the parser explicitly to avoid bs4's "no parser specified" warning

for element in soup.find_all('xs:element'):
    print(element['name'])  # prints name attribute value
    print(element.find('xs:documentation').get_text(), '\n')  # prints inner text of xs:documentation tag

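That appears to be a dead link. Is there a more common way to parse an XSD to get the same information?

@lwileczek: It seems the link was removed somehow. I have updated the link in my answer again. For parsing XML, you can use Beautiful Soup.

@lwileczek: I have added sample code showing how to extract the same information easily and more reliably with Beautiful Soup. Hope this helps.

Honestly, I think it is fine to use regular expressions to parse HTML/XML. Heck, it can save a lot of lines and lets you avoid some of the shortcomings of other libraries. For yours, you could easily use something like the following: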

>>> re.findall('[\s\S]*?([\s\S]*?)

The regex-based attempt from the question:

import re
import pandas as pd

df = pd.DataFrame({'Names': [ ], 'Description': [ ]})

search_str = r"name=\"(?P<name>\w+)\"\>[\w\<\/\.\>\d:]+documentation\>(?P<desc>[\w\s\.]+)\<\/"
file1 = 'mini_text.xml'

with open(file1, 'r') as f:
    xml_string = f.read()
idx = 0
for m in re.finditer(search_str, xml_string):
    df.loc[idx, 'Names'] = m.group('name')
    df.loc[idx, 'Description'] = m.group('desc')
    idx += 1

df.to_csv('output.txt', index=False, sep="\t")

Output of the Beautiful Soup example:

ProductDescription
Provides the description of the product

ProductName
Provides a name for the product. (see list)

ProductSize
Describes the size of the product. (see list)
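
One comment above asks for a more common way to parse an XSD. A possible answer is the standard library's xml.etree.ElementTree, sketched below. This is not part of the original answers. The sample assumes the file has an xs:schema root that declares the xs namespace, as a real XSD would, although the snippet shown in the thread omits it. The helper name `names_and_docs` is hypothetical.

```python
import xml.etree.ElementTree as ET

XS = 'http://www.w3.org/2001/XMLSchema'
NS = {'xs': XS}

# Assumed sample: the thread's elements wrapped in a namespaced xs:schema root.
xsd = '''<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="ProductDescription">
    <xs:annotation><xs:documentation>Provides the description of the product</xs:documentation></xs:annotation>
    <xs:complexType><xs:sequence>
      <xs:element name="ProductName">
        <xs:annotation><xs:documentation>Provides a name for the product. (see list)</xs:documentation></xs:annotation>
      </xs:element>
    </xs:sequence></xs:complexType>
  </xs:element>
</xs:schema>'''


def names_and_docs(xsd_text):
    """Return (name, documentation) pairs for every xs:element, in document order."""
    root = ET.fromstring(xsd_text)
    results = []
    for element in root.iter('{%s}element' % XS):
        # Look only at the element's own annotation, not at those of nested elements.
        doc = element.find('xs:annotation/xs:documentation', NS)
        text = doc.text.strip() if doc is not None and doc.text else ''
        results.append((element.get('name'), text))
    return results


pairs = names_and_docs(xsd)
for name, description in pairs:
    print(name)
    print(description, '\n')
```

Because the lookup uses the path xs:annotation/xs:documentation, an element without its own documentation gets an empty string instead of borrowing the text of a nested element.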