How can I separate the XML tags in input.txt with Python and then format them nicely (tabs, newlines, nesting)?

python xml

I am trying to take a file.txt that contains the following text:

<a> hello
<a>world

how <a>are
</a>you?</a><a></a></a>

and turn it into text like:

<a>
    hello
    <a>
        world how
        <a>
            are
        </a>
        you?
    </a>
<a>
</a>

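My initial idea was to create an XML item that holds a tag and its content (a list), and then nest further XML items with their own content inside that list, but after spending a while on it I feel I am heading in the wrong direction.

For this I cannot use libraries such as ElementTree; I want to solve the problem from scratch. I am not looking for a complete answer. I am just hoping someone can point me in the right direction so that I do not waste more time designing a useless code base.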
-----------------------------------Answer--------------------------

# Stack is a local helper module (not part of the standard library);
# it is expected to provide push(), pop() and isEmpty().
from stack import Stack
import re

# Indentation used for one nesting level (four spaces, as in the desired output).
INDENT = '    '


def findTag(string):
    # Returns (tag, start index) for the first xml tag in the string, or None.
    match = re.search(r"\<(.+?)\>", string)
    if match is None:
        return None
    return match.group(0), match.start(0)


def isTag(string):
    # Returns True if the string contains an xml tag, otherwise False.
    return re.search(r"\<(.+?)\>", string) is not None


def split_tags_and_string(string):
    # Splits the input into a flat list of tags and text chunks.
    L = []
    for line in string.split("\n"):
        temp = line
        while len(temp) > 0:  # string still has some characters
            tag_tuple = findTag(temp)  # tuple with tag and starting index
            if tag_tuple is not None:  # there is a tag in the temp string
                if tag_tuple[1] == 0:  # tag is at the front of the temp string
                    L.append(tag_tuple[0].strip())
                    temp = temp.replace(tag_tuple[0], '', 1)
                    temp = temp.strip()
                else:  # tag is somewhere other than the front of the temp string
                    L.append(temp[0:tag_tuple[1]].strip())
                    temp = temp.replace(temp[0:tag_tuple[1]], '', 1)
                    temp = temp.strip()
            else:  # there is no tag in the temp string
                L.append(temp.strip())
                temp = temp.replace(temp, '')
    return L


def check_tags(formatted_list):
    # Verifies that the xml is valid: every start tag needs a matching end tag.
    # Returns (is_valid, number of items examined).
    stack = Stack()
    x = 0
    try:
        for item in formatted_list:
            tag = findTag(item)
            if tag is not None:
                if tag[0].find('/') == -1:  # start tag
                    endtag = tag[0][0:1] + '/' + tag[0][1:]
                    if formatted_list.count(tag[0]) != formatted_list.count(endtag):
                        return False, x  # tag count doesn't match
                    stack.push(tag[0])
                else:  # end tag
                    stack.pop()
            x += 1
    except Exception:  # e.g. popping from an empty stack
        return False, x
    if stack.isEmpty():
        return True, x
    else:
        return False, x


def print_xml_list(formatted_list):
    # Builds the indented output string from the list of tags and text chunks.
    indent = 0
    string = str()
    previousIsString = False
    for item in formatted_list:
        if len(item) > 0:
            if isTag(item) and item.find('/') == -1:  # a tag, and not an end tag
                if previousIsString and string[len(string)-5:].find('\n') == -1:
                    # add a newline if there isn't one already
                    string += '\n'
                string += (INDENT*indent + item + '\n')
                indent += 1  # increases indent
                previousIsString = False  # previous isn't a string
            if isTag(item) and item.find('/') != -1:  # a tag, and also an end tag
                if previousIsString:
                    string += '\n'
                indent -= 1  # reduces indent
                string += (INDENT*indent + item + '\n')
                previousIsString = False  # previous isn't a string
            if not isTag(item):
                if previousIsString:
                    string += (' ' + item + ' ')  # adds item without indentation
                else:
                    string += (INDENT*indent + item + ' ')  # adds item with indentation
                previousIsString = True  # previous is a string
    return string


if __name__ == "__main__":
    filename = input("enter file name: ")
    with open(filename, 'r') as file:
        s = file.read()
    formatted = split_tags_and_string(s)  # splits the text into a list of tags and strings
    isGood = check_tags(formatted)  # makes sure the xml is valid
    if not isGood[0]:  # if the xml is bad, say so and end the program
        print("The xml file is bad.")
    else:
        string = print_xml_list(formatted)  # adds indentation and turns the list into a string
        print(string)  # prints the final result

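Nobody provided an answer, so the code above is my basic approach to parsing XML; it is not a full-featured parser. Hopefully it is useful to anyone with a similar curiosity.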
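For comparison, the same two steps (split the text into tags and text chunks, then indent by nesting depth while checking that the tags balance) can probably be sketched more compactly with a plain list as the stack. This is only a sketch under some assumptions: the names `tokenize`, `tag_name` and `format_tokens` are made up for illustration, every piece of text is put on its own line, and comments, self-closing tags and stray angle brackets are not handled.

```python
import re

# Matches either a tag such as <a> or </a>, or a run of text between tags.
TOKEN_PATTERN = re.compile(r"<[^<>]+>|[^<>]+")


def tokenize(text):
    """Split text into tags and whitespace-normalized text chunks."""
    tokens = []
    for raw in TOKEN_PATTERN.findall(text):
        token = " ".join(raw.split())  # collapse newlines and repeated spaces
        if token:
            tokens.append(token)
    return tokens


def tag_name(token):
    """Return the element name of a tag token, e.g. 'a' for '<a>' or '</a>'."""
    parts = token.strip("</>").split()
    return parts[0] if parts else ""


def format_tokens(tokens, indent="    "):
    """Return the indented text, or raise ValueError if tags are unbalanced."""
    open_tags = []  # a plain list used as a stack of open element names
    lines = []
    for token in tokens:
        if token.startswith("</"):
            # A closing tag must match the most recently opened element.
            if not open_tags or open_tags.pop() != tag_name(token):
                raise ValueError("unexpected closing tag: " + token)
            lines.append(indent * len(open_tags) + token)
        elif token.startswith("<"):
            lines.append(indent * len(open_tags) + token)
            open_tags.append(tag_name(token))
        else:
            # Plain text is indented one level below its enclosing tag.
            lines.append(indent * len(open_tags) + token)
    if open_tags:
        raise ValueError("unclosed tag: <" + open_tags[-1] + ">")
    return "\n".join(lines)
```

Calling `print(format_tokens(tokenize(text)))` on the contents of the input file would print the indented form, and a `ValueError` signals tags that do not balance.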
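Some thoughts: your names don't follow the naming conventions; variables, functions, and methods should be named like find_between_tags, not findBetweenTags. You don't need getters and setters such as getContent and setContent; just access content directly.
Read more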
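To illustrate the two points in this comment, here is a minimal sketch of the nested-item idea from the question, written with snake_case names and a `content` attribute that is accessed directly. The class name `XmlItem` and the function `render_item` are hypothetical, chosen only for this example; they are not taken from the asker's code.

```python
class XmlItem:
    """One element: a tag name plus a list of text strings and child items."""

    def __init__(self, tag):
        self.tag = tag
        self.content = []  # read and modified directly, no getContent/setContent


def render_item(item, depth=0, indent="    "):
    """Return the indented lines for an item and everything nested in it."""
    pad = indent * depth
    lines = [pad + "<" + item.tag + ">"]
    for child in item.content:
        if isinstance(child, XmlItem):
            # Nested items are rendered recursively, one level deeper.
            lines.extend(render_item(child, depth + 1, indent))
        else:
            lines.append(pad + indent + child)
    lines.append(pad + "</" + item.tag + ">")
    return lines


# Build a small tree by appending to content directly.
root = XmlItem("a")
inner = XmlItem("a")
inner.content.append("world how")
root.content.append("hello")
root.content.append(inner)
print("\n".join(render_item(root)))
```

Because `content` is an ordinary attribute, callers can append to it or iterate over it without wrapper methods; a property could be added later if validation ever becomes necessary.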