Extracting tags from XML using Python


python regex xml


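I'm trying to extract tags from an XML file using the re module in Python. I need to extract the tags starting with the tag "...

However, if for whatever reason you want to stick with regular expressions, you can solve this by first matching each whole unit and then running your regular expressions on every string in the list that the first regex returns. Sample code, run on a small portion of the input that I downloaded: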

import re

# Read the whole file into a string; re.findall needs text, not a file handle
with open('ALICE.per1_replaced.txt', 'r') as f:
    t = f.read()

# one string per <unit>...</unit> block
units = re.findall('<unit.*?</unit>', t, re.DOTALL)
unitList = []
for unit in units:
    # first get the opening unit tag
    unitid = re.findall('<unit.*?"pe">', unit, re.DOTALL)  # same as the one you use
    # there should only be one within each unit
    assert len(unitid) == 1
    # now find all PEs for this unit
    PE = re.findall('<PE.*?</PE>', unit, re.DOTALL)  # same as the one you use
    # combine results
    output = unitid[0] + "\n"
    for pe in PE:
        output += pe + "\n"
    unitList.append(output)

for x in unitList:
    print(x)

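Regex by itself may not be the right tool for this job. Why not use the xml or BeautifulSoup package?

This code works well. What does "assert (len(unitid) == 1)" do?

It's a sanity check I added to make sure there is exactly one unit tag per loop iteration. If it fails, Python raises an AssertionError, which prints the assertion failure and stops the program unless it is caught.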
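To illustrate the parser suggestion from the comment above, here is a minimal sketch using the standard library's xml.etree.ElementTree. It assumes the input is well-formed XML with a single root element; the `<units>` wrapper, the second PE element, and the `results` dictionary are hypothetical and only there to make the example self-contained. The real file may need different handling.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a single root wraps one unit that holds two PE elements.
sample = """<units>
<unit id="16" status="FINISHED" type="pe">
<PE producer="A1.ALICE_GG"><html><body>Eu vou me atrasar!</body></html></PE>
<PE producer="A2.ALICE_GG"><html><body>Estou atrasado!</body></html></PE>
</unit>
</units>"""

root = ET.fromstring(sample)

# Map each unit id to the producers of the PE elements nested inside it.
results = {}
for unit in root.iter("unit"):
    if unit.get("type") != "pe":
        continue  # skip units of other types
    producers = []
    for pe in unit.iter("PE"):
        producers.append(pe.get("producer"))
        # itertext() joins the text of the nested html/body elements
        text = "".join(pe.itertext()).strip()
        print(unit.get("id"), pe.get("producer"), text)
    results[unit.get("id")] = producers
```

Because the parser keeps every PE attached to its enclosing unit, no separate step is needed to match headers with their contents.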
import re

with open('ALICE.per1_replaced.txt', 'r') as t:
    contents = t.read()

unitid = re.findall('<unit.*?"pe">', contents, re.DOTALL)
PE = re.findall('<PE.*?</PE>', contents, re.DOTALL)
with open('PEtagsper1.txt', 'w') as fi:
    for i, p in zip(unitid, PE):
        fi.write("{}\n{}\n".format(i, p))
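One possible reason the flat approach above misbehaves is that zip pairs the two lists purely by position. The sketch below assumes that some units contain more than one PE element, which the thread suggests but does not state outright; the sample string is hypothetical.

```python
import re

# Hypothetical sample: unit 1 holds two PE elements, unit 2 holds one.
sample = (
    '<unit id="1" status="FINISHED" type="pe">'
    '<PE producer="A1">first</PE><PE producer="A2">second</PE></unit>\n'
    '<unit id="2" status="FINISHED" type="pe">'
    '<PE producer="A1">third</PE></unit>'
)

unitid = re.findall('<unit.*?"pe">', sample, re.DOTALL)
PE = re.findall('<PE.*?</PE>', sample, re.DOTALL)

# zip pairs items by position and stops at the end of the shorter list,
# so it knows nothing about which unit a PE came from.
pairs = list(zip(unitid, PE))
for header, pe in pairs:
    print(header, "->", pe)
```

Here there are two unit headers and three PE elements, so the header of unit 2 ends up next to the second PE of unit 1 and the last PE is dropped. Matching each whole unit first and searching inside it avoids this.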

<unit id="16" status="FINISHED" type="pe">
<PE producer="A1.ALICE_GG"><html>
  <head>

  </head>
  <body>
    Eu vou me atrasar!' (quando ela voltou a pensar sobre isso mais trade, 
    ocorreu-lhe que deveria ter achado isso curioso, mas na hora tudo pareceu 
    bastante natural); mas quando o Coelho de fato tirou um relógio do bolso 
    do colete e olhou-o, e então se apressou, Alice pôs-se de pé, pois lhe 
    ocorreu que nunca antes vira um coelho com um colete, ou com um relógio de 
    bolso pra tirar, e queimando de curiosidade, ela atravessou o campo atrás 
    dele correndo e, felizmente, chegou justo a tempo de vê-lo entrar dentro 
    de uma grande toca de coelho sob a cerca.
  </body>
</html></PE>
