Python XML: removing line breaks inside tags


The problem is that some of the XML files I scraped from the SEC have line breaks inside tags, so these files are not well-formed.

<footnote id="F4">Shares sold on the open market are reported as an average sell price per share of $56.87; breakdown of shares sold and per share sale prices are as follows; 100 at $56.31; 200 at $56.32; 100 at $56.33; 198 at $56.39; 600 at $56.40; 100 at $56.41; 102 at $56.42; 600 at $56.44; 320 at $56.45; 100 at $56.46; 900 at $56.47; 480 at $56.48; 300 at $56.49; 1,200 at $56.50; 400 at $56.51; 1,130 at $56.52; 600 at $56.53; 100 at $56.54; 1,500 at $56.55; 600 at $56.56; 644 at $56.57; 1,656 at $56.58; 1,070 at $56.59; 2069 at $56.60; 1,831 at $56.61; 1,000 at $56.62; 1,000 at $56.63; 492 at $56.64; 1,400 at $56.65; 920 at $56.66; 1,000 at $56.67; 600 at $56.68; 500 at $56.69; 1,200 at $56.70; 500 at $56.71; 582 at $56.72; 400 at $56.73; 1,108 at $56.74; 37 at $56.75; 710 at $56.76; 630 at $56.77; 1,600 at $56.78; 400 at $56.79; 400 at $56.80; 1,500 at $56.81; 1,100 at $56.82; 100 at $56.83; 800 at $56.84; 200 at $56.85; 1,300 at $56.87; additional shares sold continued on Footnote (5).</footnot
e>
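My first thought was that this was caused by the difference between UTF-8 and ISO-8859-1 encoding, but the problem persisted after I changed the encoding. My next idea was a regular expression that detects these line breaks inside tags, but since they can occur anywhere, that solution is not very reliable.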

Do you have any ideas on how to solve this?

It can be done this way:

import re

# open the txt file
with open("0001112679-10-000086.txt", "r", encoding="utf8") as f:
    txt = f.read()

# cut out the xml part from the txt file
start = txt.find("<XML>")
end = txt.find("</XML>") + 6
xml = txt[start:end]

# process the xml part: drop each newline that follows exactly 1023 non-newline characters
xml = re.sub(r"([^\n]{1023})\n", r"\1", xml)

# combine a new txt back from the parts
new_txt = txt[:start] + xml + txt[end:]

# save the new txt in file
with open("0001112679-10-000086_output.txt", "w", encoding="utf8") as f:
    f.write(new_txt)
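
To check that the substitution above actually restores well-formedness, the same regex can be wrapped in a function and the result parsed with the standard library. This is only a minimal sketch: the names `join_wrapped_lines` and `is_well_formed` are hypothetical, and the sample string is synthetic, with a tiny wrap width of 27 instead of 1023 so the effect is easy to see.

```python
import re
import xml.etree.ElementTree as ET


def join_wrapped_lines(xml_text, width=1023):
    """Remove each newline that directly follows `width` non-newline characters.

    Hypothetical helper; it uses the same pattern as the answer, with the
    width made configurable.
    """
    pattern = re.compile(r"([^\n]{%d})\n" % width)
    return pattern.sub(r"\1", xml_text)


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Synthetic sample: the closing tag is split after 27 characters.
broken = "<footnote>abcdefgh</footnot\ne>"
fixed = join_wrapped_lines(broken, width=27)

print(is_well_formed(broken))
print(is_well_formed(fixed))
```

Note that the pattern joins any line that has at least the wrap width in characters before its newline, so a line that legitimately ends at that length would be joined too. Parsing the result afterwards is a cheap way to notice when the assumption does not hold for a given file.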

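Can you share a link to such an XML file? Sure. The problem is the line length: after 1024 characters a line continues on the next line, so try to come up with logic that finds a line of 1024 characters whose next line ends with > and glues the two together. These are text files that only look a bit like XML. Good luck.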
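
The gluing logic described in the comment could also be written line by line, without a regex. The sketch below rests on assumptions: the function name `glue_wrapped_lines` is hypothetical, and the default width of 1023 follows the answer's regex (1023 characters plus the newline gives 1024), which may not be exactly what the comment means by "1024 characters". It glues every line of exactly that length to the next one and does not check whether the next line ends with `>`.

```python
def glue_wrapped_lines(text, width=1023):
    """Join every line of exactly `width` characters with the line after it.

    Hypothetical helper: a line of exactly `width` characters is assumed
    to have been hard-wrapped, so the newline after it is dropped.
    """
    lines = text.split("\n")
    pieces = []
    for i, line in enumerate(lines):
        pieces.append(line)
        is_last = i == len(lines) - 1
        # Keep the newline unless this line looks hard-wrapped.
        if not is_last and len(line) != width:
            pieces.append("\n")
    return "".join(pieces)


# Tiny width for illustration: "abcde" is exactly 5 characters long,
# so it is glued to the following line.
print(glue_wrapped_lines("abcde\nfg\nhi", width=5))
```

Unlike the regex, this version only joins lines whose length is exactly the wrap width, so longer lines keep their newline.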