Parsing USPTO bulk XML files with Python


python xml


import xml.etree.ElementTree as ET
import csv
import re
import codecs
import io


xml = open('ipa110106.xml')
line_num=0
f = open('workfile.xml', 'w')

for  line in xml:
   line_num+=1
   if line_num == 1:
       print (line)

   if '<?xml version="1.0" encoding="UTF-8"?>' in line and line_num !=1:
       count =count+1
       line = line.replace('<?xml version="1.0" encoding="UTF-8"?>', '')
   if '<!DOCTYPE us-patent-application SYSTEM "us-patent-application-v42-2006-08-23.dtd" [ ]>' in line:
       line = line.replace('<!DOCTYPE us-patent-application SYSTEM "us-patent-application-v42-2006-08-23.dtd" [ ]>', '')  
       count2+=1
   if "!DOCTYPE" in line:
    line=line.replace('<!DOCTYPE sequence-cwu SYSTEM "us-sequence-listing.dtd" [ ]>','')  
   f.write(line)  
f.close()

with open("workfile.xml") as f:
 xml = f.read()
 tree = ET.fromstring(re.sub(r"(<\?xml[^>]+\?>)", r"\1<root>", xml) + "</root>")
 root= tree.getroot()



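I am trying to parse USPTO XML files to extract the relevant information. Each file is a concatenation of many XML documents, so, following the standard advice given on this forum, I removed the repeated instances of:

<?xml version="1.0" encoding="UTF-8"?>

and

<!DOCTYPE us-patent-application SYSTEM "us-patent-application-v42-2006-08-23.dtd" [ ]>

because they also cause an error:

ParseError: not well-formed (invalid token): line 2, column 2

Finally, after removing these troublesome lines, I wrapped the content in a synthetic parent root element to turn the file into well-formed XML. However, when I parse this file and try to access its root, I get an error. I have included the code and its output in this post.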

<?xml version="1.0" encoding="UTF-8"?>

0
Traceback (most recent call last):

  File "<ipython-input-164-4d6fc9ea9aac>", line 1, in <module>
    runfile('C:/Users/Harshit/Downloads/ipa110106 (1)/parsing_test5.py', wdir='C:/Users/Harshit/Downloads/ipa110106 (1)')

  File "C:\Users\Harshit\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 866, in runfile
    execfile(filename, namespace)

  File "C:\Users\Harshit\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "C:/Users/Harshit/Downloads/ipa110106 (1)/parsing_test5.py", line 41, in <module>
    root= tree.getroot()

AttributeError: 'xml.etree.ElementTree.Element' object has no attribute 'getroot'


Also, the XML file is large, so I can only share a link to it.

A small sample of the XML(-like) file is shown further below.

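The current PTO XML files are valid XML if you split them at the XML declarations and process each publication separately. I would expect processing them all at once to use a very large amount of memory. Either way, the replacements you are doing are not needed.

My solution was to create a class that owns the zipfile (for others who may not know, the data is a zip file containing a single file of concatenated XML documents) and has a function that yields each XML document in turn. I then use ET.XML() to process each document.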
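The splitting approach described above can be sketched roughly as follows. This is a minimal illustration, not the answerer's actual class: the helper name `iter_documents` is hypothetical, the in-memory sample stands in for the real bulk file, and the second publication number is made up for the example. It assumes every XML declaration starts at the beginning of a line.

```python
import io
import xml.etree.ElementTree as ET


def iter_documents(lines):
    """Yield each concatenated XML document as a single string.

    Hypothetical helper: assumes each document starts with an
    XML declaration at the beginning of a line.
    """
    buffer = []
    for line in lines:
        # A new declaration means the previous document is complete.
        if line.startswith("<?xml") and buffer:
            yield "".join(buffer)
            buffer = []
        buffer.append(line)
    if buffer:
        yield "".join(buffer)


# Tiny stand-in for the bulk file; the second doc-number is invented.
sample = io.StringIO(
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!DOCTYPE us-patent-application SYSTEM "us-patent-application-v42-2006-08-23.dtd" [ ]>\n'
    '<us-patent-application><doc-number>20110000001</doc-number></us-patent-application>\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!DOCTYPE us-patent-application SYSTEM "us-patent-application-v42-2006-08-23.dtd" [ ]>\n'
    '<us-patent-application><doc-number>20110000002</doc-number></us-patent-application>\n'
)

# Each document is parsed on its own, declaration and DOCTYPE included.
doc_numbers = [ET.XML(doc).findtext("doc-number") for doc in iter_documents(sample)]
print(doc_numbers)
```

For the real data, the same generator could presumably be fed from the archive by wrapping the member returned by `zipfile.ZipFile.open()` in `io.TextIOWrapper(..., encoding="utf-8")`. Only one publication is held in memory at a time, which is the point of splitting.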

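Where is the code? And a sample of the XML? Don't post code in external links; please include it in your post.

Hi, I have posted the code and a link to the XML. Sorry the post was incomplete.

Thanks for the prompt reply. I have just started splitting and was able to correctly extract one XML document from the list. I will repeat that for the rest. However, I still have to remove the DOCTYPE line, because it gives an invalid token error.

That is strange. A DOCTYPE is valid in an XML file, and there is nothing wrong with the DOCTYPE you are removing. At least you have something that works.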
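As for the AttributeError in the traceback itself: `ET.fromstring()` (like `ET.XML()`) returns the root `Element` directly, whereas `getroot()` is a method of the `ElementTree` object returned by `ET.parse()`. A small sketch of the difference, using a made-up one-element document:

```python
import io
import xml.etree.ElementTree as ET

text = "<root><doc-number>20110000001</doc-number></root>"

# fromstring() parses a string and hands back the root Element itself.
element = ET.fromstring(text)

# parse() reads from a file (or file-like object) and returns an
# ElementTree wrapper; getroot() then gives the root Element.
tree = ET.parse(io.StringIO(text))
root = tree.getroot()

# The Element has no getroot(), which is what the traceback reports.
has_getroot = hasattr(element, "getroot")
print(element.tag, root.tag, has_getroot)
```

So in the posted script, the result of `ET.fromstring(...)` can be used as the root without calling `getroot()`. Note also that `count` and `count2` are never initialized in the script as posted, so it would raise a NameError in a fresh interpreter.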


<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-application SYSTEM "us-patent-application-v42-2006-08-23.dtd" [ ]>
<us-patent-application lang="EN" dtd-version="v4.2 2006-08-23" file="US20110000001A1-20110106.XML" status="PRODUCTION" id="us-patent-application" country="US" date-produced="20101222" date-publ="20110106">
<us-bibliographic-data-application lang="EN" country="US">
<publication-reference>
<document-id>
<country>US</country>
<doc-number>20110000001</doc-number>
<kind>A1</kind>
<date>20110106</date>
</document-id>
</publication-reference>
<application-reference appl-type="utility">
<document-id>
<country>US</country>
<doc-number>12838840</doc-number>
<date>20100719</date>
</document-id>
</application-reference>
<us-application-series-code>12</us-application-series-code>
<priority-claims>
<priority-claim sequence="01" kind="national">
<country>IL</country>
<doc-number>189088</doc-number>
<date>20080128</date>
</priority-claim>
</priority-claims>
<classifications-ipcr>
<classification-ipcr>