Parsing a non-standard XML file in Python


My input file is actually multiple XML files appended into a single file (this is the file I have). Its structure looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>
<root_node>...</root_node>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>
<root_node>...</root_node>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>
<root_node>...</root_node>

...
...
...
Python's xml.dom.minidom cannot parse this non-standard file. What is a better way to parse it? I also don't know whether the code below has good performance:

for line in infile:
  if line == '<?xml version="1.0" encoding="UTF-8"?>': 
    xmldoc = minidom.parse(XMLstring)
  else:
    XMLstring += line
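
To see why xml.dom.minidom rejects the file as a whole, here is a minimal illustration; "patents.xml" is a hypothetical name standing in for the concatenated input:

from xml.dom import minidom

# A well-formed XML document allows at most one XML declaration and exactly one
# root element, so feeding the whole concatenated file to minidom fails with an
# ExpatError ("junk after document element").
try:
    minidom.parse("patents.xml")  # hypothetical name for the concatenated input
except Exception as err:
    print(err)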

I don't know minidom, and I don't know much about XML parsing either, but I use XPath to parse XML/HTML.

You can find some XPath examples online.
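
As a stand-in for those examples, here is a minimal XPath query with lxml; the element nesting below is an assumption made up for the illustration, loosely modelled on the us-patent-grant documents above:

from lxml import etree

# Build one small, well-formed document and query it with XPath.
doc = etree.fromstring(
    "<us-patent-grant>"
    "  <us-bibliographic-data-grant>"
    "    <invention-title>Glove backhand</invention-title>"
    "  </us-bibliographic-data-grant>"
    "</us-patent-grant>"
)
print(doc.xpath('//invention-title/text()'))  # -> ['Glove backhand']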

I chose to parse each XML chunk separately.

You appear to be doing that already in your sample code. Here is my take on your code:

def parse_xml_buffer(buffer):
    dom = minidom.parseString("".join(buffer))  # join list into string of XML
    # .... parse dom ...

buffer = [file.readline()]  # initialise with the first line
for line in file:
    if line.startswith("<?xml "):
        parse_xml_buffer(buffer)
        buffer = []  # reset buffer
    buffer.append(line)  # list operations are faster than concatenating strings
parse_xml_buffer(buffer)  # parse final chunk
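
For completeness, a minimal way to drive that loop, assuming the concatenated input is saved as "patents.xml" (a hypothetical name) and that parse_xml_buffer only needs to report each document's root element:

from xml.dom import minidom

def parse_xml_buffer(buffer):
    dom = minidom.parseString("".join(buffer))
    print(dom.documentElement.tagName)   # e.g. "us-patent-grant" (or "root_node" in the excerpt)

with open("patents.xml") as infile:      # hypothetical file name
    buffer = [infile.readline()]         # first line is the first <?xml ...?> declaration
    for line in infile:
        if line.startswith("<?xml "):
            parse_xml_buffer(buffer)     # a complete document has been buffered
            buffer = []
        buffer.append(line)
    parse_xml_buffer(buffer)             # the last document has no <?xml line after it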

Here is my take on it, using a generator and lxml.etree. Example extracted information is shown after the code:

import urllib2, os, zipfile
from lxml import etree

def xmlSplitter(data,separator=lambda x: x.startswith('<?xml')):
  buff = []
  for line in data:
    if separator(line):
      if buff:
        yield ''.join(buff)
        buff[:] = []
    buff.append(line)
  yield ''.join(buff)

def first(seq,default=None):
  """Return the first item from sequence, seq or the default(None) value"""
  for item in seq:
    return item
  return default

datasrc = "http://commondatastorage.googleapis.com/patents/grantbib/2011/ipgb20110104_wk01.zip"
filename = datasrc.split('/')[-1]

if not os.path.exists(filename):
  with open(filename,'wb') as file_write:
    r = urllib2.urlopen(datasrc)
    file_write.write(r.read())

zf = zipfile.ZipFile(filename)
xml_file = first([ x for x in zf.namelist() if x.endswith('.xml')])
assert xml_file is not None

count = 0
for item in xmlSplitter(zf.open(xml_file)):
  count += 1
  if count > 10: break
  doc = etree.XML(item)
  docID = "-".join(doc.xpath('//publication-reference/document-id/*/text()'))
  title = first(doc.xpath('//invention-title/text()'))
  assignee = first(doc.xpath('//assignee/addressbook/orgname/text()'))
  print "DocID:    {0}\nTitle:    {1}\nAssignee: {2}\n".format(docID,title,assignee)
Comments:

- I posted a version that uses a generator too, but it looks like you know them better than I do. +1
- @MattH How did you know the address http://commondatastorage.googleapis.com/patents/grantbib/2011/ipgb20110104_wk01.zip, please?
- @MattH I had not even thought of using "Copy shortcut" on the hyperlink! Thanks.
- Your code is clean. I downloaded and unzipped the zip archive you linked to and got three files: ipgb20110104.xml, ipgb20110104rpt.html and ipgb20110104lst.txt. I could not find the excerpt above in any of the three files. Where does your excerpt come from, and how did you know its content?
- @eyquem It is in the xml file; I just replaced the us-patent-grant node with root_node to make the structure clearer.
- Thanks. I had understood that line, but I wonder why I missed the other one; I searched too superficially.
- I think the key point is that the input file is a non-standard (not well-formed) XML file, specifically multiple XML documents in a single file. Does lxml support that?
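
Regarding the last comment: lxml also expects a single well-formed document per parse, so it cannot load the concatenated file in one go; the splitting step in xmlSplitter is what makes the approach work. A small sketch, again using "patents.xml" as a hypothetical name for the whole file:

from lxml import etree

# Parsing the whole concatenated file fails ("Extra content at the end of the
# document"), while a single split-off chunk parses fine.
try:
    etree.parse("patents.xml")                  # hypothetical file name
except (etree.XMLSyntaxError, IOError) as err:  # malformed content or missing file
    print(err)

chunk = (b'<?xml version="1.0" encoding="UTF-8"?>'
         b'<root_node><invention-title>Example</invention-title></root_node>')
doc = etree.XML(chunk)                          # one document produced by the splitter
print(doc.xpath('//invention-title/text()'))    # -> ['Example']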
D0629996
29316765
D471343
D475175
6715152
D498899
D558952
D571528
D577177
D584027
.... (lots more)...
Sample output (the loop stops after the first ten documents):

DocID:    US-D0629996-S1-20110104
Title:    Glove backhand
Assignee: Blackhawk Industries Product Group Unlimited LLC

DocID:    US-D0629997-S1-20110104
Title:    Belt sleeve
Assignee: None

DocID:    US-D0629998-S1-20110104
Title:    Underwear
Assignee: X-Technology Swiss GmbH

DocID:    US-D0629999-S1-20110104
Title:    Portion of compression shorts
Assignee: Nike, Inc.

DocID:    US-D0630000-S1-20110104
Title:    Apparel
Assignee: None

DocID:    US-D0630001-S1-20110104
Title:    Hooded shirt
Assignee: None

DocID:    US-D0630002-S1-20110104
Title:    Hooded shirt
Assignee: None

DocID:    US-D0630003-S1-20110104
Title:    Hooded shirt
Assignee: None

DocID:    US-D0630004-S1-20110104
Title:    Headwear cap
Assignee: None

DocID:    US-D0630005-S1-20110104
Title:    Footwear
Assignee: Vibram S.p.A.