Parsing a non-standard XML file in Python


My input file is actually multiple XML files appended into a single file (this is the file I have). Its structure looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>
<root_node>...</root_node>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>
<root_node>...</root_node>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>
<root_node>...</root_node>

...
...
...
Python's xml.dom.minidom cannot parse this non-standard file. What is a better way to parse it? I also don't know whether the code below has good performance:

for line in infile:
  if line == '<?xml version="1.0" encoding="UTF-8"?>': 
    xmldoc = minidom.parse(XMLstring)
  else:
    XMLstring += line
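
To see why xml.dom.minidom rejects the file as a whole, here is a minimal illustration; "patents.xml" is a hypothetical name standing in for the concatenated input:

from xml.dom import minidom

# A well-formed XML document allows at most one XML declaration and exactly one
# root element, so feeding the whole concatenated file to minidom fails with an
# ExpatError ("junk after document element").
try:
    minidom.parse("patents.xml")  # hypothetical name for the concatenated input
except Exception as err:
    print(err)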

I don't know minidom, and I don't know much about XML parsing either, but I use XPath to parse XML/HTML.

You can find some XPath examples online.
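
As a stand-in for those examples, here is a minimal XPath query with lxml; the element nesting below is an assumption made up for the illustration, loosely modelled on the us-patent-grant documents above:

from lxml import etree

# Build one small, well-formed document and query it with XPath.
doc = etree.fromstring(
    "<us-patent-grant>"
    "  <us-bibliographic-data-grant>"
    "    <invention-title>Glove backhand</invention-title>"
    "  </us-bibliographic-data-grant>"
    "</us-patent-grant>"
)
print(doc.xpath('//invention-title/text()'))  # -> ['Glove backhand']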

I chose to parse each XML chunk separately.

You appear to be doing that already in your sample code. Here is my take on your code:

def parse_xml_buffer(buffer):
    dom = minidom.parseString("".join(buffer))  # join list into string of XML
    # .... parse dom ...

buffer = [file.readline()]  # initialise with the first line
for line in file:
    if line.startswith("<?xml "):
        parse_xml_buffer(buffer)
        buffer = []  # reset buffer
    buffer.append(line)  # list operations are faster than concatenating strings
parse_xml_buffer(buffer)  # parse final chunk
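
For completeness, a minimal way to drive that loop, assuming the concatenated input is saved as "patents.xml" (a hypothetical name) and that parse_xml_buffer only needs to report each document's root element:

from xml.dom import minidom

def parse_xml_buffer(buffer):
    dom = minidom.parseString("".join(buffer))
    print(dom.documentElement.tagName)   # e.g. "us-patent-grant" (or "root_node" in the excerpt)

with open("patents.xml") as infile:      # hypothetical file name
    buffer = [infile.readline()]         # first line is the first <?xml ...?> declaration
    for line in infile:
        if line.startswith("<?xml "):
            parse_xml_buffer(buffer)     # a complete document has been buffered
            buffer = []
        buffer.append(line)
    parse_xml_buffer(buffer)             # the last document has no <?xml line after it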

Here is my take on it, using a generator and lxml.etree. Example extracted information is shown after the code:

import urllib2, os, zipfile
from lxml import etree

def xmlSplitter(data,separator=lambda x: x.startswith('<?xml')):
  buff = []
  for line in data:
    if separator(line):
      if buff:
        yield ''.join(buff)
        buff[:] = []
    buff.append(line)
  yield ''.join(buff)

def first(seq,default=None):
  """Return the first item from sequence, seq or the default(None) value"""
  for item in seq:
    return item
  return default

datasrc = "http://commondatastorage.googleapis.com/patents/grantbib/2011/ipgb20110104_wk01.zip"
filename = datasrc.split('/')[-1]

if not os.path.exists(filename):
  with open(filename,'wb') as file_write:
    r = urllib2.urlopen(datasrc)
    file_write.write(r.read())

zf = zipfile.ZipFile(filename)
xml_file = first([ x for x in zf.namelist() if x.endswith('.xml')])
assert xml_file is not None

count = 0
for item in xmlSplitter(zf.open(xml_file)):
  count += 1
  if count > 10: break
  doc = etree.XML(item)
  docID = "-".join(doc.xpath('//publication-reference/document-id/*/text()'))
  title = first(doc.xpath('//invention-title/text()'))
  assignee = first(doc.xpath('//assignee/addressbook/orgname/text()'))
  print "DocID:    {0}\nTitle:    {1}\nAssignee: {2}\n".format(docID,title,assignee)
Comments:

- I posted a version that uses a generator too, but it looks like you know them better than I do. +1
- @MattH How did you know the address http://commondatastorage.googleapis.com/patents/grantbib/2011/ipgb20110104_wk01.zip, please?
- @MattH I had not even thought of using "Copy shortcut" on the hyperlink! Thanks.
- Your code is clean. I downloaded and unzipped the zip archive you linked to and got three files: ipgb20110104.xml, ipgb20110104rpt.html and ipgb20110104lst.txt. I could not find the excerpt above in any of the three files. Where does your excerpt come from, and how did you know its content?
- @eyquem It is in the xml file; I just replaced the us-patent-grant node with root_node to make the structure clearer.
- Thanks. I had understood that line, but I wonder why I missed the other one; I searched too superficially.
- I think the key point is that the input file is a non-standard (not well-formed) XML file, specifically multiple XML documents in a single file. Does lxml support that?
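
Regarding the last comment: lxml also expects a single well-formed document per parse, so it cannot load the concatenated file in one go; the splitting step in xmlSplitter is what makes the approach work. A small sketch, again using "patents.xml" as a hypothetical name for the whole file:

from lxml import etree

# Parsing the whole concatenated file fails ("Extra content at the end of the
# document"), while a single split-off chunk parses fine.
try:
    etree.parse("patents.xml")                  # hypothetical file name
except (etree.XMLSyntaxError, IOError) as err:  # malformed content or missing file
    print(err)

chunk = (b'<?xml version="1.0" encoding="UTF-8"?>'
         b'<root_node><invention-title>Example</invention-title></root_node>')
doc = etree.XML(chunk)                          # one document produced by the splitter
print(doc.xpath('//invention-title/text()'))    # -> ['Example']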
D0629996
29316765
D471343
D475175
6715152
D498899
D558952
D571528
D577177
D584027
.... (lots more)...
Sample output (the loop stops after the first ten documents):

DocID:    US-D0629996-S1-20110104
Title:    Glove backhand
Assignee: Blackhawk Industries Product Group Unlimited LLC

DocID:    US-D0629997-S1-20110104
Title:    Belt sleeve
Assignee: None

DocID:    US-D0629998-S1-20110104
Title:    Underwear
Assignee: X-Technology Swiss GmbH

DocID:    US-D0629999-S1-20110104
Title:    Portion of compression shorts
Assignee: Nike, Inc.

DocID:    US-D0630000-S1-20110104
Title:    Apparel
Assignee: None

DocID:    US-D0630001-S1-20110104
Title:    Hooded shirt
Assignee: None

DocID:    US-D0630002-S1-20110104
Title:    Hooded shirt
Assignee: None

DocID:    US-D0630003-S1-20110104
Title:    Hooded shirt
Assignee: None

DocID:    US-D0630004-S1-20110104
Title:    Headwear cap
Assignee: None

DocID:    US-D0630005-S1-20110104
Title:    Footwear
Assignee: Vibram S.p.A.