How to parse complex XML with Python


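I am converting XML files to CSV or pandas. The XML contains various categories that I need, along with others that I do not. Is there an efficient way to pick out the information from documents in the format shown below? This has to be done at a relatively large scale, more than 10,000 documents. For example, I would like to get the "family-id", the "data", and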


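US
20030137706
A1
20030724
US
18203002
A
20021204
HU
0000532
A
20000207
TECHNICAL FIELD
[0001] The object of the invention is a method for the holographic recording of data. In the method a hologram containing the date is recorded in a waveguide layer as an interference between an object beam and a reference beam. The object beam is essentially perpendicular to the plane of the hologram, while the reference beam is coupled in the waveguide. There is also proposed an apparatus for performing the method. The apparatus comprises a data storage medium with a waveguide holographic storage layer, and an optical system for writing and reading the holograms. The optical system comprises means for producing an object beam and a reference beam, and imaging the object beam and a reference beam on the storage medium.

BACKGROUND ART
[0002] Storage systems realised with tapes stand out from other data storage systems regarding their immense storage capacity. Such systems were used to realise the storage of data in the order of Terabytes. This large storage capacity is achieved partly by the storage density, and partly by the length of the storage tapes. The relative space requirements of tapes are small, because they may be wound up into a very small volume. Their disadvantage is the relatively large random access time.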


I strongly recommend the excellent lxml library! It is very fast because it is a wrapper around the C libraries libxml2 and libxslt.

Usage example:

import lxml.etree  

text = '''\
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE patent-document\n  PUBLIC "-//MXW//DTD patent-document XML//EN" 
"http://www.ir-facility.org/dtds/patents/v1.4/patent-document.dtd">
<patent-document ucid="US-20030137706-A1" country="US" doc-number="20030137706" kind="A1" lang="EN" family-id="10973265" status="new" 
date-produced="20090605" date="20030724">
  <bibliographic-data>
    <publication-reference ucid="US-20030137706-A1" status="new" 
     fvid="76030147">
  <document-id status="new" format="original">
    <country>US</country>
    <doc-number>20030137706</doc-number>
    <kind>A1</kind>
    <date>20030724</date>
  </document-id>
</publication-reference>
<application-reference ucid="US-18203002-A" status="new" is-representative="NO">
  <document-id status="new" format="epo">
    <country>US</country>
    <doc-number>18203002</doc-number>
    <kind>A</kind>
    <date>20021204</date>
  </document-id>
</application-reference>
<priority-claims status="new">
  <priority-claim ucid="HU-0000532-A" status="new">
    <document-id status="new" format="epo">
      <country>HU</country>
      <doc-number>0000532</doc-number>
      <kind>A</kind>
      <date>20000207</date>
    </document-id>
  </priority-claim>
  <description load-source="us" status="new" lang="EN">
     <heading>TECHNICAL FIELD </heading>
     <p>[0001] The object of the invention is a method for the holographic 
     recording of data. In the method a hologram containing the date is 
     recorded in a waveguide layer as an interference between an object beam 
     and a reference beam. The object beam is essentially perpendicular to 
     the plane of the hologram, while the reference beam is coupled in the 
     waveguide. There is also proposed an apparatus for performing the 
     method. The apparatus comprises a data storage medium with a waveguide 
     holographic storage layer, and an optical system for writing and reading 
     the holograms. The optical system comprises means for producing an 
     object beam and a reference beam, and imaging the object beam and a 
     reference beam on the storage medium. </p>
     <heading>BACKGROUND ART </heading>
      <p>[0002] Storage systems realised with tapes stand out from other data 
      storage systems regarding their immense storage capacity. Such systems 
      were used to realise the storage of data in the order of Terabytes. 
      This large storage capacity is achieved partly by the storage density, 
      and partly by the length of the storage tapes. The relative space 
      requirements of tapes are small, because they may be wound up into a 
      very small volume. Their disadvantage is the relatively large random 
      access time. </p>
  </description>
</priority-claims>
</bibliographic-data>
</patent-document>
'''.encode('utf-8') # the library wants bytes so we encode
#  ^^ you don't need this if reading from a file

doc = lxml.etree.fromstring(text)

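Comments:

I suggest taking a look at this.

Does it also work for the paragraphs below, including the TECHNICAL FIELD heading?

Yes, it parses the whole document.
doc.xpath('//patent-document/bibliographic-data/priority-claims/description/heading/text()')
will return
['TECHNICAL FIELD ', 'BACKGROUND ART ']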
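To get the paragraphs as well as the headings, one option is to walk the children of `<description>` in document order and group each `<p>` under the nearest preceding `<heading>`. The following is only a sketch under some assumptions: the sample is a shortened stand-in for the document above, `sections` and `sample_doc` are names made up for this illustration, and the import falls back to the standard library's `xml.etree.ElementTree` (which shares the calls used here) when lxml is not installed.

```python
try:
    from lxml import etree  # the library recommended above
except ImportError:  # fallback so the sketch still runs without lxml
    import xml.etree.ElementTree as etree

# Shortened stand-in for the patent document shown above (assumption).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<patent-document family-id="10973265" date="20030724">
  <description lang="EN">
    <heading>TECHNICAL FIELD </heading>
    <p>[0001] The object of the invention is a method for the holographic
    recording of data.</p>
    <heading>BACKGROUND ART </heading>
    <p>[0002] Storage systems realised with tapes stand out from other data
    storage systems.</p>
  </description>
</patent-document>
"""


def sections(description):
    """Group the text of each <p> under the nearest preceding <heading>."""
    result = {}
    current = None
    for child in description:
        if not isinstance(child.tag, str):
            continue  # skip comments and processing instructions
        # Join all nested text and collapse line breaks and repeated spaces.
        text = " ".join("".join(child.itertext()).split())
        if child.tag == "heading":
            current = text
            result[current] = []
        elif child.tag == "p" and current is not None:
            result[current].append(text)
    return result


sample_doc = etree.fromstring(SAMPLE)
# ".//description" finds the element wherever it is nested.
parsed = sections(sample_doc.find(".//description"))
for heading, paragraphs in parsed.items():
    print(heading, "->", paragraphs)
```

Collapsing whitespace also removes the trailing space that the raw `text()` results carry, which tends to be more convenient once the values end up in a CSV column.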
>>> print(doc.xpath('//patent-document/@family-id'))
['10973265']
>>> print(doc.xpath('//patent-document/@date'))
['20030724']
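For the 10,000+ documents mentioned in the question, the same lookups can be wrapped in a function and streamed into a CSV file one document at a time, so only a single tree is in memory at any moment. This is a rough sketch, not a tuned implementation: `extract_record`, `xml_dir_to_csv`, the column names, the file names, and the second test record are all made up for illustration, and the import falls back to the standard library's `xml.etree.ElementTree` if lxml is missing.

```python
import csv
import tempfile
from pathlib import Path

try:
    from lxml import etree  # the library recommended above
except ImportError:  # fallback so the sketch still runs without lxml
    import xml.etree.ElementTree as etree

FIELDS = ["family_id", "date", "country", "doc_number", "headings"]


def extract_record(root):
    """Collect the wanted attributes and child values of one <patent-document>."""
    pub = "bibliographic-data/publication-reference/document-id/"
    headings = [" ".join((h.text or "").split()) for h in root.iter("heading")]
    return {
        "family_id": root.get("family-id"),
        "date": root.get("date"),
        "country": root.findtext(pub + "country"),
        "doc_number": root.findtext(pub + "doc-number"),
        "headings": "; ".join(headings),
    }


def xml_dir_to_csv(xml_dir, csv_path):
    """Write one CSV row per *.xml file in xml_dir and return the row count."""
    rows = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        for path in sorted(Path(xml_dir).glob("*.xml")):
            root = etree.parse(str(path)).getroot()  # parse straight from disk
            writer.writerow(extract_record(root))
            rows += 1
    return rows


# Tiny demo with two generated files; the second record uses made-up values.
TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<patent-document family-id="{family}" date="{date}">
  <bibliographic-data>
    <publication-reference>
      <document-id>
        <country>US</country>
        <doc-number>{number}</doc-number>
      </document-id>
    </publication-reference>
  </bibliographic-data>
  <description>
    <heading>TECHNICAL FIELD </heading>
    <heading>BACKGROUND ART </heading>
  </description>
</patent-document>
"""

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "a.xml").write_text(
        TEMPLATE.format(family="10973265", date="20030724", number="20030137706"),
        encoding="utf-8",
    )
    (tmp / "b.xml").write_text(
        TEMPLATE.format(family="00000000", date="19990101", number="00000001"),
        encoding="utf-8",
    )
    out = tmp / "patents.csv"
    count = xml_dir_to_csv(tmp, out)
    with open(out, newline="", encoding="utf-8") as fh:
        rows = list(csv.DictReader(fh))

print(count, rows[0])
```

If a DataFrame is wanted rather than a file, the resulting CSV can be loaded with `pandas.read_csv(path, dtype=str)`; reading everything as strings keeps leading zeros in values such as document numbers.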