The most efficient way to extract data from XML using lxml in Python

Tags: python, xml, xpath, lxml, python-3.3

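Below is a snippet of a large XML file. I want to extract the elements that belong to a specific namespace, for example xmlns:dc="http://purl.org/dc/elements/1.1/". Currently, I am able to do the following: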

from lxml import etree

tree = etree.parse(file)
# iter() is the current name for the deprecated getiterator()
for element in tree.iter('{http://www.openarchives.org/OAI/2.0/}record'):
    for leaf in element.iter('{http://purl.org/dc/elements/1.1/}subject'):
        print(leaf)

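The problem is that I want to get the data from several tags within a given {} namespace. I would also like to simplify things and have been looking into XPath, but I can't seem to get my head around it. Can I use XPath here? If so, how, and more importantly, is it better suited to my goal?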

Here is the XML:

<?xml version="1.0" encoding="UTF-8" ?>



<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
 http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2013-08-15T23:24:55Z</responseDate>
<request verb="ListRecords" resumptionToken="0/500/121403/nsdl_dc/null/null/null">http://nsdldev.org/oai</request>

<!-- Showing records 501 through 1000 out of 121403 total  -->

<ListRecords>


  <record>
    <header>
      <identifier>oai:nsdl.org:2200/20110926115158975T</identifier>
      <datestamp>2013-05-29T16:44:49Z</datestamp>
       <setSpec>ncs-NSDL-COLLECTION-000-003-112-056</setSpec>
      </header>
    <metadata>
    <nsdl_dc:nsdl_dc xmlns:nsdl_dc="http://ns.nsdl.org/nsdl_dc_v1.02/"
                 xmlns:dc="http://purl.org/dc/elements/1.1/"
                 xmlns:dct="http://purl.org/dc/terms/"
                 xmlns:lar="http://ns.nsdl.org/schemas/dc/lar"
                 xmlns:ieee="http://www.ieee.org/xsd/LOMv1p0"
                 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                 schemaVersion="1.02.020"
                 xsi:schemaLocation="http://ns.nsdl.org/nsdl_dc_v1.02/ http://ns.nsdl.org/schemas/nsdl_dc/nsdl_dc_v1.02.xsd">
   <lar:readiness xsi:type="lar:Ready">Fully ready</lar:readiness>
   <dc:identifier xsi:type="dct:URI">http://www.exo.net/~emuller/activities/Hot%20Sauce%20Hot%20Spots.pdf</dc:identifier>
   <dc:relation xsi:type="nsdl_dc:NSDLPartnerURL">http://howtosmile.org/record/4427</dc:relation>
   <dc:title>Hot Sauce Hot Spots</dc:title>
   <dc:description>In this activity, learners model hot spot island formation, orientation and progression with condiments. Learners squirt a thick condiment sauce on a coarsely woven fabric to model how volcanic island hot spots form.</dc:description>
   <dc:subject>Geoscience</dc:subject>
   <dc:subject>Earth system science</dc:subject>
   <dc:subject>Geoscience</dc:subject>
   <dc:subject>Earth system science</dc:subject>
   <dc:subject>Geoscience</dc:subject>
   <dc:subject>Geoscience</dc:subject>
   <dc:subject>Geoscience</dc:subject>
   <dc:subject>Oceanography</dc:subject>
   <dc:subject>Earth system science</dc:subject>
   <dc:subject>Geoscience</dc:subject>
   <dc:subject>Anthropology</dc:subject>
   <dc:subject>Physical science</dc:subject>
   <dc:subject>Physics</dc:subject>
   <dc:subject>General science</dc:subject>
   <dc:subject>hot spot island</dc:subject>
   <dc:subject>volcano</dc:subject>
   <dc:subject>tectonic plates</dc:subject>
   <dc:subject>Earth</dc:subject>
   <dc:subject>molten</dc:subject>
   <dc:subject>magma</dc:subject>
   <dc:subject>eruption</dc:subject>
   <dc:subject>undersea</dc:subject>
   <dc:subject>ocean</dc:subject>
   <dc:subject>island</dc:subject>
   <dc:subject>Earth Processes</dc:subject>
   <dc:subject>Volcanoes and Plate Tectonics</dc:subject>
   <dc:subject>Earth Structure</dc:subject>
   <dc:subject>Rocks and Minerals</dc:subject>
   <dc:subject>Oceans and Water</dc:subject>
   <dc:subject>Geologic Time</dc:subject>
   <dc:subject>Heat and Temperature</dc:subject>
   <dc:subject>Conducting Investigations</dc:subject>
   <dc:language>en-US</dc:language>
   <dc:format>application/pdf</dc:format>
   <lar:accessMode xsi:type="lar:ModeAcc">visual</lar:accessMode>
   <lar:accessMode xsi:type="lar:ModeAcc">tactile</lar:accessMode>
   <dct:educationLevel xsi:type="nsdl_dc:NSDLEdLevel">Upper Elementary</dct:educationLevel>
   <dct:educationLevel xsi:type="nsdl_dc:NSDLEdLevel">Middle School</dct:educationLevel>
   <dct:educationLevel xsi:type="nsdl_dc:NSDLEdLevel">High School</dct:educationLevel>
   <dct:educationLevel xsi:type="nsdl_dc:NSDLEdLevel">Informal Education</dct:educationLevel>
   <dct:audience xsi:type="nsdl_dc:NSDLAudience">Learner</dct:audience>
   <dc:type xsi:type="nsdl_dc:NSDLType">Activity</dc:type>
   <dc:type xsi:type="nsdl_dc:NSDLType">Model</dc:type>
   <dct:isPartOf>http://www.exo.net/~emuller/activities/index.html</dct:isPartOf>
   <dc:date xsi:type="dct:W3CDTF">2007</dc:date>
   <dc:creator>Eric Muller</dc:creator>
   <dc:contributor>The Exploratorium</dc:contributor>
   <dct:accessRights xsi:type="nsdl_dc:NSDLAccess">Free access</dct:accessRights>
   <dc:rights>Copyright 2007 Do Science</dc:rights>
   <dct:license>Owner license</dct:license>
   <lar:licenseProperty xsi:type="lar:LicProp">Terms of use unknown</lar:licenseProperty>
   <dct:rightsHolder>Do Science</dct:rightsHolder>
   <lar:metadataTerms>The following entity, University Corporation for Atmospheric Research (UCAR), has claims on the use of this metadata. This claim is as follows: The National Science Digital Library (NSDL), located at the University Corporation for Atmospheric Research (UCAR), provides these metadata terms: These data and metadata may not be reproduced, duplicated, copied, sold, or otherwise exploited for any commercial purpose that is not expressly permitted by NSDL. The entity provided more information at: http://nsdl.org/help/terms-of-use</lar:metadataTerms>
   <lar:metadataTerms>The National Science Digital Library (NSDL), located at the University Corporation for Atmospheric Research (UCAR), provides these metadata terms: These data and metadata may not be reproduced, duplicated, copied, sold, or otherwise exploited for any commercial purpose that is not expressly permitted by NSDL. More information is available at: http://nsdl.org/help/terms-of-use.</lar:metadataTerms>
</nsdl_dc:nsdl_dc>

    </metadata>
  </record>



It's not clear exactly what you want to access, but try the following:

from lxml import etree

doc = etree.parse(xmlfile)  # xmlfile: path to the XML document
ns = {'dc': 'http://purl.org/dc/elements/1.1/',
      'oai': 'http://www.openarchives.org/OAI/2.0/'}

doc.xpath('//dc:subject', namespaces=ns)  # all dc:subject elements
doc.xpath('//dc:*', namespaces=ns)        # all elements in the dc: namespace

# a more specific path
doc.xpath('/oai:OAI-PMH/oai:ListRecords/oai:record/oai:metadata/*/dc:*', namespaces=ns)

# every prefix used in an expression must be supplied through namespaces=
x = doc.xpath('/oai:OAI-PMH/oai:ListRecords/oai:record/oai:metadata/*', namespaces=ns)
x[0].xpath('*[contains(., "Geo")]')             # xpath() can also be called on element nodes
x[0].xpath('dc:subject/text()', namespaces=ns)  # text of the dc:subject elements
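
To make the calls above easier to try out, here is a minimal, self-contained sketch that collects every dc: element of each record into a dictionary. The inline SAMPLE document is a made-up, trimmed version of the file in the question, and the helper name dc_fields is only an illustration, not part of lxml.

```python
from collections import defaultdict
from lxml import etree

# Made-up, trimmed sample with the same shape as the OAI-PMH file in the question.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <nsdl_dc:nsdl_dc xmlns:nsdl_dc="http://ns.nsdl.org/nsdl_dc_v1.02/"
                         xmlns:dc="http://purl.org/dc/elements/1.1/"
                         xmlns:dct="http://purl.org/dc/terms/">
          <dc:title>Hot Sauce Hot Spots</dc:title>
          <dc:subject>Geoscience</dc:subject>
          <dc:subject>Oceanography</dc:subject>
          <dct:audience>Learner</dct:audience>
        </nsdl_dc:nsdl_dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

NS = {'dc': 'http://purl.org/dc/elements/1.1/',
      'oai': 'http://www.openarchives.org/OAI/2.0/'}


def dc_fields(record):
    """Group the text of all dc: descendants of one record by local tag name."""
    fields = defaultdict(list)
    # A leading '.' keeps the search inside this record.
    for el in record.xpath('.//dc:*', namespaces=NS):
        fields[etree.QName(el).localname].append(el.text)
    return dict(fields)


root = etree.fromstring(SAMPLE)
result = [dc_fields(rec) for rec in root.xpath('//oai:record', namespaces=NS)]
print(result)
```

Because the default namespace of the file is the OAI one, the record elements have to be addressed with a prefix of your own choosing (oai here); an unprefixed name in XPath 1.0 only matches elements in no namespace.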

Read some XPath documentation beyond the Python or lxml docs. Those tell you how to use XPath from Python, but they are not really XPath tutorials.

Note that the find() and findall() methods accept only a limited subset of XPath-like expressions.
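
As a rough illustration of that difference, the sketch below runs findall() and xpath() against a small, made-up fragment. It assumes nothing beyond lxml itself; the SAMPLE data is invented for the example.

```python
from lxml import etree

# Made-up fragment; only the dc: namespace URI is taken from the question.
SAMPLE = b"""<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Hot Sauce Hot Spots</dc:title>
  <dc:subject>Geoscience</dc:subject>
  <dc:subject>Geologic Time</dc:subject>
  <dc:subject>volcano</dc:subject>
</metadata>"""

NS = {'dc': 'http://purl.org/dc/elements/1.1/'}
root = etree.fromstring(SAMPLE)

# findall() copes with simple child/descendant paths and returns elements.
subjects = [el.text for el in root.findall('dc:subject', namespaces=NS)]

# The {uri}name notation from the question works without a prefix map.
titles = [el.text for el in root.findall('{http://purl.org/dc/elements/1.1/}title')]

# Functions such as contains() and the text() node test need full XPath.
geo = root.xpath('dc:subject[contains(., "Geo")]/text()', namespaces=NS)

print(subjects, titles, geo)
```

In short, findall() is enough when you only need to walk down by tag name; reach for xpath() when the selection depends on functions, text nodes, or more elaborate predicates.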


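Any idea how to do this using XPath? The argument to findall is an XPath expression, isn't it?

Not sure, to be honest. If I put the code into xpath(), it fails. I have been reading about using xpath('/foo/bar'). Expressions like .xpath('/*/*/*/*/*/*/*/*/*/*/*/*/*') get me to the level I want, but when I replace parts of it with what I think are the node names, it still doesn't work.

It is impossible to understand what you want. Please state clearly the result you want to achieve.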

This works, thanks. However, XPath seems slow when processing large files. Can you explain why? Iterating is much faster.
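
For large files, one option worth trying is to stream the document instead of querying the fully parsed tree. The sketch below is only that, a sketch: it uses lxml's iterparse() with a tag filter, picks up every dc: element of a record with a namespace wildcard, and clears each record once it has been processed. The SAMPLE data and the function name iter_dc_records are made up for the example, and no timing claim is implied; measure on your own file.

```python
import io
from lxml import etree

OAI = 'http://www.openarchives.org/OAI/2.0/'
DC = 'http://purl.org/dc/elements/1.1/'

# Made-up two-record sample in the shape of the file from the question.
SAMPLE = b"""<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <ListRecords>
    <record><metadata>
      <dc:title>Hot Sauce Hot Spots</dc:title>
      <dc:subject>Geoscience</dc:subject>
      <dc:subject>volcano</dc:subject>
    </metadata></record>
    <record><metadata>
      <dc:title>Second record</dc:title>
      <dc:subject>Physics</dc:subject>
    </metadata></record>
  </ListRecords>
</OAI-PMH>"""


def iter_dc_records(source):
    """Yield one {local name: [text, ...]} dict per oai:record, streaming."""
    for _event, record in etree.iterparse(
            source, events=('end',), tag='{%s}record' % OAI):
        fields = {}
        # '{namespace}*' matches every element in that namespace.
        for el in record.iter('{%s}*' % DC):
            fields.setdefault(etree.QName(el).localname, []).append(el.text)
        yield fields
        record.clear()  # drop the children of the record we are done with


rows = list(iter_dc_records(io.BytesIO(SAMPLE)))
print(rows)
```

With a real file you would pass the file name (or an open binary file) instead of the BytesIO object. The same '{namespace}*' wildcard also works with iter() on a tree loaded by etree.parse(), which keeps the iteration approach from the question while covering several tags at once.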