Extracting data from XML in Python

python python-3.x xml xml-parsing urllib

I'm trying to get values from this web page:

This XML file does not appear to have any style information associated with it. The document tree is shown below.
<ArrayOfVwHistoryDetail xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://tempuri.org/">
<vwHistoryDetail>
<idVariable>2561</idVariable>
<DateTime>2020-12-01T00:00:00</DateTime>
<idPeriodType>1</idPeriodType>
<Value>28671555</Value>
<ValueDetail>4415</ValueDetail>
</vwHistoryDetail>
<vwHistoryDetail>
<idVariable>2561</idVariable>
<DateTime>2020-12-02T00:00:00</DateTime>
<idPeriodType>1</idPeriodType>
<Value>28675970</Value>
<ValueDetail>4279</ValueDetail>
</vwHistoryDetail>
<vwHistoryDetail>
<idVariable>2561</idVariable>
<DateTime>2020-12-03T00:00:00</DateTime>
<idPeriodType>1</idPeriodType>
<Value>28680249</Value>
<ValueDetail>3975</ValueDetail>
</vwHistoryDetail>
<vwHistoryDetail>
<idVariable>2561</idVariable>
<DateTime>2020-12-04T00:00:00</DateTime>
<idPeriodType>1</idPeriodType>
<Value>28684224</Value>
<ValueDetail>4236</ValueDetail>
</vwHistoryDetail>
</ArrayOfVwHistoryDetail>

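For security reasons I've hidden the full URL. What am I doing wrong that I'm not getting the values? I need to get this part working so that I can build a dictionary of DateTime: Value pairs.

Thanks in advance.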

tree = ET.fromstring(data)
for detail in tree.findall('vwHistoryDetail'):
  v = detail.find('Value').text
  print(v)

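It's better to loop over the record objects and extract their child elements than to grab the child elements directly, because Value may be a tag that is reused in other parts of the document.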

import xml.etree.ElementTree as ET
import re

#
xml = '''<ArrayOfVwHistoryDetail xmlns:xsd="http://www.w3.org/2001/XMLSchema"
                                 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                                 xmlns="http://tempuri.org/">
   <vwHistoryDetail>
      <idVariable>2561</idVariable>
      <DateTime>2020-12-01T00:00:00</DateTime>
      <idPeriodType>1</idPeriodType>
      <Value>28671555</Value>
      <ValueDetail>4415</ValueDetail>
   </vwHistoryDetail>
   <vwHistoryDetail>
      <idVariable>2561</idVariable>
      <DateTime>2020-12-02T00:00:00</DateTime>
      <idPeriodType>1</idPeriodType>
      <Value>28675970</Value>
      <ValueDetail>4279</ValueDetail>
   </vwHistoryDetail>
   <vwHistoryDetail>
      <idVariable>2561</idVariable>
      <DateTime>2020-12-03T00:00:00</DateTime>
      <idPeriodType>1</idPeriodType>
      <Value>28680249</Value>
      <ValueDetail>3975</ValueDetail>
   </vwHistoryDetail>
   <vwHistoryDetail>
      <idVariable>2561</idVariable>
      <DateTime>2020-12-04T00:00:00</DateTime>
      <idPeriodType>1</idPeriodType>
      <Value>28684224</Value>
      <ValueDetail>4236</ValueDetail>
   </vwHistoryDetail>
</ArrayOfVwHistoryDetail>'''
xml = re.sub(' xmlns="[^"]+"', '', xml, count=1)
root = ET.fromstring(xml)
data = {v.find('DateTime').text: v.find('Value').text for v in root.findall('.//vwHistoryDetail')}
print(data)
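
As a possible alternative to stripping the namespace with a regular expression, ElementTree in Python 3.8 and later accepts a `{*}` wildcard that matches a tag in any namespace. The sketch below assumes Python 3.8+ and uses a shortened two-record copy of the XML (shortened here only for brevity); it leaves the XML text untouched.

```python
import xml.etree.ElementTree as ET

# Shortened two-record copy of the XML from the question (assumption: brevity only).
SAMPLE = '''<ArrayOfVwHistoryDetail xmlns="http://tempuri.org/">
  <vwHistoryDetail>
    <DateTime>2020-12-01T00:00:00</DateTime>
    <Value>28671555</Value>
  </vwHistoryDetail>
  <vwHistoryDetail>
    <DateTime>2020-12-02T00:00:00</DateTime>
    <Value>28675970</Value>
  </vwHistoryDetail>
</ArrayOfVwHistoryDetail>'''

root = ET.fromstring(SAMPLE)

# "{*}" matches the tag name in any namespace (or none); requires Python 3.8+.
data = {
    v.find('{*}DateTime').text: v.find('{*}Value').text
    for v in root.findall('.//{*}vwHistoryDetail')
}
print(data)
```

The trade-off is that the wildcard ignores which namespace a tag belongs to, which is fine for a document with a single default namespace like this one.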

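Several issues arise in your current implementation:

1. Your XML contains a default namespace, `xmlns="http://tempuri.org/"`, which requires you to define a prefix in order to address nodes by name.
2. `findall` must be given that prefix mapping through its `namespaces` argument.
3. Your path expression assumes `Value` is a child of root. You need the double-slash path, `.//`, because `Value` is a descendant of root rather than a direct child.
4. You need to extract the `.text` of the iterator variable. Otherwise you get back `Element` objects, which are usually not useful for end-use needs.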
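
To make the namespace issue concrete, here is a minimal sketch that prints how ElementTree stores tag names when a default namespace is declared. It uses a shortened two-record copy of the XML (an assumption made for brevity); the prefix name `doc` is arbitrary.

```python
import xml.etree.ElementTree as ET

# Shortened two-record copy of the XML from the question (assumption: brevity only).
SAMPLE = '''<ArrayOfVwHistoryDetail xmlns="http://tempuri.org/">
  <vwHistoryDetail>
    <DateTime>2020-12-01T00:00:00</DateTime>
    <Value>28671555</Value>
  </vwHistoryDetail>
  <vwHistoryDetail>
    <DateTime>2020-12-02T00:00:00</DateTime>
    <Value>28675970</Value>
  </vwHistoryDetail>
</ArrayOfVwHistoryDetail>'''

root = ET.fromstring(SAMPLE)

# Tags are stored in "{namespace-uri}localname" form.
print(root.tag)     # {http://tempuri.org/}ArrayOfVwHistoryDetail
print(root[0].tag)  # {http://tempuri.org/}vwHistoryDetail

# An unqualified name therefore matches nothing.
unqualified = root.findall('vwHistoryDetail')
print(len(unqualified))  # 0

# A prefix mapped to the namespace URI matches both records.
nmsp = {'doc': 'http://tempuri.org/'}
qualified = root.findall('doc:vwHistoryDetail', namespaces=nmsp)
print(len(qualified))  # 2
```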


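Consider this adjustment:
tree = ET.fromstring(data)
nmsp = {'doc': 'http://tempuri.org/'}  # namespace prefix assignment
results = tree.findall('.//doc:Value', namespaces=nmsp)  # namespace prefix used with the './/' path
for i in results:
    print(i.text)  # retrieve the text value
# 28671555
# 28675970
# 28680249
# 28684224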

Even better, return a list of dictionaries holding `Value` and its siblings, using a list/dict comprehension (where `split` removes the default namespace from the dict keys):
data_list = [{i.tag.split('}')[-1]: i.text for i in hd}
             for hd in tree.findall('.//doc:vwHistoryDetail', namespaces=nmsp)]
print(data_list)
# [{'idVariable': '2561', 'DateTime': '2020-12-01T00:00:00', 'idPeriodType': '1', 'Value': '28671555', 'ValueDetail': '4415'},
#  {'idVariable': '2561', 'DateTime': '2020-12-02T00:00:00', 'idPeriodType': '1', 'Value': '28675970', 'ValueDetail': '4279'},
#  {'idVariable': '2561', 'DateTime': '2020-12-03T00:00:00', 'idPeriodType': '1', 'Value': '28680249', 'ValueDetail': '3975'},
#  {'idVariable': '2561', 'DateTime': '2020-12-04T00:00:00', 'idPeriodType': '1', 'Value': '28684224', 'ValueDetail': '4236'}]

For a dictionary of values keyed by time:

time_value_dict = {hd.find('doc:DateTime', namespaces=nmsp).text:
                   hd.find('doc:Value', namespaces=nmsp).text
                   for hd in tree.findall('.//doc:vwHistoryDetail', namespaces=nmsp)}
print(time_value_dict)
# {'2020-12-01T00:00:00': '28671555',
#  '2020-12-02T00:00:00': '28675970',
#  '2020-12-03T00:00:00': '28680249',
#  '2020-12-04T00:00:00': '28684224'}
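
The dictionary above holds strings on both sides. If typed values are more convenient downstream, one possible extension (not part of the original answer) is to convert the keys with `datetime.fromisoformat` and the values with `int`. The sketch assumes Python 3.7+ and a shortened two-record copy of the XML.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Shortened two-record copy of the XML from the question (assumption: brevity only).
SAMPLE = '''<ArrayOfVwHistoryDetail xmlns="http://tempuri.org/">
  <vwHistoryDetail>
    <DateTime>2020-12-01T00:00:00</DateTime>
    <Value>28671555</Value>
  </vwHistoryDetail>
  <vwHistoryDetail>
    <DateTime>2020-12-02T00:00:00</DateTime>
    <Value>28675970</Value>
  </vwHistoryDetail>
</ArrayOfVwHistoryDetail>'''

nmsp = {'doc': 'http://tempuri.org/'}
tree = ET.fromstring(SAMPLE)

# Keys become datetime objects, values become integers.
typed = {
    datetime.fromisoformat(hd.find('doc:DateTime', namespaces=nmsp).text):
        int(hd.find('doc:Value', namespaces=nmsp).text)
    for hd in tree.findall('.//doc:vwHistoryDetail', namespaces=nmsp)
}

for when, value in typed.items():
    print(when.date(), value)
```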

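I don't think you need the HTML of the page. If you use the `requests` library, you can get the data with `requests.get(url).content`. Note that you have to install requests via pip or a similar tool. The XML may not parse correctly if it starts with the "This XML file does not appear to have…" line. When I print `requests.get(url).content`, I get the following: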

b'\r\n\r\n\r\n 2561\r\n 2020-12-01T00:00:00\r\n 1\r\n 28671555\r\n 4415\r\n\r\n'
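
Since the question is also tagged urllib, here is a rough standard-library sketch of the same idea. The names `parse_history` and `fetch_history` are hypothetical helpers, and the real URL is unknown, so only the parsing step is exercised, on bytes shaped like a server response with `\r\n` line endings.

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen

NMSP = {'doc': 'http://tempuri.org/'}


def parse_history(payload):
    """Parse raw XML bytes (or str) into a {DateTime: Value} dict."""
    root = ET.fromstring(payload)  # fromstring accepts bytes as well as str
    return {
        hd.findtext('doc:DateTime', namespaces=NMSP):
            hd.findtext('doc:Value', namespaces=NMSP)
        for hd in root.findall('doc:vwHistoryDetail', namespaces=NMSP)
    }


def fetch_history(url):
    """Download the XML from `url` (hypothetical endpoint) and parse it."""
    with urlopen(url) as response:
        return parse_history(response.read())


# Offline check: whitespace such as \r\n between elements does not affect parsing.
payload = (
    b'<ArrayOfVwHistoryDetail xmlns="http://tempuri.org/">\r\n'
    b'  <vwHistoryDetail>\r\n'
    b'    <DateTime>2020-12-01T00:00:00</DateTime>\r\n'
    b'    <Value>28671555</Value>\r\n'
    b'  </vwHistoryDetail>\r\n'
    b'</ArrayOfVwHistoryDetail>\r\n'
)
print(parse_history(payload))
```

Fetching the raw response this way avoids the browser's "This XML file does not appear to have…" banner, which is added by the browser's XML viewer and is not part of the document itself.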

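I still don't get any values.

That looks correct. XML doesn't care about whitespace such as \r\n; those are just line breaks.

I tested your suggestion, but it still doesn't print any values. Thanks anyway.

Any compliant DOM library should handle default namespaces without needing to strip them from the tree.

Thanks @Parfait. Following @balderman's suggestion, I tried changing the code to get only the DateTime: Value pairs:

tree = ET.fromstring(data)
nmsp = {'doc': 'http://tempuri.org/'}  # namespace prefix assignment
results = tree.findall('.//doc:Value', namespaces=nmsp)  # namespace prefix used with the './/' path
DataTimeValue_dict = [{i.find('DateTime').text: i.find('Value').text for i in hd} for hd in tree.findall('.//doc:vwHistoryDetail', namespaces=nmsp)]

Output:
File "g:/My Drive/Projectos/Python/teste/get.py", line 19
DataTimeValue_dict = [{i.find('DateTime').text: i.find('Value').text for i in hd}
File "g:/My Drive/Projectos/Python/teste/get.py", line 19
DataTimeValue_dict = [{i.find('DateTime').text: i.find('Value').text for i in hd}
File "g:/My Drive/Projectos/Python/teste/get.py", line 19
DataTimeValue_dict = [{i.find('DateTime').text: i.find('Value').text for i in hd}
AttributeError: 'NoneType' object has no attribute 'text'
PS g:\My Drive\Projectos\Python\teste>

@NunoFélix, your dictionary comprehension is incorrect. Try:

d = {hd.find('./doc:DateTime', namespaces=nmsp).text: hd.find('./doc:Value', namespaces=nmsp).text for hd in tree.findall('.//doc:vwHistoryDetail', namespaces=nmsp)}

You need to pass namespaces in .find just as you did in .findall. See the edit and the updated demo.

Thanks @Parfait and everyone who helped me. It finally works the way I was looking for. I've only just started learning Python on Udemy, so I make a lot of newbie mistakes.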
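
The AttributeError in the comments comes from `find` returning `None` when nothing matches. As a small illustration (a sketch on a shortened two-record copy of the XML, with a made-up tag name `NoSuchTag` for the missing case), the following shows the unqualified lookup failing, the qualified lookup succeeding, and `findtext` with a default as a way to avoid the exception.

```python
import xml.etree.ElementTree as ET

# Shortened two-record copy of the XML from the question (assumption: brevity only).
SAMPLE = '''<ArrayOfVwHistoryDetail xmlns="http://tempuri.org/">
  <vwHistoryDetail>
    <DateTime>2020-12-01T00:00:00</DateTime>
    <Value>28671555</Value>
  </vwHistoryDetail>
  <vwHistoryDetail>
    <DateTime>2020-12-02T00:00:00</DateTime>
    <Value>28675970</Value>
  </vwHistoryDetail>
</ArrayOfVwHistoryDetail>'''

nmsp = {'doc': 'http://tempuri.org/'}
tree = ET.fromstring(SAMPLE)
first = tree.find('doc:vwHistoryDetail', namespaces=nmsp)

# Without the prefix and the namespaces mapping, nothing matches,
# so calling .text on the result would raise AttributeError.
print(first.find('DateTime'))  # None

# The qualified lookup returns the element.
print(first.find('doc:DateTime', namespaces=nmsp).text)

# findtext returns a default instead of raising when a child is missing.
# 'NoSuchTag' is a hypothetical name used only to trigger the default.
missing = first.findtext('doc:NoSuchTag', default='n/a', namespaces=nmsp)
print(missing)
```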

Output of the `re.sub`-based snippet:

{'2020-12-01T00:00:00': '28671555', '2020-12-02T00:00:00': '28675970', '2020-12-03T00:00:00': '28680249', '2020-12-04T00:00:00': '28684224'}