How to extract data from XML/SOAP in Python
The UK National Gas system publishes a large amount of data that can be accessed from a SOAP server; an example of the returned data (for LNG) is shown below. I have written the code that builds the request and handles the response, but I am having trouble extracting the returned information. The goal is to load the data into a back-end database or a DataFrame. In earlier code I simply used XPath to walk the XML, then iterated over the tags and extracted the child data. So I would like to extract:
GetPublicationDataWMResult, ApplicableAt, ApplicableFor, Value, ...
LNG Stock Level,2016-03-13T15:00:07Z, 2016-03-12T00:00:00Z, 7050.42286, ...
LNG Capacity,2016-03-13T15:00:07Z, 2016-03-12T00:00:00Z, 6515042480, ...
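Attempts to traverse the children with XPath (/Envelope/Body/GetPublicationDataWMResponse/GetPublicationDataWMResult/) failed.
If I clean up the response with a series of string removals, the logic works, but that is suboptimal and is bound to break in the future.
Sample code: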
import requests
from lxml import objectify

def getXML():
    toDate = "2016-03-12"
    fromDate = "2016-03-12"
    dateType = "gasday"
    url = "http://marketinformation.natgrid.co.uk/MIPIws-public/public/publicwebservice.asmx"
    headers = {'content-type': 'application/soap+xml; charset=utf-8'}
    body = """<soap12:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:soap12="http://www.w3.org/2003/05/soap-envelope">
      <soap12:Body>
        <GetPublicationDataWM xmlns="http://www.NationalGrid.com/MIPI/">
          <reqObject>
            <LatestFlag>Y</LatestFlag>
            <ApplicableForFlag>Y</ApplicableForFlag>
            <ToDate>%s</ToDate>
            <FromDate>%s</FromDate>
            <DateType>%s</DateType>
            <PublicationObjectNameList>
              <string>LNG Stock Level</string>
              <string>LNG, Daily Aggregated Available Capacity, D+1</string>
            </PublicationObjectNameList>
          </reqObject>
        </GetPublicationDataWM>
      </soap12:Body>
    </soap12:Envelope>""" % (toDate, fromDate, dateType)
    response = requests.post(url, data=body, headers=headers)
    return response.content

root = objectify.fromstring(getXML())
Returned XML:
<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope
xmlns:soap="http://www.w3.org/2003/05/soap-envelope"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<soap:Body>
<GetPublicationDataWMResponse
xmlns="http://www.NationalGrid.com/MIPI/">
<GetPublicationDataWMResult>
<CLSMIPIPublicationObjectBE>
<PublicationObjectName>LNG Stock Level</PublicationObjectName>
<PublicationObjectData>
<CLSPublicationObjectDataBE>
<ApplicableAt>2016-03-13T15:00:07Z</ApplicableAt>
<ApplicableFor>2016-03-12T00:00:00Z</ApplicableFor>
<Value>7050.42286</Value>
<GeneratedTimeStamp>2016-03-13T15:56:00Z</GeneratedTimeStamp>
<QualityIndicator></QualityIndicator>
<Substituted>N</Substituted>
<CreatedDate>2016-03-13T15:56:28Z</CreatedDate>
</CLSPublicationObjectDataBE>
</PublicationObjectData>
</CLSMIPIPublicationObjectBE>
<CLSMIPIPublicationObjectBE>
<PublicationObjectName>LNG Capacity</PublicationObjectName>
<PublicationObjectData>
<CLSPublicationObjectDataBE>
<ApplicableAt>2016-03-12T15:30:00Z</ApplicableAt>
<ApplicableFor>2016-03-12T00:00:00Z</ApplicableFor>
<Value>6515042480</Value>
<GeneratedTimeStamp>2016-03-12T16:00:00Z</GeneratedTimeStamp>
<QualityIndicator></QualityIndicator>
<Substituted>N</Substituted>
<CreatedDate>2016-03-12T16:00:20Z</CreatedDate>
</CLSPublicationObjectDataBE>
</PublicationObjectData>
</CLSMIPIPublicationObjectBE>
</GetPublicationDataWMResult>
</GetPublicationDataWMResponse>
</soap:Body>
</soap:Envelope>
Using your existing code, I just added the following:
from bs4 import BeautifulSoup

res = getXML()
# html.parser lowercases tag names, hence the .lower() in the lookup below
soup = BeautifulSoup(res, 'html.parser')
searchTerms = ['PublicationObjectName', 'ApplicableAt', 'ApplicableFor', 'Value']
# LNG Stock Level,2016-03-13T15:00:07Z, 2016-03-12T00:00:00Z, 7050.42286, ...
for st in searchTerms:
    # find() returns only the first matching tag in the document
    print(st, soup.find(st.lower()).contents[0], sep='\t')
Output:
PublicationObjectName LNG Stock Level
ApplicableAt 2016-03-13T15:00:07Z
ApplicableFor 2016-03-12T00:00:00Z
Value 7050.42286
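This is a frequently asked question in the XML + XPath area, and it concerns XML with a default namespace. An XML element that declares a default namespace, and its descendants without a prefix, implicitly belong to that same default namespace. In an XPath expression, to reference an element in a namespace you need to use a prefix that has been mapped to the corresponding namespace URI. Using lxml, the code would look roughly like this: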
from lxml import etree

root = etree.fromstring(getXML())
# map prefix 'd' to the default namespace URI
ns = {'d': 'http://www.NationalGrid.com/MIPI/'}
publication_objects = root.xpath('//d:CLSMIPIPublicationObjectBE', namespaces=ns)
for obj in publication_objects:
    name = obj.find('d:PublicationObjectName', ns).text
    data = obj.find('d:PublicationObjectData/d:CLSPublicationObjectDataBE', ns)
    applicable_at = data.find('d:ApplicableAt', ns).text
    applicable_for = data.find('d:ApplicableFor', ns).text
    # todo: extract other relevant data and process as needed
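
To make the namespace point concrete, here is a minimal sketch that turns the response into flat rows of the kind the question asks for. It is not the code from either answer. It uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages, and it parses a trimmed copy of the XML from the question rather than calling the service. The names SAMPLE, NS, FIELDS and extract_rows are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the response shown in the question (assumption: the live
# service returns the same structure, possibly with more records per object).
SAMPLE = """<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope">
  <soap:Body>
    <GetPublicationDataWMResponse xmlns="http://www.NationalGrid.com/MIPI/">
      <GetPublicationDataWMResult>
        <CLSMIPIPublicationObjectBE>
          <PublicationObjectName>LNG Stock Level</PublicationObjectName>
          <PublicationObjectData>
            <CLSPublicationObjectDataBE>
              <ApplicableAt>2016-03-13T15:00:07Z</ApplicableAt>
              <ApplicableFor>2016-03-12T00:00:00Z</ApplicableFor>
              <Value>7050.42286</Value>
            </CLSPublicationObjectDataBE>
          </PublicationObjectData>
        </CLSMIPIPublicationObjectBE>
        <CLSMIPIPublicationObjectBE>
          <PublicationObjectName>LNG Capacity</PublicationObjectName>
          <PublicationObjectData>
            <CLSPublicationObjectDataBE>
              <ApplicableAt>2016-03-12T15:30:00Z</ApplicableAt>
              <ApplicableFor>2016-03-12T00:00:00Z</ApplicableFor>
              <Value>6515042480</Value>
            </CLSPublicationObjectDataBE>
          </PublicationObjectData>
        </CLSMIPIPublicationObjectBE>
      </GetPublicationDataWMResult>
    </GetPublicationDataWMResponse>
  </soap:Body>
</soap:Envelope>"""

# Prefix 'd' is arbitrary; it only has to map to the default namespace URI.
NS = {"d": "http://www.NationalGrid.com/MIPI/"}
FIELDS = ("ApplicableAt", "ApplicableFor", "Value")


def extract_rows(xml_text):
    """Return one dict per CLSPublicationObjectDataBE record."""
    root = ET.fromstring(xml_text)
    rows = []
    for obj in root.iterfind(".//d:CLSMIPIPublicationObjectBE", NS):
        name = obj.findtext("d:PublicationObjectName", namespaces=NS)
        records = obj.iterfind(
            "d:PublicationObjectData/d:CLSPublicationObjectDataBE", NS
        )
        for rec in records:
            row = {"PublicationObjectName": name}
            for field in FIELDS:
                # findtext returns None if the child element is missing
                row[field] = rec.findtext("d:" + field, namespaces=NS)
            rows.append(row)
    return rows


# Without the prefix the lookup finds nothing, which is why the original
# /Envelope/Body/... path failed.
assert ET.fromstring(SAMPLE).find(".//CLSMIPIPublicationObjectBE") is None

rows = extract_rows(SAMPLE)
for row in rows:
    print(",".join(row.values()))
```

The same namespace dictionary can be passed to lxml's find and xpath calls, so the loop should carry over with little change. A list of dicts like rows can also be handed to pandas.DataFrame(rows) if a DataFrame is the target.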