Trying to parse really bad XML with Python

python xml


After purchasing a domain name, I am trying to parse the XML output with Python. So far I have:

#!/usr/bin/python

import sys
from BeautifulSoup import BeautifulSoup, BeautifulStoneSoup

file = sys.argv[1]
xml = open(file).read()
soup = BeautifulStoneSoup(xml)
response = soup.find('ApiResponse')

print response

The XML output I'm working with is very badly formed and definitely needs cleaning up:

ok: [162.243.95.241] => {"cache_control": "private", "changed": false, "content": "<?xml version=\"1.0\" encoding=\"utf-8\"?>\r\n<ApiResponse Status=\"OK\" xmlns=\"http://api.namecheap.com/xml.response\">\r\n  <Errors />\r\n  <Warnings />\r\n  <RequestedCommand>namecheap.domains.create</RequestedCommand>\r\n  <CommandResponse Type=\"namecheap.domains.create\">\r\n    <DomainCreateResult Domain=\"123er321test.com\" Registered=\"true\" ChargedAmount=\"8.1800\" DomainID=\"33404\" OrderID=\"414562\" TransactionID=\"679462\" WhoisguardEnable=\"false\" FreePositiveSSL=\"false\" NonRealTimeDomain=\"false\" />\r\n  </CommandResponse>\r\n  <Server>WEB1-SANDBOX1</Server>\r\n  <GMTTimeDifference>--5:00</GMTTimeDifference>\r\n  <ExecutionTime>9.008</ExecutionTime>\r\n</ApiResponse>", "content_length": "647", "content_location": "https://api.sandbox.namecheap.com/xml.response", "content_type": "text/xml; charset=utf-8", "date": "Thu, 21 Nov 2013 03:23:51 GMT", "item": "", "redirected": false, "server": "Microsoft-IIS/7.0", "status": 200, "x_aspnet_version": "4.0.30319", "x_powered_by": "ASP.NET"}

I'm trying to find the ApiResponse Status, which is either ERROR or OK.
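There's absolutely nothing wrong with the XML there.

The problem is that the XML is embedded in JSON, which is itself embedded in some kind of object that I don't immediately recognize. (I suspect you've just dumped out the repr of some object from whatever framework you used to make the request, which is a silly thing to do…)

So, parse the top-level thing in the appropriate way, whatever format it is. (If you have no idea where it came from, it looks like you can get away with just doing .partition('=>')[-1].) Then parse the JSON with json.loads. Then take the ['content'] of the resulting dict, which is the XML, and parse that with BeautifulSoup. Then you're done.

In other words: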
>>> import json
>>> thingy = r''' ok: [162.243.95.241] => {"cache_control": "private", "changed": false, "content": "<?xml version=\"1.0\" encoding=\"utf-8\"?>\r\n<ApiResponse Status=\"OK\" xmlns=\"http://api.namecheap.com/xml.response\">\r\n  <Errors />\r\n  <Warnings />\r\n  <RequestedCommand>namecheap.domains.create</RequestedCommand>\r\n  <CommandResponse Type=\"namecheap.domains.create\">\r\n    <DomainCreateResult Domain=\"123er321test.com\" Registered=\"true\" ChargedAmount=\"8.1800\" DomainID=\"33404\" OrderID=\"414562\" TransactionID=\"679462\" WhoisguardEnable=\"false\" FreePositiveSSL=\"false\" NonRealTimeDomain=\"false\" />\r\n  </CommandResponse>\r\n  <Server>WEB1-SANDBOX1</Server>\r\n  <GMTTimeDifference>--5:00</GMTTimeDifference>\r\n  <ExecutionTime>9.008</ExecutionTime>\r\n</ApiResponse>", "content_length": "647", "content_location": "https://api.sandbox.namecheap.com/xml.response", "content_type": "text/xml; charset=utf-8", "date": "Thu, 21 Nov 2013 03:23:51 GMT", "item": "", "redirected": false, "server": "Microsoft-IIS/7.0", "status": 200, "x_aspnet_version": "4.0.30319", "x_powered_by": "ASP.NET"}'''
>>> j = thingy.partition('=>')[-1]
>>> obj = json.loads(j)
>>> xml = obj['content']
>>> soup = BeautifulSoup(xml)
>>> soup
<?xml version="1.0" encoding="utf-8"?>
<apiresponse status="OK" xmlns="http://api.namecheap.com/xml.response">
<errors></errors>
<warnings></warnings>
<requestedcommand>namecheap.domains.create</requestedcommand>
<commandresponse type="namecheap.domains.create">
<domaincreateresult chargedamount="8.1800" domain="123er321test.com" domainid="33404" freepositivessl="false" nonrealtimedomain="false" orderid="414562" registered="true" transactionid="679462" whoisguardenable="false"></domaincreateresult>
</commandresponse>
<server>WEB1-SANDBOX1</server>
<gmttimedifference>--5:00</gmttimedifference>
<executiontime>9.008</executiontime>
</apiresponse>
>>> soup.find('apiresponse')['status']
'OK'
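The same three steps can also be sketched with only the Python 3 standard library, using xml.etree.ElementTree in place of BeautifulSoup. This is a minimal sketch, not the code from the answer: the function name extract_status and the shortened sample line are made up for illustration, and it assumes the captured line always has the "=> {json}" shape shown in the question.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for the captured output line in the question.
raw = (
    r'ok: [162.243.95.241] => {"changed": false, "content": '
    r'"<?xml version=\"1.0\" encoding=\"utf-8\"?>\r\n'
    r'<ApiResponse Status=\"OK\" xmlns=\"http://api.namecheap.com/xml.response\">\r\n'
    r'  <Errors />\r\n'
    r'  <Server>WEB1-SANDBOX1</Server>\r\n'
    r'</ApiResponse>", "status": 200}'
)


def extract_status(line):
    """Return the Status attribute of the ApiResponse root element."""
    # Step 1: drop the "ok: [host] =>" wrapper and keep the JSON payload.
    _, sep, payload = line.partition("=>")
    if not sep:
        raise ValueError("no '=>' separator found in input")
    # Step 2: decode the JSON; this also un-escapes the embedded XML string.
    obj = json.loads(payload)
    # Step 3: parse the XML. Passing bytes lets the parser honour the
    # encoding named in the XML declaration.
    root = ET.fromstring(obj["content"].encode("utf-8"))
    # Unprefixed attributes are not namespaced, so a plain name works here.
    return root.get("Status")


print(extract_status(raw))
```

Unlike the lowercased output above, ElementTree keeps the original case, so the attribute is looked up as Status. Child elements would need the namespace in their names, for example {http://api.namecheap.com/xml.response}Server.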

Everything after the "=>" is JSON. Try using a JSON parser to extract the xml attribute, and then you'll need an XML parser to crack that.

Is this still the same problem? If so, there's no need for two questions asking the same thing…

I'll answer this myself once I figure it out, because I've realized that what I need to do is write an Ansible module, which is why I asked this question.

Thanks @abarnert! This did the trick, but I needed to replace z.partition in the second line with thingy.partition. Huge help!

@DavidNeudorfer: Sorry, yes, I first had it in a variable called z, then stuck it in thingy because that seemed more readable, and copied and pasted the wrong line. Fixed.