Python 将xml反序列化为对象时出现问题-不希望被特殊字符分割
我试图将xml反序列化为对象,但在xml树中对各种项目进行编码时遇到了问题 XML示例:Python 将xml反序列化为对象时出现问题-不希望被特殊字符分割,python,xml,encoding,deserialization,Python,Xml,Encoding,Deserialization,我试图将xml反序列化为对象,但在xml树中对各种项目进行编码时遇到了问题 XML示例: <?xml version="1.0" encoding="utf-8"?> <results> <FlightTravel> <QuantityOfPassengers>6</QuantityOfPassengers> <Id>N5GWXM</Id> <InsuranceId>330
<?xml version="1.0" encoding="utf-8"?>
<results>
<FlightTravel>
<QuantityOfPassengers>6</QuantityOfPassengers>
<Id>N5GWXM</Id>
<InsuranceId>330992</InsuranceId>
<TotalTime>3h 00m</TotalTime>
<TransactionPrice>540.00</TransactionPrice>
<AdditionalPrice>0</AdditionalPrice>
<InsurancePrice>226.56</InsurancePrice>
<TotalPrice>9561.31</TotalPrice>
<CompanyName>XXXXX</CompanyName>
<TaxID>111-11-11-111</TaxID>
<InvoiceStreet>Jagiellońska</InvoiceStreet>
<InvoiceHouseNo>8</InvoiceHouseNo>
<InvoiceZipCode>Jagiellońska</InvoiceZipCode>
<InvoiceCityName>Warszawa</InvoiceCityName>
<PayerStreet>Jagiellońska</PayerStreet>
<PayerHouseNo>8</PayerHouseNo>
<PayerZipCode>11-111</PayerZipCode>
<PayerCityName>Warszawa</PayerCityName>
<PayerEmail>no-reply@xxxx.pl</PayerEmail>
<PayerPhone>123123123</PayerPhone>
<Segments>
<Segment0>
<DepartureAirport>WAW</DepartureAirport>
<DepartureDate>śr. 06 lip</DepartureDate>
<DepartureTime>07:50</DepartureTime>
<ArrivalAirport>VIE</ArrivalAirport>
<ArrivalDate>śr. 06 lip</ArrivalDate>
<ArrivalTime>09:15</ArrivalTime>
</Segment0>
<Segment1>
<DepartureAirport>VIE</DepartureAirport>
<DepartureDate>śr. 06 lip</DepartureDate>
<DepartureTime>10:00</DepartureTime>
<ArrivalAirport>SZG</ArrivalAirport>
<ArrivalDate>śr. 06 lip</ArrivalDate>
<ArrivalTime>10:50</ArrivalTime>
</Segment1>
</Segments>
</FlightTravel>
</results>
# -*- coding: utf-8 -*-
from lxml import etree
import codecs
class TitleTarget(object):
def __init__(self):
self.text = []
def start(self, tag, attrib):
self.is_title = True #if tag == 'Title' else False
def end(self, tag):
pass
def data(self, data):
if self.is_title:
self.text.append(data)
def close(self):
return self.text
parser = etree.XMLParser(target = TitleTarget())
infile = 'Flights.xml'
results = etree.parse(infile, parser)
out = open('wynik.txt', 'w')
out.write('\n'.join(results))
out.close()
输出:
<?xml version="1.0" encoding="utf-8"?>
<results>
<FlightTravel>
<QuantityOfPassengers>6</QuantityOfPassengers>
<Id>N5GWXM</Id>
<InsuranceId>330992</InsuranceId>
<TotalTime>3h 00m</TotalTime>
<TransactionPrice>540.00</TransactionPrice>
<AdditionalPrice>0</AdditionalPrice>
<InsurancePrice>226.56</InsurancePrice>
<TotalPrice>9561.31</TotalPrice>
<CompanyName>XXXXX</CompanyName>
<TaxID>111-11-11-111</TaxID>
<InvoiceStreet>Jagiellońska</InvoiceStreet>
<InvoiceHouseNo>8</InvoiceHouseNo>
<InvoiceZipCode>Jagiellońska</InvoiceZipCode>
<InvoiceCityName>Warszawa</InvoiceCityName>
<PayerStreet>Jagiellońska</PayerStreet>
<PayerHouseNo>8</PayerHouseNo>
<PayerZipCode>11-111</PayerZipCode>
<PayerCityName>Warszawa</PayerCityName>
<PayerEmail>no-reply@xxxx.pl</PayerEmail>
<PayerPhone>123123123</PayerPhone>
<Segments>
<Segment0>
<DepartureAirport>WAW</DepartureAirport>
<DepartureDate>śr. 06 lip</DepartureDate>
<DepartureTime>07:50</DepartureTime>
<ArrivalAirport>VIE</ArrivalAirport>
<ArrivalDate>śr. 06 lip</ArrivalDate>
<ArrivalTime>09:15</ArrivalTime>
</Segment0>
<Segment1>
<DepartureAirport>VIE</DepartureAirport>
<DepartureDate>śr. 06 lip</DepartureDate>
<DepartureTime>10:00</DepartureTime>
<ArrivalAirport>SZG</ArrivalAirport>
<ArrivalDate>śr. 06 lip</ArrivalDate>
<ArrivalTime>10:50</ArrivalTime>
</Segment1>
</Segments>
</FlightTravel>
</results>
# -*- coding: utf-8 -*-
from lxml import etree
import codecs
class TitleTarget(object):
def __init__(self):
self.text = []
def start(self, tag, attrib):
self.is_title = True #if tag == 'Title' else False
def end(self, tag):
pass
def data(self, data):
if self.is_title:
self.text.append(data)
def close(self):
return self.text
parser = etree.XMLParser(target = TitleTarget())
infile = 'Flights.xml'
results = etree.parse(infile, parser)
out = open('wynik.txt', 'w')
out.write('\n'.join(results))
out.close()
[6',N5GWXM',330992',3h 00m',540.00',0',226.56',9561.31',XXXXX',111-11-11-111',Jagiello',纡',ska',8',Jagiello',纡ska',Warszawa Jagiello',纡纡ska',8',11-111',Warszawa no-reply@xxxx.pl","123123123","WAW",ś,śr.06 lip","07:50",śVIEś,śr.06 lip",ś0:00、SZG、ś、r.06唇、10:50']
在“项中,Jagiellonńska”是特殊字符“项中的字符”。当解析器将数据附加到数组中时,字符“ń”是一种分裂字符,我的问题是为什么会发生这种情况?其余项都正确地附加到数组中。在“śr 06.lip”项中,情况完全相同。问题是
数据每个元素可能会多次调用目标类的code>方法。例如,如果进料器跨越块边界,则可能会发生这种情况。看起来当它遇到非ASCII字符时也会发生这种情况。这是一个古老的图例。我找不到在何处记录了这种情况。但是,如果将目标类更改为以下内容,我会我已经用你的数据测试过了
class TitleTarget(object):
def __init__(self):
self.text = []
def start(self, tag, attrib):
self.is_title = True #if tag == 'Title' else False
if self.is_title:
self.text.append(u'')
def end(self, tag):
pass
def data(self, data):
if self.is_title:
self.text[-1] += data
def close(self):
return self.text
为了更好地了解输出的内容,请在解析调用后执行print repr(results)
u'Jagiello\u0144ska\n '
u'\u015br. 06 lip\n '