Converting XML to a DataFrame with Python; don't understand why it is not parsing the XML. Encoding issue?

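Your help is much appreciated. I have been busy for more than two days, browsing around to understand why I can't access this XML file and put its contents into a df. My goal is to get the worksheets in the XML file into DataFrames. I know several posts cover this topic, but I seem to be running into errors that complicate things.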

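The data was downloaded from a well-known ETF provider. It downloads with an ".xls" extension, but it is actually XML, apparently Excel XML. So a simple pd.read_excel will not work. That is how I was forced into XML and libraries such as lxml and xml.etree.ElementTree. I have worked with BS4 for a while, though.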
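To check what a downloaded ".xls" file really contains, one option is to look at its first bytes. The sketch below is only an illustration: `sniff_spreadsheet_format` and `sniff_file` are hypothetical helper names, and the check assumes the usual file signatures (OLE2 compound file for legacy binary .xls, zip for .xlsx).

```python
# Well-known file signatures ("magic bytes").
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy binary .xls
ZIP_MAGIC = b"PK\x03\x04"                          # .xlsx is a zip archive
UTF8_BOM = b"\xef\xbb\xbf"


def sniff_spreadsheet_format(head: bytes) -> str:
    """Guess the real format from the first bytes of a file (hypothetical helper)."""
    if head.startswith(OLE2_MAGIC):
        return "xls"
    if head.startswith(ZIP_MAGIC):
        return "xlsx"
    # Skip any UTF-8 byte order marks before looking for markup.
    while head.startswith(UTF8_BOM):
        head = head[len(UTF8_BOM):]
    if head.lstrip().startswith(b"<"):
        return "xml"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read only the first 64 bytes of `path` and classify them."""
    with open(path, "rb") as f:
        return sniff_spreadsheet_format(f.read(64))


print(sniff_spreadsheet_format(b'<?xml version="1.0"?><ss:Workbook'))  # xml
```

If the result is "xml", the file has to go through an XML parser rather than pd.read_excel, whatever its extension says.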

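The XML download does not specify any encoding, and when I try to parse it, it returns an error. So I tried chardet and et.XMLParser to discover its encoding and "hard-set" it in the parser, but to no avail. Parsing returns:

'lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1'

Instead of parsing it directly (see xml_tree1 below), I tried reading the XML with fromstring, and I noticed some gibberish. So I replaced it with nothing: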

xml_str = xml_file.read().replace('ï»¿', '')

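Now I have clean XML, but I still can't find any children in my root. In fact it seems empty and not parsed at all. My knowledge fails me here. Can someone push me in the right direction? My problem is at an early stage: I can't seem to parse the file and its underlying format. The second problem is that I need to parse the ss:Table on the individual worksheets in the document. Further down in the code I have jotted down some examples for my own use. Any comments are very welcome.

These are the posts that helped me the most.

The source of the XML can be found here (Dutch version). You can download it at the top right.

A snippet of the XML: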

<?xml version="1.0"?>
<ss:Workbook xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
<ss:Styles>
<ss:Style ss:ID="Default">
<ss:Alignment ss:Horizontal="Left"/>
</ss:Style>
<ss:Style ss:ID="wraptext">
<ss:Alignment ss:Horizontal="Left" ss:WrapText="1"/>
<ss:Font ss:Italic="1"/>
</ss:Style>
<ss:Style ss:ID="disclaimer">
<ss:Alignment ss:Vertical="Top" ss:WrapText="1"/>
</ss:Style>
<ss:Style ss:ID="DefaultHyperlink">
<ss:Alignment ss:Vertical="Center" ss:WrapText="1"/>
<ss:Font ss:Color="#0000FF" ss:Underline="Single" />
</ss:Style>
<ss:Style ss:ID="headerstyle">
<ss:Font ss:Bold="1" />
</ss:Style>
<ss:Style ss:ID="Date">
<ss:NumberFormat ss:Format="dd\-mmm\-yyyy"/>
</ss:Style>
<ss:Style ss:ID="Left">
<ss:Alignment ss:Horizontal="Left"/>
<ss:NumberFormat ss:Format="Standard"/>
</ss:Style>
<ss:Style ss:ID="Right">
<ss:Alignment ss:Horizontal="Right"/>
<ss:NumberFormat ss:Format="Standard"/>
</ss:Style>
</ss:Styles>
<ss:Worksheet ss:Name="Overzicht">
<ss:Table>
<ss:Row >
<ss:Cell ss:StyleID="headerstyle">
<ss:Data ss:Type="String">iShares Core MSCI World UCITS ETF</ss:Data>
</ss:Cell>
</ss:Row>

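I ended up with the code below.

It works for me, but I still don't understand why I can't parse the file directly and need to replace the gibberish in the string.

Thoughts?

Maybe I can make others happy with the code below. It took me too much time ;):

Cheers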

import io

import chardet
import lxml.etree as et
import pandas as pd

filepath = 'C:\\MSCI.xml'
namespace = '{urn:schemas-microsoft-com:office:spreadsheet}'
find_elem = 'Worksheet'
ws_name = 'Posities'

# Capture encoding
with open(filepath, 'rb') as f:
    raw = f.read()
xml_enc = chardet.detect(raw).get('encoding') or 'utf-8'
if xml_enc.upper() == 'UTF-8-SIG':
    # Decode as plain UTF-8 so that every byte order mark shows up as U+FEFF
    xml_enc = 'utf-8'

##########################################################################
### Parse the xml file, iterate through it, append and build dataframe ###
##########################################################################
# https://stackoverflow.com/questions/10242237/lxml-etree-iterparse-error-typeerror-reading-file-objects-must-return-plain-st
# https://stackoverflow.com/questions/36804794/iterparse-large-xml-using-python
# https://riptutorial.com/python/example/25995/opening-and-reading-large-xml-files-using-iterparse--incremental-parsing-
# https://stackoverflow.com/questions/28253006/python-element-tree-iterparse-filter-nodes-and-children
# https://stackoverflow.com/questions/12792998/elementtree-iterparse-strategy
# https://stackoverflow.com/questions/7018326/lxml-iterparse-in-python-cant-handle-namespaces
# https://stackoverflow.com/questions/38790012/how-to-get-all-the-tags-in-an-xml-using-python

# !!! iShares xml has an error in the first row: stray byte order marks !!!
# Decoded as UTF-8, each BOM is the character U+FEFF; remove all of them.
xml_str = raw.decode(xml_enc).replace('\ufeff', '')
xml_byte = io.BytesIO(xml_str.encode(xml_enc))

worksheet = []
# Use 'end' events only: at 'start' the children of an element are not parsed yet.
for event, elem in et.iterparse(xml_byte, recover=True, events=('end',),
                                tag=namespace + find_elem):
    # The worksheet name lives in the namespaced ss:Name attribute
    if elem.get(namespace + 'Name') != ws_name:
        continue
    for table in elem.iterfind(namespace + 'Table'):
        row_values = []
        for row in table.iterfind(namespace + 'Row'):
            cell_values = []
            for cell in row.iterfind(namespace + 'Cell'):
                for cell_data in cell.iterfind(namespace + 'Data'):
                    cell_values.append(cell_data.text)
            row_values.append(cell_values)
        worksheet.append(row_values)

# One DataFrame per table, stacked into a single frame
xml_df_concat = pd.concat([pd.DataFrame(rows) for rows in worksheet], ignore_index=True)
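
For smaller files, the same worksheet extraction can probably be written without iterparse. The following is a minimal sketch using the standard-library xml.etree.ElementTree and a namespace map; `worksheet_rows` is a hypothetical helper and `SAMPLE` is made-up data laid out like the snippet in the question, not the real download.

```python
import xml.etree.ElementTree as ET

NS = {"ss": "urn:schemas-microsoft-com:office:spreadsheet"}
SS = "{urn:schemas-microsoft-com:office:spreadsheet}"

# Made-up sample in the same layout as the snippet in the question.
SAMPLE = b"""<?xml version="1.0"?>
<ss:Workbook xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
 <ss:Worksheet ss:Name="Overzicht">
  <ss:Table>
   <ss:Row><ss:Cell><ss:Data ss:Type="String">Fund name</ss:Data></ss:Cell></ss:Row>
  </ss:Table>
 </ss:Worksheet>
 <ss:Worksheet ss:Name="Posities">
  <ss:Table>
   <ss:Row>
    <ss:Cell><ss:Data ss:Type="String">Ticker</ss:Data></ss:Cell>
    <ss:Cell><ss:Data ss:Type="String">Weight</ss:Data></ss:Cell>
   </ss:Row>
   <ss:Row>
    <ss:Cell><ss:Data ss:Type="String">AAA</ss:Data></ss:Cell>
    <ss:Cell><ss:Data ss:Type="Number">1.5</ss:Data></ss:Cell>
   </ss:Row>
  </ss:Table>
 </ss:Worksheet>
</ss:Workbook>"""


def worksheet_rows(xml_bytes, sheet_name):
    """Return the cell texts of one worksheet as a list of rows (hypothetical helper)."""
    root = ET.fromstring(xml_bytes)
    rows = []
    for sheet in root.findall("ss:Worksheet", NS):
        # Attributes are namespaced too, so ss:Name needs the full URI prefix.
        if sheet.get(SS + "Name") != sheet_name:
            continue
        for row in sheet.findall("ss:Table/ss:Row", NS):
            rows.append([d.text for d in row.findall("ss:Cell/ss:Data", NS)])
    return rows


rows = worksheet_rows(SAMPLE, "Posities")
print(rows)  # [['Ticker', 'Weight'], ['AAA', '1.5']]
```

If the first row holds the headers, something like pd.DataFrame(rows[1:], columns=rows[0]) should turn the result into a DataFrame. Note that cells without an ss:Data child are skipped here, so rows with empty cells may come out shorter than the others.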

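
The "gibberish" is two byte order marks. The first six bytes of the file are EF, BB, BF, EF, BB, BF. A single byte order mark is allowed (even though UTF-8 neither needs nor recommends one). Two BOMs make the file corrupt (at least from an XML point of view).

Thanks a lot! As a self-taught Pythonista I was not aware of these file BOMs. Thanks for teaching me. It is still embarrassing that a renowned asset manager puts corrupt files in its official documentation. I also learned that in UTF-8 a BOM is neither needed nor recommended.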
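
The effect of the doubled byte order mark can be reproduced on a small made-up payload. The sketch below assumes the gibberish came from decoding the file with a Windows code page (cp1252); `strip_utf8_boms` is a hypothetical helper, and the standard-library parser is used in place of lxml to keep the example self-contained.

```python
import xml.etree.ElementTree as ET

UTF8_BOM = b"\xef\xbb\xbf"

# Made-up payload imitating the damaged download: two BOMs, then the XML.
raw = UTF8_BOM * 2 + b'<?xml version="1.0"?><root><child/></root>'

# Decoded with a Windows code page, one BOM shows up as three odd characters.
gibberish = UTF8_BOM.decode("cp1252")

# The 'utf-8-sig' codec removes only the first BOM; the second one
# survives as the character U+FEFF in front of the XML declaration.
decoded = raw.decode("utf-8-sig")
leftover = decoded.startswith("\ufeff")


def strip_utf8_boms(data: bytes) -> bytes:
    """Remove every leading UTF-8 byte order mark (hypothetical helper)."""
    while data.startswith(UTF8_BOM):
        data = data[len(UTF8_BOM):]
    return data


root = ET.fromstring(strip_utf8_boms(raw))
print(repr(gibberish), leftover, root.tag, len(root))  # 'ï»¿' True root 1
```

Stripping the marks at the byte level avoids guessing the encoding first, which is why it may be a more robust starting point than replacing the decoded characters.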