Python: Getting column names from malformed XML

Tags: python, pandas, beautifulsoup, xml-parsing


My XML file is not well-formed. When I try to read it like this:

import xml.etree.ElementTree as ET
ET.parse(r'my.xml')

I get the following error:

ParseError: not well-formed (invalid token): line 2034, column 317

So I used BeautifulSoup to read the XML instead, with the following code:

from bs4 import BeautifulSoup

with open(r'my.xml') as fp:
    soup = BeautifulSoup(fp, 'xml')

If I print soup, it looks like this:

<Placemark>
    <name>India </name>
    <description>Country</description>
    <styleUrl>#icon-962-B29189</styleUrl>
</Placemark>
<Placemark>
    <name>USA</name>
    <styleUrl>#icon-962-B29189</styleUrl>
</Placemark>
<Placemark>
    <description>City</description>
    <styleUrl>#icon-962-B29189</styleUrl>
</Placemark>

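The problem is that some Placemark tags have no name tag or no description tag at all. As a result I cannot tell which name goes with which description: the missing tags cause the names and descriptions to be mismatched.

Expected output DataFrame: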

Name      Description
India     Country
USA
           City

Is there any way I can achieve this?

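Because you are searching for the name and description tags separately, you cannot keep track of which name belongs to which description.

Instead, parse each Placemark tag individually, and handle the case where a Placemark has no name or no description tag: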

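import pandas as pd

data = []

# Tag names are case-sensitive with the 'xml' parser, so match 'Placemark' exactly.
for placemark in soup.find_all('Placemark'):
    # find() returns None when the tag is missing, so .text raises AttributeError.
    try:
        name = placemark.find('name').text.strip()
    except AttributeError:
        name = None
    try:
        description = placemark.find('description').text.strip()
    except AttributeError:
        description = None

    # One row per Placemark keeps each name paired with its own description.
    data.append((name, description))

df = pd.DataFrame(data, columns=['Name', 'Description'])
print(df)
#       Name    Description
#  0   India        Country
#  1     USA           None
#  2    None           City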
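
The same per-Placemark idea can be written without the repeated try/except by checking for a missing tag explicitly. This is only a sketch: child_text and placemark_rows are hypothetical helper names, the sample string mirrors the data from the question, and the kml/Document wrapper around it is an assumption (the 'xml' parser needs lxml installed and expects a single root element).

```python
from bs4 import BeautifulSoup  # the "xml" feature also requires lxml

# Sample mirroring the question's data; the <kml>/<Document> wrapper is assumed.
SAMPLE = """<kml><Document>
<Placemark><name>India </name><description>Country</description><styleUrl>#icon-962-B29189</styleUrl></Placemark>
<Placemark><name>USA</name><styleUrl>#icon-962-B29189</styleUrl></Placemark>
<Placemark><description>City</description><styleUrl>#icon-962-B29189</styleUrl></Placemark>
</Document></kml>"""


def child_text(parent, tag_name):
    """Return the stripped text of the first matching descendant, or None if absent."""
    child = parent.find(tag_name)
    return child.get_text(strip=True) if child is not None else None


def placemark_rows(xml_text):
    """Build one (name, description) tuple per Placemark, keeping the pairing intact."""
    soup = BeautifulSoup(xml_text, "xml")
    return [
        (child_text(pm, "name"), child_text(pm, "description"))
        for pm in soup.find_all("Placemark")
    ]


rows = placemark_rows(SAMPLE)
print(rows)
```

The resulting rows can be passed to pd.DataFrame(rows, columns=['Name', 'Description']) exactly as in the answer above.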

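
Separately, to find out what makes the file fail in ElementTree in the first place, the position attribute of the ParseError can be used to pull out the offending line. A minimal sketch using only the standard library; locate_parse_error is a hypothetical helper name, and the sample string (with an unescaped ampersand) is made up for illustration, not taken from the real file.

```python
import xml.etree.ElementTree as ET


def locate_parse_error(xml_text):
    """Return (line, column, line_text) for the first parse error, or None if the text parses."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # (line, column) as reported by the parser; lines are counted from 1.
        line, column = err.position
        lines = xml_text.splitlines()
        # Guard against a position past the last line (e.g. an error at end of input).
        line_text = lines[line - 1] if 0 < line <= len(lines) else ""
        return line, column, line_text
    return None


# Made-up sample: a bare '&' is not allowed in XML and triggers an invalid-token error.
sample = "<kml>\n<name>Salt & Pepper</name>\n</kml>"
result = locate_parse_error(sample)
print(result)
```

For the file in the question, one would read my.xml into a string, pass it to the helper, and look at the returned line near the reported column.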