How to read special characters into Python

python xml

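I am parsing an XML file that contains foreign special characters in some author names (&iacute; = í, &iuml; = ï, &ograve; = ò, etc.). When my code tries to process these characters it fails with the error "ExpatError: undefined entity". I have seen the BeautifulSoup library online, but I am not sure how to work it into my code easily without having to rewrite everything with the lxml library (if my understanding is correct). What is the best way to solve this? Cheers!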

XML data to load

<pub>
    <ID>75</ID>
    <title>Use of Lexicon Density in Evaluating Word Recognizers</title>
    <year>2000</year>
    <booktitle>Multiple Classifier Systems</booktitle>
    <pages>310-319</pages>
    <authors>
        <author>Petr Slav&iacute;k</author>
        <author>Venu Govindaraju</author>
    </authors>
</pub>


Python code

import sqlite3
con = sqlite3.connect("publications.db")
cur = con.cursor()

from xml.dom import minidom

xmldoc = minidom.parse("test.xml")

#loop through <pub> tags to find number of pubs to grab
root = xmldoc.getElementsByTagName("root")[0]
pubs = [a.firstChild.data for a in root.getElementsByTagName("pub")]
num_pubs = len(pubs)
count = 0

while(count < num_pubs):

    #get data from each <pub> tag
    temp_pub = root.getElementsByTagName("pub")[count]
    temp_ID = temp_pub.getElementsByTagName("ID")[0].firstChild.data
    temp_title = temp_pub.getElementsByTagName("title")[0].firstChild.data
    temp_year = temp_pub.getElementsByTagName("year")[0].firstChild.data
    temp_booktitle = temp_pub.getElementsByTagName("booktitle")[0].firstChild.data
    temp_pages = temp_pub.getElementsByTagName("pages")[0].firstChild.data
    temp_authors = temp_pub.getElementsByTagName("authors")[0]
    temp_author_array = [a.firstChild.data for a in temp_authors.getElementsByTagName("author")]
    num_authors = len(temp_author_array)
    count = count + 1


    #process results into sqlite
    pub_params = (temp_ID, temp_title)
    cur.execute("INSERT INTO publication (id, ptitle) VALUES (?, ?)", pub_params)
    journal_params = (temp_booktitle, temp_pages, temp_year)
    cur.execute("INSERT INTO journal (jtitle, pages, year) VALUES (?, ?, ?)", journal_params)
    x = 0
    while(x < num_authors):
        cur.execute("INSERT OR IGNORE INTO authors (name) VALUES (?)", (temp_author_array[x],))
        x = x + 1

    #display results
    print("\nEntry processed: ", count)
    print("------------------\nPublication ID: ", temp_ID)
    print("Publication Title: ", temp_title)
    print("Year: ", temp_year)
    print("Journal title: ", temp_booktitle)
    print("Pages: ", temp_pages)
    i = 0
    print("Authors: ")
    while(i < num_authors):
        print("-",temp_author_array[i])
        i = i + 1

con.commit()
con.close()    

print("\nNumber of entries processed: ", count)

If you are using Python 3.x, you can simply import the html module and use it to decode the data you extracted first.

html.unescape(s) converts all named and numeric character references (e.g. &gt;, &#62;, &#x3e;) in the string s to the corresponding Unicode characters.

>>> import html
>>> print(html.unescape("Petr Slav&iacute;k"))

Petr Slavík

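It seems that HTML entities such as &iacute; cannot be parsed by minidom and returned as a document object. You have to read the file, decode it, and then pass the result to the module as a string, as in the code below.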

xml.dom.minidom.parseString(string[, parser]) returns a Document that represents the string.

# Read the whole file, decode the HTML entities, then parse the resulting string
with open('test.xml', 'r') as f:
    file_text = html.unescape(f.read())
xmldoc = minidom.parseString(file_text)
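One caveat with the approach above: html.unescape also decodes &amp; and &lt;, which can turn otherwise valid XML into something the parser rejects. The following is a minimal sketch of a narrower replacement, assuming the file only uses named HTML entities and that the five predefined XML entities should be left alone. The names XML_PREDEFINED and replace_html_entities are made up for this illustration and are not part of any library.

```python
import re
from html.entities import name2codepoint
from xml.dom import minidom

# The five entities XML itself defines; these must stay escaped.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
_ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def replace_html_entities(text):
    """Replace named HTML entities with their characters (hypothetical helper).

    Predefined XML entities and unknown names are left untouched.
    """
    def _sub(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return chr(name2codepoint[name])

    return _ENTITY_RE.sub(_sub, text)


sample = (
    "<pub><title>Cats &amp; Dogs</title>"
    "<author>Petr Slav&iacute;k</author></pub>"
)
doc = minidom.parseString(replace_html_entities(sample))
author = doc.getElementsByTagName("author")[0].firstChild.data
title = doc.getElementsByTagName("title")[0].firstChild.data
print(author)  # Petr Slavík
print(title)   # Cats & Dogs
```

With this variant, &amp; reaches the XML parser still escaped and is decoded there, so a title containing an ampersand does not break parsing.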

UTF-8 supports most of these characters, so this should work. Add:

xmldoc = minidom.parse("test.xml").encode('UTF-8')

Just tried it and I still get the same error. Is this standard with Python? Or do I need to import something?

I don't think this will work, because the error comes from my parse() statement, so I would need to change something before the file is read.

You can retry with my updated code. It fails when you read the file, which I did not know before.

Awesome, that works great. In the full project I have about 2 million lines of XML to read. Is there a limit on how much this "file_text" variable can hold? Also, do you know whether I can use this html package to handle & in my XML? Or is there a package for that?

@douglasrcjames It depends on the memory size of the machine you are working on. You can also use xml.etree.ElementTree.iterparse() from xml.etree.ElementTree to parse an XML section into an element tree incrementally.
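For a file too large to hold comfortably in one string, an incremental variant might look like the sketch below. It swaps in xml.etree.ElementTree.XMLPullParser (a relative of iterparse that accepts data fed in pieces) so that each line can be entity-decoded before the parser sees it. It assumes the <pub> records sit under a single root element, that no entity reference is split across lines, and that only named HTML entities appear. The helper names decode_line and iter_pubs are hypothetical.

```python
import io
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def decode_line(line):
    """Decode named HTML entities in one line, keeping XML's own entities."""
    def _sub(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return chr(name2codepoint[name])

    return ENTITY_RE.sub(_sub, line)


def iter_pubs(lines):
    """Yield one dict per <pub> element while reading the input line by line."""
    parser = ET.XMLPullParser(events=("end",))

    def drain():
        for _event, elem in parser.read_events():
            if elem.tag == "pub":
                yield {
                    "id": elem.findtext("ID"),
                    "title": elem.findtext("title"),
                    "authors": [a.text for a in elem.iter("author")],
                }
                elem.clear()  # drop the children of the finished record

    for line in lines:
        parser.feed(decode_line(line))
        yield from drain()
    parser.close()
    yield from drain()  # pick up any events delivered only at close


sample = io.StringIO(
    "<root>\n<pub>\n<ID>75</ID>\n<title>Cats &amp; Dogs</title>\n"
    "<authors>\n<author>Petr Slav&iacute;k</author>\n"
    "<author>Venu Govindaraju</author>\n</authors>\n</pub>\n</root>\n"
)
pubs = list(iter_pubs(sample))
print(pubs[0]["authors"])  # ['Petr Slavík', 'Venu Govindaraju']
```

In real use the StringIO object would be replaced by an open file handle, for example `with open("test.xml") as f: for pub in iter_pubs(f): ...`. Cleared elements still leave empty shells attached to the root, so memory use grows slowly rather than staying flat.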