UnicodeError when trying to add text to a new element in XML with Python

python xml wordpress unicode

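I've been struggling with this problem for the past several hours. I've had a hard time working through the documentation and existing forum posts to resolve it, so I figured I'd try this place as a last-ditch effort before giving up.

Basically, the task at hand is to open a file (actually many files) containing the full text that I want to put into a new XML element. The text files were themselves created by a Python script, which handled UTF-16 and UTF-8 just fine. However, it seems that whenever I try to read the text content into memory in order to put it into a new XML tag (as opposed to writing it out to a new text file, as I did before), the following error message is thrown: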

"Traceback (most recent call last):
  File "K:\Users\Johnny\My Documents\PythonSandbox\websiteMigrationScripts\createXmlFile.py", line 87, in <module>
root[k][directionsBodyIndex].text = '<![CDATA[' + "".join(directionsBuffer) + ']]>'
  File "src/lxml/lxml.etree.pyx", line 1031, in lxml.etree._Element.text.__set__ (src\lxml\lxml.etree.c:55337)
  File "src/lxml/apihelpers.pxi", line 711, in lxml.etree._setNodeText (src\lxml\lxml.etree.c:24657)
  File "src/lxml/apihelpers.pxi", line 699, in lxml.etree._createTextNode (src\lxml\lxml.etree.c:24506)
  File "src/lxml/apihelpers.pxi", line 1431, in lxml.etree._utf8 (src\lxml\lxml.etree.c:32293)
  UnicodeEncodeError: 'utf-8' codec can't encode character '\udc92' in position 1862: surrogates not allowed"
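For context, here is a minimal sketch of how a character such as '\udc92' can come about. It assumes (this is a guess, not something the traceback confirms) that the file contains a raw byte 0x92 that is not valid UTF-8; `errors="surrogateescape"` then smuggles that byte into the string as a lone surrogate, which a strict UTF-8 encoder later refuses to encode. The sample bytes are made up for illustration.

```python
# Hypothetical sample: a single 0x92 byte in otherwise ASCII text.
raw = b"O\x92Shaughnessy"

# surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
text = raw.decode("utf-8", errors="surrogateescape")
print(ascii(text))

# A strict UTF-8 encode rejects lone surrogates.
reason = None
try:
    text.encode("utf-8")
except UnicodeEncodeError as exc:
    reason = exc.reason
print(reason)

# Encoding with the same error handler restores the original bytes.
round_trip = text.encode("utf-8", errors="surrogateescape")
```

If this is what is happening, the string only looks decoded: it still carries the original bytes in disguise, and any strict UTF-8 encode of it will fail in the same way.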

My script looks like this:

from bs4 import BeautifulSoup
import os, codecs
import imageFilesSub
import utf16FilesList
import openpyxl, lxml
from openpyxl.utils import get_column_letter, column_index_from_string

# First get the list of files to parse
filesDir = r'K:\Users\Johnny\My Documents\_World_of_Waterfalls\Website\tier 2 pages\tier 3 pages\tier 4 pages'
filesInDir = os.listdir(filesDir)
filesOutDir = r'.\blogsToParse'
filesToParse = []
for file in filesInDir:
    if (file.endswith('-template.html')) and not('travel-blog' in file) and not('accommodations' in file) and not('best-time-to-visit' in file) and not('activities' in file) and not('how-to-get-there' in file) and not('planning-and-preparing' in file) and not('restaurants' in file) and not('which-side' in file) and not('books-and-maps' in file):
        filesToParse.append(file)

# Then get a list of (unique) slugs that represent a unique row entry in the WoW Database
wowDatabaseDir = r'K:\Users\Johnny\My Documents\_World_of_Waterfalls\WordPressSite'
wowSpreadsheet = r'WoW Database for WP.xlsm'
wb = openpyxl.load_workbook(wowDatabaseDir + '\\' + wowSpreadsheet, data_only=True)
sheet = wb.active

# the following loop returns to maxRow the highest non-empty row
maxRow = 1  # openpyxl indexes from 1 not 0
for i in range(1, sheet.max_row): 
    if sheet.cell(row=i, column=33).value is None:
        pass
    else:
        maxRow = maxRow + 1

# now make a list containing the directory names of the writeups
writeupDirs = []
slugList = []
for i in range(3, maxRow + 1):
    writeupDirs.append(sheet.cell(row=i, column=18).value)
    slugList.append(sheet.cell(row=i, column=33).value)

from lxml import etree

xmlFile = 'WoW Database for WP 2017-01-01.xml'
data_file = wowDatabaseDir + '\\' + xmlFile
tree = etree.ElementTree(file=data_file)
root = tree.getroot()

k = 0
for element in root:
    try:
        element.attrib[root[k][0].tag] = root[k][0].text  # this puts Entry_No as an attribute of Row
        element.attrib[root[k][1].tag] = root[k][1].text  # this puts Waterfall Name as an attribute of Row
        root[k].append(etree.Element("Introduction_Body"))
        root[k].append(etree.Element("Directions_Body"))

        # need to go through some hoops and hurdles just to find the index of the desired tag (there must be a better way)
        children = []
        for child in root[k]:
            children.append(child.tag)
        fileDirIndex = children.index('File_directory')
        postSlugIndex = children.index('Post_Slug')
        introFilePtrIndex = children.index('Introduction_File_Ptr')
        introBodyIndex = children.index('Introduction_Body')
        introFile = wowDatabaseDir + '\\' + root[k][fileDirIndex].text + '\\' + root[k][introFilePtrIndex].text
        if root[k][postSlugIndex].text in utf16FilesList.utf16List:  # check the slug for unicode special handling
            inFile = open(introFile, 'r', encoding="utf-16", errors="surrogateescape")  # utf-16 works for Chinese, but not anything else
        else:
            inFile = open(introFile, 'r', encoding="utf-8", errors="surrogateescape")
        introBuffer = []
        for line in inFile:
            introBuffer.append(line)
        root[k][introBodyIndex].text = '<![CDATA[' + "".join(introBuffer) + ']]>'
        inFile.close()

        directionsFilePtrIndex = children.index('Directions_File_Ptr')
        directionsBodyIndex = children.index('Directions_Body')
        directionsFile = wowDatabaseDir + '\\' + root[k][fileDirIndex].text + '\\' + root[k][directionsFilePtrIndex].text
        if root[k][postSlugIndex].text in utf16FilesList.utf16List:  # check the slug for unicode special handling
            inFile = open(directionsFile, 'r', encoding="utf-16", errors="surrogateescape")  # utf-16 works for Chinese, but not anything else
        else:
            inFile = open(directionsFile, 'r', encoding="utf-8", errors="surrogateescape")
        directionsBuffer = []
        for line in inFile:
            directionsBuffer.append(line)
        root[k][directionsBodyIndex].text = '<![CDATA[' + "".join(directionsBuffer) + ']]>'
        inFile.close()
    except IndexError:
        pass
    k = k+1
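On the "there must be a better way" comment in the script: a child element can usually be looked up by tag name directly instead of building a list of tags and calling `index`. The sketch below uses the standard library's `xml.etree.ElementTree` so that it runs without extra packages; `lxml.etree` elements offer `find` and `findtext` with the same meaning. The sample row and its values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical row; the tag names are taken from the script, the values are made up.
row = ET.fromstring(
    "<Row>"
    "<Entry_No>1</Entry_No>"
    "<File_directory>wapama-falls</File_directory>"
    "<Post_Slug>wapama-falls</Post_Slug>"
    "</Row>"
)

# find() returns the first direct child with the given tag, or None if absent.
file_dir = row.find("File_directory").text

# findtext() returns that child's text in a single call.
slug = row.findtext("Post_Slug")

# SubElement creates the child, appends it, and returns it, so no index is needed.
body = ET.SubElement(row, "Directions_Body")
body.text = "placeholder text"
```

As a side note, assigning a string that starts with '<![CDATA[' to `.text` stores those characters as ordinary text, so they are escaped on output. With lxml, wrapping the value as `etree.CDATA(...)` is the usual way to get a real CDATA section.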

The problematic text file (at least the one for the first tag) looks like this:

<div class="ad-right">[adrotate banner="17"]</div>

Wapama Falls sits in Hetch Hetchy, which is in the remote northwest corner of Yosemite National Park.  We generally drive up to Yosemite Valley from Los Angeles before getting up to Hetch Hetchy so we'll describe this route first.  It typically takes us about 6 hours to make the drive from <a rel="nofollow" href="http://www.booking.com/searchresults.html?city=20014181&aid=825833" target="_blank">Los Angeles</a> to Yosemite Valley.  We normally go from <a rel="nofollow" href="http://www.booking.com/searchresults.html?city=20014181&aid=825833" target="_blank">Los Angeles</a> to <a rel="nofollow" href="http://www.booking.com/searchresults.html?city=20013079&aid=825833" target="_blank">Fresno</a> via the I-5 and Hwy 99, then through <a rel="nofollow" href="http://www.booking.com/searchresults.html?city=20014886&aid=825833" target="_blank">Oakhurst</a> and <a rel="nofollow" href="http://www.booking.com/searchresults.html?city=20016735&aid=825833" target="_blank">Wawona</a> via the Hwy 41.  Once in Yosemite Valley, we'd drive west towards the Big Oak Flat Road where the Hwy 120 and Hwy 140 junction.  Then, we'd drive uphill on the Hwy 140 towards the Big Oak Flat Entrance (the Northwest Entrance), where we'd leave the park. 

From the Big Oak Flat Entrance on the Big Oak Flat Road (Route 120), we'd shortly have to turn right at the signed turnoff for <a rel="nofollow" href="http://www.booking.com/searchresults.html?city=20014336&aid=825833" target="_blank">Mather</a> and the Evergreen Road.  Then, we'd follow Evergreen Road for 7.5 miles to its junction with Hetch Hetchy Road in <a rel="nofollow" href="http://www.booking.com/searchresults.html?city=20014336&aid=825833" target="_blank">Mather</a>.  Turning right onto Hetch Hetchy Road, we'd follow it to the parking lot by the O’Shaughnessy Dam after about seven miles.  On the way, we'd have passed through another entrance fee station.  The two-lane road was a bit narrow in places so we had to drive slowly.  Eventually, we'd reach a car park next to the dam.  The drive from Yosemite Valley to the car park at the O'Shaugnessy Dam took us less than 90 minutes.

From <a rel="nofollow" href="http://www.booking.com/searchresults.html?city=20015732&aid=825833" target="_blank">San Francisco</a>, we'd drive east towards <a rel="nofollow" href="http://www.booking.com/searchresults.html?city=20015274&aid=825833" target="_blank">Pleasanton</a>, then continue east on the I-205 towards the Hwy 120 passing through <a rel="nofollow" href="    http://www.booking.com/searchresults.html?city=20013298&aid=825833" target="_blank">Groveland</a> and eventually through the town of <a rel="nofollow" href="http://www.booking.com/searchresults.html?city=20014336&aid=825833" target="_blank">Mather</a>.  Once we were east of <a rel="nofollow" href="http://www.booking.com/searchresults.html?city=20014336&aid=825833" target="_blank">Mather</a>, we'd follow the road to the O'Shaugnessy Dam as described above.  Overall, this drive would take around 4 hours without traffic.
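One possible reading of the error, offered as an assumption rather than a diagnosis: the sample text contains a curly apostrophe (in "O’Shaughnessy"), and in Windows-1252 that character is the single byte 0x92, which would match the '\udc92' in the traceback. If the file really is Windows-1252, decoding strictly and falling back to `cp1252` would yield proper text instead of a lone surrogate. The helper name `decode_with_fallback` and the fallback order are hypothetical.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252")):
    """Decode bytes strictly, trying each encoding in turn.

    The order is an assumption: UTF-8 first, because it rejects most
    non-UTF-8 input, then Windows-1252 as a guess for legacy files.
    """
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Hypothetical sample bytes: 0x92 is the right single quotation mark in cp1252.
sample = b"O\x92Shaughnessy Dam"
decoded = decode_with_fallback(sample)
print(ascii(decoded))

# The result contains no surrogates, so a strict UTF-8 encode succeeds.
encoded = decoded.encode("utf-8")
```

To use this with files, read them in binary mode (`open(path, "rb")`) and pass the bytes to the helper. A strict decode also surfaces encoding problems at read time instead of later, when the XML text is set.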
