Python BeautifulSoup Tag, strings and UnicodeEncodeError are not so beautiful


python utf-8 web-scraping


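This morning I spent a few frustrating hours trying to process strings extracted from a scraped web page. I can't seem to find a consistent way to lowercase the extracted strings so that I can check them for keywords, and it's driving me up the wall.

Here is the code snippet that retrieves the text from the DOM element: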

temp = i.find('div', 'foobar').find('div')
if temp is not None and temp.contents is not None:
    temp2 = whitespace.sub(' ', temp.contents[0])
    content = str(temp2)

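That code raises the following error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 150: ordinal not in range(128)

I also tried the following statements, none of which worked; i.e. they result in the same error being thrown: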

content = (str(temp2)).decode('utf-8').lower()
content = str(temp2.decode('utf-8')).lower()

Does anyone know how to convert the text contained in a BeautifulSoup Tag to lowercase ASCII, so that I can do a case-insensitive search for keywords?

You may want ASCII, but what you need is Unicode, and most likely you already have it. XML parsers return unicode objects.

First, do print type(temp2). It should be unicode, unless something unfortunate has happened, like that whitespace.sub() thingy; what is that?
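To see where the error in the question comes from, here is a minimal sketch. It assumes Python 3, where str is already a Unicode type, so the ASCII encoding step that Python 2's str() performed implicitly on a unicode object has to be written out by hand. The sample text and variable names are made up for illustration.

```python
# Hypothetical sample text containing a no-break space (U+00A0).
text = "Foo\xa0Bar"

error_message = None
try:
    # Python 2's str(unicode_obj) did this ASCII encode behind the scenes.
    text.encode("ascii")
except UnicodeEncodeError as exc:
    # The character has no ASCII representation, so encoding fails.
    error_message = str(exc)

print(error_message)
```

The failure has nothing to do with lowercasing; it happens as soon as non-ASCII text is forced into ASCII.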

If you want to normalize runs of whitespace characters to a single space, do this:

temp2 = u' '.join(temp.contents[0].split())

This will make the pesky u'\xA0' go away, because it is whitespace (a no-break space).
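A small sketch of that split-and-join idiom, assuming Python 3 (where every str is Unicode, so no u prefix is needed). The helper name normalize_whitespace and the sample string are hypothetical, not part of the original code.

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace into one ordinary space."""
    # split() with no arguments splits on any Unicode whitespace,
    # including the no-break space, and drops leading/trailing runs.
    return " ".join(text.split())


# Hypothetical scraped text with no-break spaces, a newline and a tab.
raw = "  Hello\xa0\xa0World \n again\t"
cleaned = normalize_whitespace(raw)
print(repr(cleaned))
```

Because split() discards the separators, the no-break spaces never reach the joined result.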

Then try:

content = temp2.lower()
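Putting the two steps together, a case-insensitive keyword check might look like the sketch below. It assumes Python 3 and works on plain strings rather than on a real BeautifulSoup Tag; the function name contains_keyword and the sample inputs are hypothetical.

```python
def contains_keyword(text, keywords):
    """Return True if any keyword occurs in text, ignoring case and spacing."""
    # Normalize whitespace first, then lowercase; the text stays Unicode
    # throughout, so no ASCII conversion is ever attempted.
    normalized = " ".join(text.split()).lower()
    return any(keyword.lower() in normalized for keyword in keywords)


# Hypothetical text as it might come out of a scraped element.
scraped = "Big\xa0SALE   Today"
found = contains_keyword(scraped, ["big sale"])
missing = contains_keyword(scraped, ["refund"])
print(found, missing)
```

Keeping the text as Unicode end to end is the design choice here: the search never needs an ASCII version of the string.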