Python and string accents

python utf-8

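I am building a web scraper. I run a Google search, collect the links to the result pages, and then fetch the contents of each page's <title> tag. The problem is that a string such as "P\xe1gina N\xe3o Encontrada!" should be "Página Não Encontrada!". I tried decoding it as latin-1 and then encoding it as utf-8, but that did not work.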

    r2 = requests.get(item_str)
    texto_pagina = r2.text
    soup_item = BeautifulSoup(texto_pagina,"html.parser")
    empresa = soup_item.find_all("title")
    print(empresa_str.decode('latin1').encode('utf8'))

Can you help me?

Thanks

You can change the variable holding the retrieved text to something like:

    string = u'P\xe1gina N\xe3o Encontrada!'.encode('utf-8')

After printing string, it looked fine to me.
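One possible explanation for why a single string prints correctly while the asker still sees `\xe1`: printing a list shows the repr of each element, and in Python 2 that repr escapes non-ASCII characters. The sketch below assumes Python 3, where `ascii()` reproduces that escaped form. The `titles` list is a made-up stand-in for the result of `find_all("title")`, not the asker's actual data.

```python
# Hypothetical stand-in for the list of titles scraped from the pages.
titles = ["Página Não Encontrada!"]

# ascii() mimics the escaped repr that Python 2 shows when a whole list is printed.
escaped = ascii(titles)
print(escaped)

# Printing (or joining) the individual strings shows the real characters.
joined = ", ".join(titles)
print(joined)
```

If this is what is happening, the text itself is already correct and only its display form is escaped, so no decode or encode call would be needed.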

Edit

Have you tried using only empresa_str.decode('latin1'), without appending .encode('utf8')? For example:

    string = empresa_str.decode('latin_1')
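To make the decode and encode steps concrete, here is a minimal sketch assuming Python 3, where `decode` exists only on `bytes`. The `raw` value is an assumed example of Latin-1 bytes as they might arrive from a server; it is not taken from the asker's program.

```python
# Assumed input: the page title as raw Latin-1 encoded bytes.
raw = b"P\xe1gina N\xe3o Encontrada!"

# Decoding turns bytes into text; this is the step that gives the bytes a meaning.
text = raw.decode("latin-1")
print(text)

# Encoding turns text back into bytes, here as UTF-8 for storage or transmission.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)
```

With requests, `r2.content` holds the undecoded bytes while `r2.text` is already a decoded string, so decoding `r2.text` a second time is usually unnecessary.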

This is not the most elegant solution, but it worked for me:

    import string  # Python 2 only: string.find() was removed in Python 3

    def remove_all(substr, str):
        # Delete every occurrence of substr from str.
        index = 0
        length = len(substr)
        while string.find(str, substr) != -1:
            index = string.find(str, substr)
            str = str[0:index] + str[index+length:]
        return str

    def latin1_to_ascii(unicrap):
        # Replace the text of escape codes (e.g. 'xe1' once the backslash is gone)
        # with a plain ASCII stand-in.
        xlate = {'xc3cb3':'o', 'xc3xa7':'c', 'xc3xb5':'o', 'xc3xa3':'a', 'xc3xa9':'e',
            'xc0':'A', 'xc1':'A', 'xc2':'A', 'xc3':'A', 'xc4':'A', 'xc5':'A',
            'xc6':'Ae', 'xc7':'C',
            'xc8':'E', 'xc9':'E', 'xca':'E', 'xcb':'E',
            'xcc':'I', 'xcd':'I', 'xce':'I', 'xcf':'I',
            'xd0':'Th', 'xd1':'N',
            'xd2':'O', 'xd3':'O', 'xd4':'O', 'xd5':'O', 'xd6':'O', 'xd8':'O',
            'xd9':'U', 'xda':'U', 'xdb':'U', 'xdc':'U',
            'xdd':'Y', 'xde':'th', 'xdf':'ss',
            'xe0':'a', 'xe1':'a', 'xe2':'a', 'xe3':'a', 'xe4':'a', 'xe5':'a',
            'xe6':'ae', 'xe7':'c',
            'xe8':'e', 'xe9':'e', 'xea':'e', 'xeb':'e',
            'xec':'i', 'xed':'i', 'xee':'i', 'xef':'i',
            'xf0':'th', 'xf1':'n',
            'xf2':'o', 'xf3':'o', 'xf4':'o', 'xf5':'o', 'xf6':'o', 'xf8':'o',
            'xf9':'u', 'xfa':'u', 'xfb':'u', 'xfc':'u',
            'xfd':'y', 'xfe':'th', 'xff':'y',
            'xa1':'!', 'xa2':'{cent}', 'xa3':'{pound}', 'xa4':'{currency}',
            'xa5':'{yen}', 'xa6':'|', 'xa7':'{section}', 'xa8':'{umlaut}',
            'xa9':'{C}', 'xaa':'{^a}', 'xab':'<<', 'xac':'{not}',
            'xad':'-', 'xae':'{R}', 'xaf':'_', 'xb0':'{degrees}',
            'xb1':'{+/-}', 'xb2':'{^2}', 'xb3':'{^3}', 'xb4':'',
            'xb5':'{micro}', 'xb6':'{paragraph}', 'xb7':'*', 'xb8':'{cedilla}',
            'xb9':'{^1}', 'xba':'{^o}', 'xbb':'>>',
            'xbc':'{1/4}', 'xbd':'{1/2}', 'xbe':'{3/4}', 'xbf':'?',
            'xd7':'*', 'xf7':'/'
        }
        # Strip backslashes, HTML-escaped ampersands and the en-dash code first.
        unicrap = remove_all('\\', unicrap)
        unicrap = remove_all('&amp;', unicrap)
        unicrap = remove_all('u2013', unicrap)

        r = unicrap
        # Longest keys first, so 'xc3xa7' is handled before its prefix 'xc3'.
        for item, valor in sorted(xlate.items(), key=lambda kv: -len(kv[0])):
            r = r.replace(item, valor)
        return r
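As an alternative to a hand-written table, the same accent stripping can be sketched with the standard library's `unicodedata` module. This is a different technique from the answer above (Unicode NFKD normalization followed by dropping combining marks), it assumes Python 3 and real text rather than escape-code text, and the name `strip_accents` is made up for illustration.

```python
import unicodedata


def strip_accents(text):
    """Return text with combining accent marks removed."""
    # NFKD splits a character such as 'á' into 'a' plus a combining acute accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep every character except the combining marks themselves.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("Página Não Encontrada!"))
```

Letters that have no decomposition, such as ß, æ and ø, pass through unchanged, so this does not reproduce every entry in the table above.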

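Maybe some of the answers here will help.

I have already tried those… thanks.

Can you show us the result of print([empresa])? That way we can see exactly what the current encoding is. Is that Python 3?

print(empresa_str): [Ops…P\xe1gina N\xe3o Encontrada!][ANADI Consultoria ERP Totvs][Experfite | Consultoria Microsiga Protheus homologated and certified Totvs-Home][Consultoria Totvs\xae | ALFA Sistemas de Gest\xe3o][Totvs IV2-technology systems: [Consultoria Totvs[CONSULTORIA TOTVS PROTHEUS | Systh]

Can you do what @your suggested above, print([empresa])?

But the variable contains 'P\xe1gina N\xe3o Encontrada!'. What is the syntax when using the variable (empresa_str)?

What happens if you do it the way I edited my answer above?

Same result: P\xe1gina N\xe3o Encontrada!

When printed, this strips the accents and other diacritics entirely instead of showing the original values. Also, don't name a variable str, and you don't need the string module to do a find; Python's str has had a find method for decades, so string.find(str, substr) is just a verbose/slow way of saying str.find(substr).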
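The advice in the last comment could be applied roughly as follows. This is a sketch assuming Python 3; the parameter is renamed to `text` so that the built-in `str` is no longer shadowed, and `str.replace` stands in for the manual find-and-slice loop.

```python
def remove_all(substr, text):
    """Remove occurrences of substr from text until none remain."""
    # Repeating covers the case where deleting one match creates a new one,
    # which the find-and-slice loop also handled. An empty substr is ignored.
    while substr and substr in text:
        text = text.replace(substr, "")
    return text


print(remove_all("\\", "P\\xe1gina N\\xe3o Encontrada!"))
```

When newly formed matches are not a concern, a single `text.replace(substr, "")` call is enough.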