Unable to replace \xe2\x80\xa6\n in a string using regular expressions in Python
I have the following string:
data = "pizza won't divorce you pizza won't betray you pizza won't cheat on you pizza won't fight with you why don't people just \xe2\x80\xa6\n"
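I want to find all expressions of the form \[a-z][a-z][0-9] in it (such as the \xe2\x80\xa6 at the end of the data string) so that I can replace them. I tried the following code: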
re.findall(r"\\[a-z][a-z][0-9]\\+", data)
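But it returns an empty list. Please help.
If you need to match these sequences literally, you must define the string as a raw string; otherwise Python converts the \x escape sequences when it parses the literal: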
import re
data = r"pizza won't divorce you pizza won't betray you pizza won't cheat on you pizza won't fight with you why don't people just \xe2\x80\xa6\n"
print(re.findall(r"\\[a-z][a-z]?[0-9]+", data))
Output: ['\\xe2', '\\x80', '\\xa6']
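To see why the original attempt returned an empty list, it may help to compare the two kinds of literal directly. The sketch below is illustrative only; the shortened strings `normal` and `raw` are made up for this example. In a normal literal, `\xe2` becomes a single character, so no backslash is left for the pattern to match.

```python
import re

# Hypothetical shortened versions of the question's data string.
normal = "just \xe2\x80\xa6\n"   # escapes are interpreted by Python
raw = r"just \xe2\x80\xa6\n"     # backslashes are kept literally

pattern = r"\\[a-z][a-z]?[0-9]+"

print(len(normal))                  # 9: each \xNN and the \n is one character
print(len(raw))                     # 19: every backslash sequence is spelled out
print(re.findall(pattern, normal))  # []
print(re.findall(pattern, raw))     # ['\\xe2', '\\x80', '\\xa6']
```

The trailing `\n` in the raw string is not matched because the pattern requires at least one digit.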
Another solution:
print(re.findall(r"\\[a-z]{1,2}\d{1,2}", data))
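Since the stated goal is to replace the matches, the same pattern could be passed to `re.sub`. This is a minimal sketch, assuming the data really does contain literal backslashes; the shortened string `raw` and the stricter `hex_pattern` are assumptions added for illustration and are not part of the answers above.

```python
import re

# Hypothetical shortened version of the raw data string.
raw = r"why don't people just \xe2\x80\xa6\n"

# Pattern taken from the answer above.
answer_pattern = re.compile(r"\\[a-z]{1,2}\d{1,2}")
# Assumed alternative: match exactly a backslash, 'x', and two hex digits.
hex_pattern = re.compile(r"\\x[0-9a-fA-F]{2}")

cleaned = answer_pattern.sub("", raw)
cleaned_hex = hex_pattern.sub("", raw)
print(cleaned)      # the literal backslash-n at the end is left in place
print(cleaned_hex)

# The answer's pattern needs a trailing digit, so it misses escapes like \xff.
print(answer_pattern.findall(r"\xff"))  # []
print(hex_pattern.findall(r"\xff"))     # ['\\xff']
```

The stricter pattern is one possible design choice: it avoids depending on where letters and digits happen to fall in the hex value.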
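To process text, use Unicode strings. The bytestring b"\xe2\x80\xa6" is the UTF-8 encoding of the U+2026 HORIZONTAL ELLIPSIS character (…).
To replace it in a Unicode string text: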
no_ellipsis = text.replace(u"\u2026", "")
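A minimal Python 3 sketch of that approach, assuming the data arrives as UTF-8 encoded bytes. The variable `raw_bytes` and the shortened sample are hypothetical; `text` and `no_ellipsis` follow the names used above.

```python
# Hypothetical shortened sample, given as UTF-8 encoded bytes.
raw_bytes = b"why don't people just \xe2\x80\xa6\n"

# Decode once at the boundary, then work with text only.
text = raw_bytes.decode("utf-8")
print("\u2026" in text)  # True: the three bytes became one ellipsis character

# Plain string replacement is enough; no regular expression is needed.
no_ellipsis = text.replace("\u2026", "")
print(repr(no_ellipsis))
```

Decoding first means the ellipsis is a single character, so the question of matching `\x..` sequences does not arise.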
There is a library called ftfy that helps fix Unicode problems. It saved me time and is worth a try.
Your example:
data = "pizza won't divorce you pizza won't betray you pizza won't cheat on you pizza won't fight with you why don't people just \xe2\x80\xa6\n"
import ftfy
print(ftfy.fix_text(data))
output -->
"pizza won't divorce you pizza won't betray you pizza won't cheat on you pizza won't fight with you why don't people just …"
Note that \xe2\x80\xa6 has been replaced with …
--Other examples--
Example 1
import ftfy
print(ftfy.fix_text('ünicode'))
output -->
ünicode
Example 2
import ftfy
print(ftfy.fix_text('\xe2\x80\xa2'))
output -->
•
Example 3
import ftfy
print(ftfy.fix_text(u'\u2026'))
output -->
…
Can you add what your goal is? What exactly do you want? I guess the "\xe2\x80\xa6" in the string is Unicode, i.e. …
If the data accidentally contains the character escapes used in Python bytestring literals, the data should be fixed upstream and converted to Unicode text first: r"\xe2\x80\xa6".decode('string-escape').decode('utf-8')
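The 'string-escape' codec used in that comment exists only in Python 2. A rough Python 3 equivalent is sketched below, assuming the escaped text is pure ASCII; the variable names and the sample string are made up for illustration.

```python
# Hypothetical sample: text that contains literal backslash escapes.
escaped = r"just \xe2\x80\xa6"

# Step 1: interpret the \xNN escapes; each becomes a code point below 256.
interpreted = escaped.encode("ascii").decode("unicode_escape")

# Step 2: turn those code points back into the raw bytes they stand for.
raw_bytes = interpreted.encode("latin-1")

# Step 3: decode the bytes as UTF-8 to get the intended text.
text = raw_bytes.decode("utf-8")
print(text)
```

This only works when every escape denotes a byte of valid UTF-8; fixing the data upstream, as the comment suggests, remains the cleaner option.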
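@J.F.Sebastian I guess the OP just wants to get a list of the \x escapes; he doesn't want to convert them or anything.
The OP may not understand the difference between bytes, Unicode strings, and their textual representations in Python source code, but the answer should at least acknowledge that there are more sensible ways to process text in Python.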