
Unable to replace \xe2\x80\xa6\n in a string using a regular expression in Python

I have the following string:

data = "pizza won't divorce you pizza won't betray you pizza won't cheat on you pizza won't fight with you  why don't people just \xe2\x80\xa6\n"
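I want to find all expressions of the form \[a-z][a-z][0-9]\ in it (such as the \xe2\x80\xa6 at the end of the data string) so that I can replace them. I tried the following code:

re.findall(r"\\[a-z][a-z][0-9]\\+", data)

But it returns an empty list. Please help.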

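If you want to match these sequences, you have to define the string as a raw string. Otherwise Python interprets \xe2, \x80, \xa6 and \n as escape sequences, so the resulting string contains no backslashes for the pattern to match: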

data = r"pizza won't divorce you pizza won't betray you pizza won't cheat on you pizza won't fight with you  why don't people just \xe2\x80\xa6\n"

import re
print(re.findall(r"\\[a-z][a-z]?[0-9]+", data))
Output:
['\\xe2', '\\x80', '\\xa6']

Another solution:

print(re.findall(r"\\[a-z]{1,2}\d{1,2}", data))
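
The difference between the two kinds of literal can be checked directly. The sketch below is written for Python 3 and uses a shortened sentence and hypothetical variable names (`plain`, `raw`); it only illustrates why the original call found nothing.

```python
import re

pattern = r"\\[a-z][a-z]?[0-9]+"

# Ordinary literal: \xe2, \x80, \xa6 and \n are interpreted as escapes,
# so the resulting string contains no backslash characters at all.
plain = "why don't people just \xe2\x80\xa6\n"

# Raw literal: every backslash is kept as a literal character.
raw = r"why don't people just \xe2\x80\xa6\n"

plain_matches = re.findall(pattern, plain)
raw_matches = re.findall(pattern, raw)

print("\\" in plain, plain_matches)
print("\\" in raw, raw_matches)
```

Only the raw literal gives the pattern any backslashes to match, which is consistent with the empty list reported in the question.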

To work with text, you should use Unicode strings. The bytestring b"\xe2\x80\xa6" is the UTF-8 encoding of U+2026 HORIZONTAL ELLIPSIS (…).
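
A minimal sketch of that decoding step, assuming the data is available as bytes; the variable names are illustrative only.

```python
import unicodedata

raw_bytes = b"\xe2\x80\xa6"        # three bytes
text = raw_bytes.decode("utf-8")   # a single Unicode character

print(len(raw_bytes), len(text))
print(unicodedata.name(text))
```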

To replace it:

no_ellipsis = text.replace(u"\u2026", "")
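
Applied to the string from the question, this could look as follows. The sketch assumes the data arrives as UTF-8 encoded bytes (hence the hypothetical `data_bytes` name and the `b` prefix), shortens the sentence, and also assumes trailing whitespace should be dropped.

```python
# Assumption: the input is a bytestring holding UTF-8 encoded text.
data_bytes = b"pizza won't fight with you  why don't people just \xe2\x80\xa6\n"

# Decode once at the boundary, then work with Unicode text only.
text = data_bytes.decode("utf-8")

# Remove the ellipsis character and any trailing whitespace or newline.
no_ellipsis = text.replace("\u2026", "").rstrip()
print(no_ellipsis)
```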

There is a library called ftfy that can help with Unicode problems. It has saved me time and is worth a try.

Your example:

data = "pizza won't divorce you pizza won't betray you pizza won't cheat on you pizza won't fight with you  why don't people just \xe2\x80\xa6\n"

import ftfy
print(ftfy.fix_text(data))

output -->
"pizza won't divorce you pizza won't betray you pizza won't cheat on you pizza won't fight with you  why don't people just …"
Here \xe2\x80\xa6 has been replaced with …
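
For this particular string, the kind of repair shown above can be approximated with the standard library, assuming the text is UTF-8 that was mistakenly decoded as Latin-1. This is a Python 3 sketch of the idea with a shortened sentence, not a description of how ftfy works internally; ftfy detects and handles many more cases.

```python
# In Python 3 this literal is text whose last characters before the newline
# are U+00E2, U+0080, U+00A6: the UTF-8 bytes of the ellipsis, each one
# mistaken for a Latin-1 character.
data = "why don't people just \xe2\x80\xa6\n"

# Recover the original bytes, then decode them with the intended encoding.
fixed = data.encode("latin-1").decode("utf-8")
print(fixed)
```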

-- Other examples --

Example 1

import ftfy
print(ftfy.fix_text('ünicode'))

output -->
ünicode
Example 2

import ftfy
print(ftfy.fix_text('\xe2\x80\xa2'))

output -->
•
Example 3

import ftfy
print(ftfy.fix_text(u'\u2026'))

output -->
…

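Can you add what your goal is? What exactly do you want?

I guess the "\xe2\x80\xa6" in the string is the Unicode character … (U+2026). If the data accidentally contains the character escapes used in Python bytestring literals, then the data should be fixed upstream and converted to Unicode text first: r"\xe2\x80\xa6".decode('string-escape').decode('utf-8') (Python 2).

@J.F.Sebastian I guess the OP just wants to get a list of the \x sequences; he doesn't want to convert them or anything.

The OP may not understand the difference between bytes, Unicode strings, and their textual representations in Python source code, but an answer should at least acknowledge that there are saner ways to process text in Python.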
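
The 'string-escape' codec mentioned in the comment above exists only in Python 2. A rough Python 3 equivalent, assuming the data really contains literal backslash escapes and otherwise only Latin-1 characters, might look like this (variable names are illustrative):

```python
# Text that literally contains backslash, 'x', and two hex digits, three times.
escaped = r"\xe2\x80\xa6"

# Interpret the escapes, then map the resulting characters back to bytes.
raw_bytes = escaped.encode("latin-1").decode("unicode_escape").encode("latin-1")

# Decode the recovered bytes as UTF-8 to obtain the intended text.
text = raw_bytes.decode("utf-8")
print(text)
```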