Python custom non-ASCII character flagger

python regex unicode

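I have looked around for a custom solution, but I couldn't find one that addresses my use case.

Use case

I'm building a "website" QA test in which a script goes through a large batch of HTML documents and identifies any rogue characters. I cannot use a pure non-ASCII approach, because the HTML documents contain characters such as ">" and other minor characters. So I am building a unicode rainbow dictionary that identifies some of the common non-ASCII characters my team and I see frequently. Here is my Python code: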

# -*- coding: utf-8 -*-

import re

unicode_rainbow_dictionary = {
    u'\u00A0':' ',
    u'\uFB01':'fi',
}

strings = ["This contains the annoying non-breaking space","This is fine!","This is not ﬁne!"]

for string in strings:
    for regex in unicode_rainbow_dictionary:
        result = re.search(regex,string)
        if result:
            print "Epic fail! There is a rogue character in '"+string+"'"
        else:
            print string

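The problem is that the last string in the strings array contains a non-ASCII ligature character (the combined "fi"). When I run this script, it does not catch the ligature character, although it does catch the non-breaking space character in the first string.

What is causing the false positive?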

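Your code doesn't work as expected because, in your "strings" variable, you have unicode characters inside non-unicode strings. You forgot to put the "u" prefix in front of them to signal that they should be treated as unicode strings. When you search for a unicode string inside a non-unicode (byte) string, it doesn't work as expected.

If you change this to: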

strings = [u"This contains the annoying non-breaking space",u"This is fine!",u"This is not ﬁne!"]

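it works as expected.

Solving unicode headaches like this is a major benefit of Python 3.

Here is an alternative approach to your problem. How about just trying to encode the string as ASCII and catching the error if that fails?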

def is_this_ascii(s):
    try:
        ignore = unicode(s).encode("ascii")
        return True
    except (UnicodeEncodeError, UnicodeDecodeError):
        return False

strings = [u"This contains the annoying non-breaking space",u"This is fine!",u"This is not ﬁne!"]

for s in strings:
    print(repr(is_this_ascii(s)))

##False
##True
##False
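The encoding check above flags every non-ASCII character, which is exactly what the asker wants to avoid for characters that are used on purpose. One possible variation, sketched here for Python 3, skips an allowlist and reports where each remaining character occurs. The names `ALLOWED` and `find_rogue_characters`, and the choice of U+203A as an accepted character, are assumptions made for illustration and do not come from the thread.

```python
import unicodedata

# Hypothetical allowlist: non-ASCII characters the site uses on purpose.
ALLOWED = {"\u203a"}  # single right-pointing angle quotation mark (assumed example)


def find_rogue_characters(text, allowed=ALLOWED):
    """Return (index, character, Unicode name) for each unexpected non-ASCII character."""
    rogue = []
    for index, char in enumerate(text):
        # Code points above 127 are outside ASCII; skip the ones we accept.
        if ord(char) > 127 and char not in allowed:
            rogue.append((index, char, unicodedata.name(char, "UNKNOWN")))
    return rogue


samples = [
    "This contains the\u00a0annoying non-breaking space",
    "This is fine!",
    "Learn more \u203a",
    "This is not \ufb01ne!",
]

for sample in samples:
    print(repr(sample), find_rogue_characters(sample))
```

Reporting the Unicode name alongside the index makes invisible characters such as the non-breaking space easier to spot in a QA log.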

If there is any possibility, switch to Python 3 as soon as you can! Python 2 is not good at handling unicode, whereas Python 3 handles it natively.

for string in strings:
    for character in unicode_rainbow_dictionary:
        if character in string:
            print("Rogue character '" + character + "' in '" + string + "'")

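I couldn't get the non-breaking space into my test string. I worked around that by using

"This contains the annoying" + chr(160) + "non-breaking space"

, after which it matched.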
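If the regex approach from the question is preferred over a plain `in` test, the dictionary keys could be combined into a single character class so that each string is scanned once. The following is only a sketch for Python 3, and it assumes every key is a single character. `ROGUE_PATTERN` and `report_rogue_characters` are hypothetical names introduced here.

```python
import re

unicode_rainbow_dictionary = {
    "\u00a0": " ",   # non-breaking space
    "\ufb01": "fi",  # "fi" ligature
}

# One character class covering every key of the dictionary.
ROGUE_PATTERN = re.compile(
    "[" + "".join(re.escape(char) for char in unicode_rainbow_dictionary) + "]"
)


def report_rogue_characters(text):
    """Return (position, character, suggested replacement) for each match."""
    return [
        (match.start(), match.group(), unicode_rainbow_dictionary[match.group()])
        for match in ROGUE_PATTERN.finditer(text)
    ]


for string in [
    "This contains the\u00a0annoying non-breaking space",
    "This is fine!",
    "This is not \ufb01ne!",
]:
    print(repr(string), report_rogue_characters(string))
```

Using `finditer` rather than `search` reports every occurrence instead of stopping at the first one.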

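Use Unicode strings for all text, as @jgfoot points out. The easiest way to do that is to use

from __future__

to make string literals default to Unicode. Additionally, using

print

as a function makes the code Python 2/3 compatible: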

# -*- coding: utf-8 -*-
from __future__ import unicode_literals,print_function
import re

unicode_rainbow_dictionary = {
    '\u00A0':' ',
    '\uFB01':'fi',
}

strings = ["This contains the\xa0annoying non-breaking space","This is fine!","This is not ﬁne!"]

for string in strings:
    for regex in unicode_rainbow_dictionary:
        result = re.search(regex,string)
        if result:
            print("Epic fail! There is a rogue character in '"+string+"'")
        else:
            print(string)
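The dictionary in the question maps each rogue character to a replacement, but the loop only uses the keys. If the QA script should also propose a cleaned-up version of the text, a translation table is one option. This is a minimal sketch that assumes Python 3, where `str.maketrans` accepts a dictionary with single-character keys. `REPLACEMENTS` and `clean` are hypothetical names.

```python
unicode_rainbow_dictionary = {
    "\u00a0": " ",   # non-breaking space -> regular space
    "\ufb01": "fi",  # "fi" ligature -> the two letters "f" and "i"
}

# Build the translation table once; keys must be single characters.
REPLACEMENTS = str.maketrans(unicode_rainbow_dictionary)


def clean(text):
    """Return text with every known rogue character replaced."""
    return text.translate(REPLACEMENTS)


for string in [
    "This contains the\u00a0annoying non-breaking space",
    "This is fine!",
    "This is not \ufb01ne!",
]:
    cleaned = clean(string)
    if cleaned != string:
        print("Rogue character found; suggested fix: " + cleaned)
    else:
        print(string)
```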

Why not use

from string import ascii_letters

and say

if letter not in ascii_letters

?

@ATLUS I cannot use a pure non-ASCII approach, because the HTML documents contain characters such as ">" and other minor characters. For example, something like "Learn more >".

Why not build a string of the characters you do not want to include? If you print

ascii_letters

, you actually get

abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ

, so why not add your own string, such as

abcdefghijklmnopqrstuvxyzabcdefghijklmnopqrstuvxyz'/

etc. Can you include the contents of

unicode_rainbow_list

?

@ATLUS I don't see how that would make a difference, since on our end it is easier to compile a list of the elements we definitely don't want than to identify which characters…

I'll see if I can switch to Python 3, but we have been using Python 2 because of legacy issues... yay, legacy. The solution you offered seems very straightforward, though.