Python: removing unsupported Unicode characters with a list comprehension

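I'm trying to write an algorithm that removes non-ASCII characters from a list of text strings. I built the list by scraping paragraphs from a web page and appending them to a list. To do this, I wrote a nested for loop that iterates over each element of the list of strings and then over the characters of each string. A sample of the list I'm using: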

text = ['The giant panda (Ailuropoda melanoleuca; Chinese: 大熊猫; pinyin: dàxióngmāo),[5] also known as the panda bear or simply the panda, is a bear[6] native to South Central China',
        'It is characterised by large, black patches around its eyes, over the ears, and across its round body']

Then my final step is to replace a character if its ord() value is greater than 128. Like this:

def remove_utf_chars(text_list):
    """
    :param text_list: a list of lines of text for each element in the list
    :return: a list of lines of text without utf characters
    """

    for i in range(len(text_list)):
        # for each string in the text list
        for char in text_list[i]:
            # for each character in the individual string
            if ord(char) > 128:
                text_list[i] = text_list[i].replace(char, '')

    return text_list

This works fine as a nested for loop. But because I want to scale this up, I thought I'd write it as a list comprehension. Like this:

def remove_utf_chars(text_list):
    """
    :param text_list: a list of lines of text for each element in the list
    :return: a list of lines of text without utf characters
    """

    scrubbed_text = [text_list[i].replace(char, '') for i in range(len(text_list))
                     for char in text_list[i] if ord(char) > 128]

    return scrubbed_text


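But for some reason this doesn't work. At first I thought it might have to do with the method I use in the expression to remove the Unicode characters, since text_list is a list while text_list[i] is a string. So I changed the method from .strip() to .replace(). That didn't help. Then I thought it might be where I placed the .replace(), so I moved it around within the comprehension, with no change. So I'm at a loss. I suspect the problem lies in converting this particular case, a nested for loop that filters Unicode characters, into a comprehension, because not every for loop can be written as a list comprehension, although every list comprehension can be written as a for loop.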

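You need either an outer loop or a second comprehension to iterate over the list, and then process the characters of each string in the inner loop: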

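def remove_utf_chars(text_list):
    # Outer comprehension walks the list; inner one keeps only ASCII characters.
    scrubbed_text = ["".join([y for y in x if ord(y) < 128]) for x in text_list]
    return scrubbed_text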
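
To make the difference concrete, here is a minimal sketch that runs the comprehension from the question next to a nested one. The names `broken`, `fixed`, and `sample` are made up for this illustration. The flattened comprehension emits one list element per non-ASCII character (each with only that character removed) and emits nothing for strings that are already pure ASCII. Note also that ASCII covers code points 0 to 127, so `ord(ch) < 128` is the test that keeps exactly the ASCII characters.

```python
def broken(text_list):
    # The comprehension from the question: one output per non-ASCII character.
    return [text_list[i].replace(char, '') for i in range(len(text_list))
            for char in text_list[i] if ord(char) > 128]


def fixed(text_list):
    # One output per input string: filter the characters, then join them back.
    return [''.join(ch for ch in s if ord(ch) < 128) for s in text_list]


sample = ['h\u00e9llo w\u00f6rld', 'plain ascii']  # 'héllo wörld', 'plain ascii'

print(broken(sample))  # ['hllo wörld', 'héllo wrld']
print(fixed(sample))   # ['hllo wrld', 'plain ascii']
```

The second string disappears from the `broken` result because the filter clause never matches any of its characters, so the comprehension yields nothing for it.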

There is a simpler way to remove non-ASCII characters: encode the string as ASCII and specify errors='ignore' so that they are dropped. For example:

text = ['The giant panda (Ailuropoda melanoleuca; Chinese: 大熊猫; pinyin: dàxióngmāo),[5] also known as the panda bear or simply the panda, is a bear[6] native to South Central China',
        'It is characterised by large, black patches around its eyes, over the ears, and across its round body']

>>> text[0].encode('ascii', errors='ignore')
b'The giant panda (Ailuropoda melanoleuca; Chinese: ; pinyin: dxingmo),[5] also known as the panda bear or simply the panda, is a bear[6] native to South Central China'
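
This gives you a byte string, i.e. the result is of type bytes. You can use decode() to convert it back to a Python string, for example text[0].encode('ascii', errors='ignore').decode().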

You could be pedantic and specify .decode('ascii'), but the default codec (UTF-8 in Python 3, which is a superset of ASCII) already covers this.
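
As a small sketch of the round trip described above, the following shows the intermediate bytes value and the final str. The variable names `s`, `raw`, and `clean` are only for illustration, and the sample is a shortened piece of the string from the question.

```python
s = 'pinyin: d\u00e0xi\u00f3ngm\u0101o'  # 'pinyin: dàxióngmāo'

# Encoding with errors='ignore' silently drops anything ASCII cannot represent.
raw = s.encode('ascii', errors='ignore')
print(type(raw).__name__, raw)      # bytes b'pinyin: dxingmo'

# Decoding turns the bytes back into an ordinary Python string.
clean = raw.decode('ascii')
print(type(clean).__name__, clean)  # str pinyin: dxingmo
```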

To do this as a list comprehension:

def remove_non_ascii_chars(text_list):
    return [s.encode('ascii', errors='ignore').decode('ascii') for s in text_list]

>>> remove_non_ascii_chars(text)
['The giant panda (Ailuropoda melanoleuca; Chinese: ; pinyin: dxingmo),[5] also known as the panda bear or simply the panda, is a bear[6] native to South Central China', 'It is characterised by large, black patches around its eyes, over the ears, and across its round body']

You could also write the function to return a generator, which in many cases scales better, depending on how the strings are used in subsequent code:

def remove_non_ascii_chars(text_list):
    return (s.encode('ascii', errors='ignore').decode('ascii') for s in text_list)
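
A short usage sketch for the generator version, with made-up input strings: values are produced only when requested, and the generator can be consumed only once, so wrap it in list() if the results are needed more than one time.

```python
def remove_non_ascii_chars(text_list):
    # Generator expression: nothing is encoded until a value is requested.
    return (s.encode('ascii', errors='ignore').decode('ascii') for s in text_list)


gen = remove_non_ascii_chars(['caf\u00e9', 'na\u00efve'])  # 'café', 'naïve'

first = next(gen)   # processes only the first string
rest = list(gen)    # consumes whatever is left
again = list(gen)   # the generator is now exhausted

print(first)  # caf
print(rest)   # ['nave']
print(again)  # []
```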

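Generally, list comprehensions are not faster than nested loops. Your second solution, even if it were correct, would be no more scalable than the first. Consider using a regular expression substitution.

@Dyz: List comprehensions are not necessarily slower than nested loops either, so that is not a reason to avoid one (if that's what you are implying). A list comprehension is more concise and "Pythonic", and is therefore preferable in many cases (performance concerns aside). As for performance, a generator expression may be more efficient and scalable, depending on how the result is used.

@mhawke Your encode/decode solution is actually much faster, by an order of magnitude.

@mhawke It always pays to measure if you care about speed.

@mhawke I didn't say list comprehensions are slow. I said they are not faster. Generator expressions are usually slower than list comprehensions.

Thanks, mhawke. Yes, I had tried using .encode() and .decode() before, but I made the mistake of calling those methods on my soup object before processing it.

@pancham2016: So does this solution work for you?

Yes, it works. I'm not entirely sure what my original mistake was, but it had to do with the processing step at which I decided to use decode and encode.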
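
The comments above mention a regular expression substitution and recommend measuring. A rough sketch of both follows, under the assumption that a simple timeit loop on one synthetic string is representative enough; the helper names (`strip_join`, `strip_encode`, `strip_regex`) and the sample are invented here, and the timings will vary by machine and Python version, so none are quoted.

```python
import re
import timeit

# Matches any character outside the ASCII range 0x00-0x7F.
NON_ASCII = re.compile(r'[^\x00-\x7f]')


def strip_join(s):
    return ''.join(ch for ch in s if ord(ch) < 128)


def strip_encode(s):
    return s.encode('ascii', errors='ignore').decode('ascii')


def strip_regex(s):
    return NON_ASCII.sub('', s)


# Synthetic input: a fragment of the question's string, repeated.
sample = 'Chinese: \u5927\u718a\u732b; pinyin: d\u00e0xi\u00f3ngm\u0101o ' * 50

# All three approaches should agree before their timings are compared.
assert strip_join(sample) == strip_encode(sample) == strip_regex(sample)

for fn in (strip_join, strip_encode, strip_regex):
    seconds = timeit.timeit(lambda: fn(sample), number=2000)
    print(f'{fn.__name__}: {seconds:.4f}s')
```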