Python: removing unsupported Unicode characters using a list comprehension

python web-scraping unicode

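I am trying to write an algorithm that removes non-ASCII characters from a list of text strings. I build the list by scraping paragraphs from a web page and appending them to a list. To do this, I wrote a nested for loop that iterates over each element of the list of strings and then over the characters of each string. Here is an example of the kind of list I'm working with: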

text = ['The giant panda (Ailuropoda melanoleuca; Chinese: 大熊猫; pinyin: dàxióngmāo),[5] also known as the panda bear or simply the panda, is a bear[6] native to South Central China',
        'It is characterised by large, black patches around its eyes, over the ears, and across its round body']

My final operation then replaces a character if its ord() value is greater than 128. Like this:

def remove_utf_chars(text_list):
    """
    :param text_list: a list of lines of text for each element in the list
    :return: a list of lines of text without utf characters
    """

    for i in range(len(text_list)):
        # for each string in the text list
        for char in text_list[i]:
            # for each character in the individual string
            if ord(char) > 128:
                text_list[i] = text_list[i].replace(char, '')

    return text_list

This works fine as a nested for loop. But because I want to scale this up, I thought I should write it as a list comprehension. Like this:

def remove_utf_chars(text_list):
    """
    :param text_list: a list of lines of text for each element in the list
    :return: a list of lines of text without utf characters
    """

    scrubbed_text = [text_list[i].replace(char, '') for i in range(len(text_list))
                     for char in text_list[i] if ord(char) > 128]

    return scrubbed_text

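But for some reason this doesn't work. At first I thought it might have to do with the method I was using in the expression to remove the Unicode characters, since text_list is a list and text_list[i] is a string. So I changed the method from .strip() to .replace(). That didn't help. Then I thought it might be where I had placed the .replace(), so I moved it around inside the comprehension, with no change. So I'm at a loss. I suspect the problem lies in converting this particular nested for loop, with its Unicode filtering, into a comprehension. After all, not every for loop can be written as a list comprehension, but every list comprehension can be written as a for loop.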

You need either an outer loop or a second comprehension to walk the list, and then process the characters of each string in the inner one:

def remove_utf_chars(text_list):
    scrubbed_text = ["".join([y for y in x if ord(y) < 128]) for x in text_list]
    return scrubbed_text
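To see why the comprehension in the question misbehaves, it may help to run both forms side by side. The sketch below uses a shortened sample list, and the names `flat` and `nested` are made up for illustration. In the question's version both `for` clauses feed a single flat output list, so it emits one item per non-ASCII character, each with only that one character removed, and strings that are already pure ASCII produce no item at all.

```python
# Shortened sample data (an assumption for illustration, not the scraped page).
text = ['Chinese: 大熊猫; also known as the panda bear', 'round body']

# The question's comprehension: one output item per non-ASCII character found.
flat = [text[i].replace(char, '') for i in range(len(text))
        for char in text[i] if ord(char) > 128]

# Nested form: the inner generator filters characters, the outer list walks strings.
nested = [''.join(c for c in s if ord(c) < 128) for s in text]

print(len(flat))   # 3 items, one per Chinese character in the first string
print(flat[0])     # only the first Chinese character has been removed
print(nested)      # one cleaned string per input string
```

Note also that ASCII covers code points 0 through 127, so the `ord(char) > 128` test in the question would let U+0080 through, whereas `ord(y) < 128` keeps exactly the ASCII range.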

There is an easier way to remove non-ASCII characters: encode the string as ASCII and specify errors='ignore' to drop them. For example:

text = ['The giant panda (Ailuropoda melanoleuca; Chinese: 大熊猫; pinyin: dàxióngmāo),[5] also known as the panda bear or simply the panda, is a bear[6] native to South Central China',
        'It is characterised by large, black patches around its eyes, over the ears, and across its round body']

>>> text[0].encode('ascii', errors='ignore')
b'The giant panda (Ailuropoda melanoleuca; Chinese: ; pinyin: dxingmo),[5] also known as the panda bear or simply the panda, is a bear[6] native to South Central China'

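This gives you a byte string, i.e. the result is of type bytes. You can use decode() to convert it back to a Python string. You could be pedantic and specify .decode('ascii'), but the default codec (UTF-8 in Python 3) already covers this.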
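The round trip described above can be sketched as follows. The sample string and the names `raw` and `clean` are illustrative assumptions, not part of the original answer.

```python
s = 'Chinese: 大熊猫; bear'

# Encoding with errors='ignore' silently drops anything outside ASCII.
raw = s.encode('ascii', errors='ignore')

# Decoding turns the bytes back into a str; 'ascii' is explicit here,
# but a bare .decode() (UTF-8 by default) gives the same result.
clean = raw.decode('ascii')

print(type(raw).__name__, type(clean).__name__)   # bytes str
print(clean)                                      # Chinese: ; bear
```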

To do this as a list comprehension:

def remove_non_ascii_chars(text_list):
    return [s.encode('ascii', errors='ignore').decode('ascii') for s in text_list]

>>> remove_non_ascii_chars(text)
['The giant panda (Ailuropoda melanoleuca; Chinese: ; pinyin: dxingmo),[5] also known as the panda bear or simply the panda, is a bear[6] native to South Central China', 'It is characterised by large, black patches around its eyes, over the ears, and across its round body']

You could also write the function to return a generator, which in many cases scales better, depending on how the strings are consumed in later code:

def remove_non_ascii_chars(text_list):
    return (s.encode('ascii', errors='ignore').decode('ascii') for s in text_list)
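As a rough illustration of what "depending on how the strings are consumed" means, the sketch below consumes the generator lazily. The sample list and the names `cleaned`, `first`, `rest` and `leftover` are assumptions for the example. The main thing to watch is that a generator can be iterated only once.

```python
def remove_non_ascii_chars(text_list):
    # Generator expression: no string is processed until the caller iterates.
    return (s.encode('ascii', errors='ignore').decode('ascii') for s in text_list)

lines = ['panda 大熊猫', 'round body']
cleaned = remove_non_ascii_chars(lines)

first = next(cleaned)      # processes only the first string
rest = list(cleaned)       # processes the remaining strings
leftover = list(cleaned)   # the generator is exhausted, so this is empty

print(repr(first), rest, leftover)
```

If the cleaned strings are needed more than once, or need to be indexed, the list-returning version is the simpler choice.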

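Generally, a list comprehension is not faster than a nested loop. The second solution, even if it were correct, would not be more scalable than the first. Consider using regex substitution.

@Dyz: a list comprehension is not necessarily slower than a nested loop either, so that's no reason to avoid one (if that's what you're implying). A list comprehension is more concise and "Pythonic", so it is preferable in many cases (performance concerns aside). As for performance, a generator expression may be more efficient/scalable, depending on how the result is used.

@mhawke Your encode/decode solution is actually much faster, by an order of magnitude.

@mhawke It always pays to measure if you care about speed.

@mhawke I didn't say list comprehensions are slow. I said they are not faster. Generator expressions are usually slower than list comprehensions.

Thanks, mhawke. Yes, I had tried using .encode() and .decode() before, but I made the mistake of calling those methods on my soup object before processing it.

@pancham2016: So does this solution work for you?

Yes, it works. I'm not entirely sure what my original mistake was, but it had to do with the processing step at which I decided to use decode and encode.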
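The comments suggest regex substitution and recommend measuring rather than guessing. A minimal sketch of both ideas follows. The function names, the sample data and the repeat counts are all assumptions for illustration, and the timings will vary by machine and Python version, so no numbers are claimed here.

```python
import re
import timeit

# Matches any run of characters outside the ASCII range (code points 0-127).
NON_ASCII = re.compile(r'[^\x00-\x7F]+')

def strip_with_regex(text_list):
    return [NON_ASCII.sub('', s) for s in text_list]

def strip_with_codec(text_list):
    return [s.encode('ascii', errors='ignore').decode('ascii') for s in text_list]

def strip_with_join(text_list):
    return [''.join(c for c in s if ord(c) < 128) for s in text_list]

# Hypothetical workload: the same short line repeated many times.
sample = ['Chinese: 大熊猫; also known as the panda bear'] * 1000

for fn in (strip_with_regex, strip_with_codec, strip_with_join):
    # All three approaches should agree before their speed is compared.
    assert fn(sample) == strip_with_codec(sample)
    seconds = timeit.timeit(lambda: fn(sample), number=20)
    print(f'{fn.__name__}: {seconds:.4f} s')
```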