How to encode a Python list

python regex python-2.7 encoding utf-8

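I am having a hard time encoding a Python list. I already handled the encoding of a text file so that I could count specific words in it with the re module.

The code is as follows: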

import codecs
import re
from collections import Counter

# Compile the pattern once: words of 4 to 20 characters, Unicode-aware
unicode_pattern = re.compile(r'\b\w{4,20}\b', re.UNICODE)
word_counts = Counter()  # maps each word to its number of occurrences

# codecs.open decodes the UTF-8 file, so every line is a unicode string
with codecs.open('projectsinline.txt', 'r', encoding="utf-8") as f:
    for line in f:
        # Count the matches on every line, not only on the last one
        word_counts.update(unicode_pattern.findall(line))

# Keep the most repeated words (10 occurrences or more)
Allwords = []
for clave in word_counts:
    if word_counts[clave] >= 10:
        word = clave
        Allwords.append(word)
print Allwords

Part of the output looks like this:

[...u'recursos', u'Partidos', u'Constituci\xf3n', u'veh\xedculos', u'investigaci\xf3n', u'Pol\xedticos']

If I print the variable word, the output looks the way it should. However, when I use append, all the words break again, as in the example above.

I tried this:

[x.encode("utf-8") for x in Allwords]

The output looks exactly the same as before.

I also tried this:

Allwords.append(str(word.encode("utf-8")))

The output changes, but the words still do not look the way they should:

[...'recursos', 'Partidos', 'Constituci\xc3\xb3n', 'veh\xc3\xadculos', 'investigaci\xc3\xb3n', 'Pol\xc3\xadticos']

Some of the answers suggested this:

print('[' + ', '.join(Allwords) + ']')

The output looks like this:

[...recursos, Partidos, ConstituciÃ³n, vehÃculos, investigaciÃ³n, PolÃticos]

To be honest, I do not want to print the list, only to encode it so that all of its items (words) are recognized.

I am looking for something like this:

[...'recursos', 'Partidos', 'Constitución', 'vehículos', 'investigación', 'Políticos']

Any suggestions for solving this problem would be appreciated.

Thanks,

You might want to try

print('[' + ', '.join(Allwords) + ']')

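Your list of Unicode strings is correct. When you print a list, each item is displayed through its repr() function. When you print an item by itself, it is displayed through its str() function. This is only a display choice, much like printing an integer in decimal or in hexadecimal.

So if you want to see the words properly, print the words themselves. For comparisons, the contents are already correct.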
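The difference between the two views can be checked with a small sketch. The word list below is a hypothetical sample taken from the question's output, and the byte string stands in for text read from a UTF-8 file; neither comes from the asker's real data.

```python
from __future__ import print_function

# Hypothetical sample: three of the words shown in the question.
words = [u'recursos', u'Constituci\xf3n', u'veh\xedculos']

# Printing a list shows every item through repr(). In Python 2.7 this
# yields escapes such as u'Constituci\xf3n'.
list_view = repr(words)

# Joining the items builds a single unicode string, so repr() is not used.
joined_view = u'[' + u', '.join(words) + u']'

# The stored text is the same either way: a word decoded from UTF-8 bytes
# compares equal to the item held in the list.
from_file = b'Constituci\xc3\xb3n'.decode('utf-8')
same_word = (from_file == words[1])

print(same_word)
```

On a terminal that can display accented characters, print(joined_view) should show the words with their accents, while print(words) in Python 2.7 shows the escaped form.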

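It is worth noting that Python 3 changed the behavior of repr(): it now displays non-ASCII characters directly, without escape codes, if the terminal supports them. The ascii() function reproduces the Python 2 repr() behavior.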
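A minimal sketch of that difference, assuming Python 3 (ascii() does not exist as a built-in in Python 2.7):

```python
# Assumes Python 3.
word = 'Constituci\xf3n'

# repr() keeps printable non-ASCII characters as they are.
py3_repr = repr(word)

# ascii() escapes them, giving the same escapes as Python 2's repr(),
# only without the u prefix.
py2_style = ascii(word)

print(py2_style)  # 'Constituci\xf3n'
```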

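The words are not broken; they are just displayed in their raw format. list uses __repr__() to get the string value of its elements.

Is there a way to display the words in the list with their accents?

Sort of. You can print("[" + ", ".join(Allwords) + "]")

It changed the output, but I got:

[...recursos, Partidos, ConstituciÃ³n, vehÃculos, investigaciÃ³n, PolÃticos]

Hmm, sorry; I don't know how to help you. Why do you want to print it as a list?

Thanks, @cxad. Using this example, I got [...recursos, Partidos, ConstituciÃ³n, vehÃculos, investigaciÃ³n, PolÃticos]. On the other hand, I do not want to print the list; I only want to encode it so that all the items are recognized.

Thanks, @MarkTolonen! Sometimes it is hard to explain what the main problem is. I am still learning, so it is hard to understand all the concepts. The problem in the code shows up when I compare each item in one list with the items in another list. If I compare the item social, everything works, but when I compare ConstituciÃ³n, the code does not recognize the item and skips it.

The problem was in the text file. I changed its format from UTF-8 to ANSI. Now the items not only display correctly, but each one is also recognized.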
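One plausible reading of this thread is that UTF-8 bytes were being interpreted as an "ANSI" code page somewhere along the way. The sketch below assumes that "ANSI" means Windows-1252 (cp1252), which the thread does not confirm, and shows how such a mismatch produces ConstituciÃ³n and makes the comparison fail.

```python
from __future__ import print_function

expected = u'Constituci\xf3n'

# The word as it is stored in a UTF-8 file.
utf8_bytes = expected.encode('utf-8')

# Decoding with the matching codec restores the original word.
right = utf8_bytes.decode('utf-8')

# Decoding the same bytes as cp1252 (assumed meaning of "ANSI") turns the
# two bytes of the accented letter into two separate characters.
wrong = utf8_bytes.decode('cp1252')

print(right == expected)  # True
print(wrong == expected)  # False

# The damage is reversible as long as the wrong codec is known.
repaired = wrong.encode('cp1252').decode('utf-8')
print(repaired == expected)  # True
```

If this is what happened, making the file's encoding and the decoding step agree (either both UTF-8 or both cp1252) would make the comparison succeed, which matches the asker's report that changing the file format fixed it.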