Python Latent Dirichlet Allocation stopped_tokens error

python unicode

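My code is based on the following code:

I can run my program on a small number of files, but once I use a larger set of about 1000 files, I get the following warning:

ReadWrite.py:59: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  stopped_tokens = [i for i in tokens if not i in en_stop]

Has anyone run into this before, or does anyone know how to fix it?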

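It looks like the list comprehension is comparing values of different types. en_stop contains unicode strings, while the tokens you read from your files are probably byte strings encoded as utf-8, cp1251, or something similar. First try to determine which encoding the tokens use. You can do that like this: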

encoding = 'utf-8'  # candidate encoding to test, e.g. 'utf-8', 'cp1251'
string = tokens[0]  # sample one token read from the files
try:
    # decode() raises UnicodeError if the bytes are not valid in this encoding
    string.decode(encoding)
    print('string is {}'.format(encoding))
except UnicodeError:
    print('string is not {}'.format(encoding))
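To test several candidate encodings in one pass, the trial decode can be wrapped in a small helper. This is only a sketch: `guess_encoding` and its candidate list are hypothetical names, not part of the original code. A successful decode also does not prove the encoding is correct, since single-byte code pages such as cp1251 accept nearly any byte sequence, so list the strictest candidates first.

```python
def guess_encoding(raw, candidates=("ascii", "utf-8", "cp1251")):
    """Return the first candidate encoding that decodes `raw`, or None.

    Hypothetical helper: `raw` is assumed to be a byte string.
    """
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
        return name
    return None


# Sample byte string for illustration only.
sample = u"caf\xe9".encode("utf-8")
print(guess_encoding(sample))
```

Running it over more than one token, ideally tokens that contain non-ASCII characters, gives a more reliable answer than checking only `tokens[0]`.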

Once you have found the correct encoding, you can build stopped_tokens like this:

# Python 2: `encoding` is the encoding name you confirmed for the tokens
stopped_tokens = [i for i in tokens if unicode(i, encoding) not in en_stop]

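unicode(i, encoding) converts each token to its unicode representation inside the list comprehension, so both sides of the comparison have the same type.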
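If the token list can mix byte strings with text that is already decoded, calling `unicode(i, encoding)` on every item raises an error for the items that are already unicode. A minimal sketch that decodes only byte strings before filtering follows. The names `to_text` and `remove_stop_words` and the sample data are hypothetical, and `utf-8` is an assumed encoding.

```python
def to_text(token, encoding="utf-8"):
    """Decode byte strings to text; return text tokens unchanged."""
    if isinstance(token, bytes):  # on Python 2, `bytes` is an alias of `str`
        return token.decode(encoding)
    return token


def remove_stop_words(tokens, stop_words, encoding="utf-8"):
    """Return the decoded tokens that are not in `stop_words`."""
    # Normalize both sides to text so every comparison is text-to-text.
    stop_words = set(to_text(w, encoding) for w in stop_words)
    text_tokens = (to_text(t, encoding) for t in tokens)
    return [t for t in text_tokens if t not in stop_words]


# Sample data for illustration only: a mix of byte strings and text.
tokens = [b"the", u"caf\xe9".encode("utf-8"), u"model"]
en_stop = [u"the", u"a"]
print(remove_stop_words(tokens, en_stop))
```

Unlike the comprehension in the answer, this returns the decoded tokens rather than the original byte strings, so later steps only ever see text.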

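I took your suggestion and checked that my files are utf-8. However, when I run the code with the change you suggested, the error changes to UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 2: ordinal not in range(128). Is this because I made the files utf-8?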