Python UnicodeWarning: Unicode equal comparison. How do I fix this error?

python unicode utf-8

As in other similar questions, I am running the following code:

with open(fin, 'r') as inFile, open(fout, 'w') as outFile:
    for line in inFile:
        line = line.replace('."</documents', '"').replace('. ', ' ')
        print(' '.join([word for word in line.lower().split()
                        if len(word) >= 3 and word not in stopwords.words('english')]),
              file=outFile)

How can I solve this problem?

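The membership test word not in stopwords.words('english') relies on equality comparisons. Either the variable word or at least one of the values in stopwords.words('english') is not a Unicode value.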
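To make the comparison issue concrete, here is a minimal sketch. The list stop_list is a hypothetical stand-in for stopwords.words('english'), assumed to hold text (Unicode) values, and the word is invented for illustration. The same word is tested once as raw UTF-8 bytes and once after decoding.

```python
# Hypothetical stand-in for stopwords.words('english'): a list of text values.
stop_list = [u"the", u"caf\u00e9"]

raw_word = u"caf\u00e9".encode("utf8")  # bytes, as read from a byte-mode file
text_word = raw_word.decode("utf8")     # the same word as a text value

# A membership test compares the word with each list element using ==.
raw_found = raw_word in stop_list
text_found = text_word in stop_list

print(raw_found, text_found)
```

Here raw_found is False and text_found is True. On Python 2 the bytes-versus-text comparison is the kind of comparison that emits the UnicodeWarning; on Python 3 bytes and text simply never compare equal.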

Since you are reading from a file, the most likely candidate here is word; decode it, or use a file object that decodes the data as it is read:

print(' '.join([word for word in line.lower().split()
                if len(word) >= 3 and
                   word.decode('utf8') not in stopwords.words('english')]),
      file=outFile)

or open both files with io.open() and an explicit encoding, which gives you file objects in text mode that encode or decode as needed.
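A rough sketch of the io.open() approach, under a few assumptions: STOP_WORDS is a small hypothetical replacement for the NLTK stop word list, filter_words is an illustrative helper name, and the sample text and temporary file names are made up. It only aims to show that lines read through a text-mode file object are already decoded, so no word.decode() call is needed.

```python
import io
import os
import tempfile

# Hypothetical stand-in for stopwords.words('english'), kept as text values.
STOP_WORDS = {u"the", u"and", u"with"}


def filter_words(src_path, dst_path, min_len=3):
    """Copy src to dst, keeping lowercase non-stop words of at least min_len characters."""
    with io.open(src_path, 'r', encoding='utf8') as in_file, \
            io.open(dst_path, 'w', encoding='utf8') as out_file:
        for line in in_file:  # each line is already decoded text
            kept = [w for w in line.lower().split()
                    if len(w) >= min_len and w not in STOP_WORDS]
            out_file.write(u' '.join(kept) + u'\n')


# Small round trip through temporary files (names are illustrative).
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, 'in.txt')
dst = os.path.join(tmp_dir, 'out.txt')
with io.open(src, 'w', encoding='utf8') as f:
    f.write(u"The caf\u00e9 with na\u00efve ok\n")
filter_words(src, dst)
with io.open(dst, 'r', encoding='utf8') as f:
    result = f.read()
```

Encoding and decoding happen at the file boundary only, so everything inside the loop works with text values.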

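The latter is much less error-prone. For example, you test the length of word, but what you are actually testing is the number of bytes. Any word containing characters outside the ASCII code point range takes more than one UTF-8 byte per character, so len(word) is not the same as len(word.decode('utf8')).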
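The difference between byte count and character count can be checked with a short example. The word chosen here is only an illustration; any word with non-ASCII characters would behave similarly.

```python
word_text = u"o\u00f9"                 # a two-character word with one non-ASCII character
word_bytes = word_text.encode("utf8")  # the same word as UTF-8 bytes

char_count = len(word_text)   # counts characters
byte_count = len(word_bytes)  # counts bytes

# A filter such as len(word) >= 3 gives different answers for the two forms.
passes_as_text = char_count >= 3
passes_as_bytes = byte_count >= 3
```

The non-ASCII character takes two bytes in UTF-8, so the byte form is three bytes long and passes the length filter, while the two-character text form does not.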

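Thanks @martijn pieters, word.decode('utf8') works great! Which is more efficient: using io or the first approach?

@user275832: I would use the second approach; work directly with Unicode values rather than UTF-8 bytes.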

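For reference, the second approach opens both files through io.open() with an explicit encoding:

import io

# Text-mode file objects: lines are decoded from UTF-8 on read
# and text is encoded back to UTF-8 on write.
with io.open(fin, 'r', encoding='utf8') as inFile, \
        io.open(fout, 'w', encoding='utf8') as outFile:
    for line in inFile:
        pass  # same loop body as in the question; no word.decode() needed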