String matching in a Unicode text file? (Python)

python python-2.7 unicode

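This code runs without errors, but it never finds a match between the strings in the two files. Can anyone help me?

import re, codecs
import string
import sys
stopwords=codecs.open('stopwords_harkat1.txt','r','utf_8')
lines=codecs.open('Corpus_v2.txt','r','utf_8')
for line in lines:
    line = line.rstrip().lstrip()
    #print line
    tokens = line.split('\t')
    token=tokens[4]

    if token in stopwords:
        print token

I also tried the match method, but that did not work either.

You need to load the contents of the file, not just open it. In the code above, stopwords is an open file object rather than a collection of words, so "token in stopwords" compares each stripped token against raw lines that still end in a newline, and the file is exhausted after the first pass through it.

Replace the following line:

stopwords = codecs.open('stopwords_harkat1.txt','r','utf_8')

with:

with codecs.open('stopwords_harkat1.txt','r','utf_8') as f:
    # Assuming one stop word per line.
    stopwords = set(line.strip() for line in f)

    # Otherwise, use the following line
    # stopwords = set(word for line in f for word in line.split())

I tried it, but got the following error:

Traceback (most recent call last):
  File "C:\Users\Desktop\remove stop words\remove\remove\remove.py", line 7
    with open(codecs.open('stopwords_harkat1.txt','r','utf_8')) as f:
TypeError: coercing to Unicode: need string or buffer, instance found

@msm, the leading open( was a typo. I updated the answer. Please check.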
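For reference, the fix from the answer can be put together into a complete script roughly as follows. This is only a sketch under a few assumptions: the helper names load_stopwords and find_stopword_tokens are made up for illustration, io.open is used in place of codecs.open because it decodes to Unicode in the same way on Python 2.7 and is the built-in open on Python 3, and rows with fewer than five tab-separated columns are skipped, which the original question does not mention.

```python
from __future__ import print_function

import io


def load_stopwords(path, encoding='utf_8'):
    """Read the stop word file into a set of Unicode strings.

    Hypothetical helper; assumes one stop word per line.
    """
    with io.open(path, 'r', encoding=encoding) as f:
        # strip() removes the trailing newline; blank lines are ignored.
        return set(line.strip() for line in f if line.strip())


def find_stopword_tokens(corpus_path, stopwords, column=4, encoding='utf_8'):
    """Yield the token in the given tab-separated column if it is a stop word.

    Hypothetical helper; column=4 mirrors tokens[4] in the question.
    """
    with io.open(corpus_path, 'r', encoding=encoding) as lines:
        for line in lines:
            tokens = line.strip().split('\t')
            # Assumption: rows that are too short are skipped, not an error.
            if len(tokens) <= column:
                continue
            if tokens[column] in stopwords:
                yield tokens[column]


# Usage with the file names from the question:
# stopwords = load_stopwords('stopwords_harkat1.txt')
# for token in find_stopword_tokens('Corpus_v2.txt', stopwords):
#     print(token)
```

Loading the words into a set once means every membership test is a hash lookup against clean strings, instead of a scan over an already consumed file. If the files happen to start with a byte order mark, passing encoding='utf_8_sig' would drop it; whether that applies here is a guess about the data.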