Python apostrophe turns into \x92

Tags: python, python-2.7, apostrophe

mycorpus.txt

Human where's machine interface for lab abc computer applications
A where's survey of user opinion of computer system response time

stopwords.txt

let's
ain't
there's

The following code

corpus = set()
for line in open("path\\to\\mycorpus.txt"):
    corpus.update(set(line.lower().split()))
print corpus

stoplist = set()
for line in open("C:\\Users\\Pankaj\\Desktop\\BTP\\stopwords_new.txt"):
    stoplist.add(line.lower().strip())
print stoplist
gives the following output:

set(['a', "where's", 'abc', 'for', 'of', 'system', 'lab', 'machine', 'applications', 'computer', 'survey', 'user', 'human', 'time', 'interface', 'opinion', 'response'])
set(['let\x92s', 'ain\x92t', 'there\x92s'])
Why does the apostrophe turn into \x92 in the second set?

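Code point 92 (hex) in the windows-1252 encoding corresponds to Unicode code point 2019 (hex), RIGHT SINGLE QUOTATION MARK. It looks very much like an apostrophe and is probably the character actually present in your stopwords.txt. Judging by how Python displays it, I guess the file is encoded in windows-1252, or in another encoding that shares its ASCII and ’ code point values. Python 2 reads the file as raw bytes, so the repr of the set shows that undecoded byte as \x92.

' vs ’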
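To make the explanation above concrete, here is a minimal sketch of decoding that byte and normalizing it to an ASCII apostrophe. It assumes the stop-word file really is encoded in cp1252, which is only a guess based on the \x92 in the output. The helper names normalize_apostrophes and load_stoplist are hypothetical and were made up for this illustration.

```python
import io

# In windows-1252 (cp1252) the byte 0x92 decodes to U+2019,
# RIGHT SINGLE QUOTATION MARK, not to the ASCII apostrophe U+0027.
raw = b"ain\x92t"
text = raw.decode("cp1252")
print(repr(text))


def normalize_apostrophes(s):
    """Replace the typographic apostrophe (U+2019) with the ASCII one."""
    return s.replace(u"\u2019", u"'")


def load_stoplist(path, encoding="cp1252"):
    """Read one stop word per line, decoding the bytes explicitly.

    The cp1252 default is an assumption, not something the file states.
    """
    stoplist = set()
    with io.open(path, encoding=encoding) as f:
        for line in f:
            word = normalize_apostrophes(line.strip().lower())
            if word:  # skip blank lines
                stoplist.add(word)
    return stoplist
```

Loading the stop words through load_stoplist should yield entries such as "ain't" that compare equal to the ASCII forms in the corpus. If the corpus file could also contain typographic quotes, the same normalization would have to be applied to it.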

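If you want to write ASCII text, don't use Microsoft's editors. If you do want to use them, you have to deal with cp1252 (which also includes the "right quotation mark").

So why does the first set show "where's" and not "where\x92s"?

@PankajSinghal: Probably because the first file really does contain an ASCII apostrophe character. To confirm this, use a tool such as hexdump to check the actual bytes in both files.

Ya, I found that the characters are different. So what should I do to make it appear as "ain't" instead of "ain\x92t"?

@PankajSinghal: The simplest way is to edit stopwords.txt with a text editor, sed, or a similar tool and replace all ’ with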