Python output is garbled when reading the TREC 2006 Spam Track Chinese corpus
I am trying to build a spam filter, but when I read the corpus files, the output in the terminal is garbled.
I downloaded the dataset from the following address:
Environment
macOS 10.15
Python 3.6
Dataset
Python script
This is my Python script:
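# coding=UTF-8
path = "/Users/jason/Documents/Note/Python_Data_Analysis_Fundamental/jieba_spam/trec06c/delay/"
s = []
f = open(path + "/index")
iter_f = iter(f)
str1 = ""
# Collect the relative path of each message listed in the index file
for line in iter_f:
    if (line[0] == "H" or line[1] == "P"):
        continue
    else:
        s.append(line[5:20])
# Open every listed message and print it line by line
for i in s:
    spam_path = path + i
    f = open(spam_path)
    for line in f:
        print(line)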
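- Python 2.7
...
Received: from sina.com ([61.48.9.188])
by spam-gw.ccert.edu.cn (MIMEDefang) with ESMTP id j7VBFQ9v014498
for <gong@ccert.edu.cn>; Sun, 4 Sep 2005 02:14:16 +0800 (CST)
Message-ID: <200508311915.j7VBFQ9v014498@spam-gw.ccert.edu.cn>
From: =?GB2312?B?t7bQob3j?= <yana@sina.com>
Subject: =?gb2312?B?tq+7rbPH0fvH68T6ss6806Oh?=
To: gong@ccert.edu.cn
Content-Type: text/plain;charset="GB2312"
Reply-To: yana@sina.com
Date: Sun, 4 Sep 2005 02:27:04 +0800
X-Priority: 3
X-Mailer: Microsoft Outlook Express 5.50.4133.2400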
Ϊ?????й??Ŀ?ͨ?߽????ӵ?????,???????̨?ٶ???Ŀ?صؾٰ???һ????Ļ???»????--"???ӡ??????????"ȫ??Ѳչ?.
??λ?????????Ӵ??????????Ȥ,?????ǿ??Էdz????Ķȹ????????(ʮһ?ƽ??ܼ??Ժ??ʱ?????),????????Ļ???ҵ??˾?????????̻?.
?˴λ???????????̨??????????Ȩ????ʵҵ??չ???????ι?˾???Ҿ?Ӫ
????˾????Э??.^C??Ϊ?????ܴ˴λ:
Traceback (most recent call last):
  File "/Users/jason/Documents/Note/Python_Data_Analysis_Fundamental/jieba_spam/tf-idf.py", line 19, in <module>
    print(line)
KeyboardInterrupt
Traceback (most recent call last):
  File "/Users/jason/.atom/packages/atom-python-run/lib/../cp/main.py", line 71, in <module>
    parser.call()
  File "/Users/jason/.atom/packages/atom-python-run/cp/cp/parse.py", line 130, in call
    self._exitCode = call(self._command)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 172, in call
    return Popen(*popenargs, **kwargs).wait()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1099, in wait
    pid, sts = _eintr_retry_call(os.waitpid, self.pid, 0)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 125, in _eintr_retry_call
    return func(*args)
KeyboardInterrupt
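- Python 3
Traceback (most recent call last):
  File "/Users/jason/Documents/Note/Python_Data_Analysis_Fundamental/jieba_spam/tf-idf.py", line 18, in <module>
    for line in f:
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd4 in position 256: invalid continuation byte
Process returned 1 (0x1) execution time : 0.197 s
Close this window to continue...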
After applying this change:
I get:
I tried adding calls such as encode('gb2312') or decode('utf-8'), but with them I could only read part of these files.
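Any suggestions would be very helpful.

I solved this problem. The dataset is encoded in GB2312, so you need to write a shell script that converts the data to UTF-8. This is my shell script, in /trec06/convert.sh: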
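# Convert every corpus file from GB2312 to UTF-8, writing <name>.txt next to it
for dir in `ls ~/Documents/Note/Python_Data_Analysis_Fundamental/jieba_spam/trec06c/Data`; do
    for file in `ls ~/Documents/Note/Python_Data_Analysis_Fundamental/jieba_spam/trec06c/Data/$dir`; do
        cd ~/Documents/Note/Python_Data_Analysis_Fundamental/jieba_spam/trec06c/Data/$dir/
        iconv -f GB2312 -t UTF-8 "$file" > "$file.txt"
        # Remove the original GB2312 file once the .txt copy has been written
        rm -rf "$file"
    done
done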
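After running this script, you will have .txt files that Python can read.

Could you add a minimal sample text to a pastebin or something similar and link it here?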
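As an alternative to converting the files on disk, the decoding can also be done in Python 3 when the files are read. The sketch below is only an illustration under stated assumptions, not the poster's method: `read_mail` is a hypothetical helper, and it assumes the `gb18030` codec (a superset of GB2312) together with `errors="replace"`, so that bytes outside strict GB2312 are replaced instead of aborting the read. The demo writes its own small GB2312 sample to a temporary file rather than touching the corpus.

```python
import tempfile
from pathlib import Path


def read_mail(path, encoding="gb18030", errors="replace"):
    """Return the text of one message file, decoded with the given codec.

    Hypothetical helper: gb18030 is assumed because it is a superset of
    GB2312; errors="replace" swaps undecodable bytes for U+FFFD.
    """
    with open(path, encoding=encoding, errors=errors) as f:
        return f.read()


if __name__ == "__main__":
    # Build a small GB2312-encoded sample instead of reading the real corpus.
    sample = "Subject: test\n\n垃圾邮件过滤器\n".encode("gb2312")
    with tempfile.TemporaryDirectory() as tmp:
        sample_path = Path(tmp) / "000"
        sample_path.write_bytes(sample)

        # Decoding GB2312 bytes as UTF-8 raises the error seen in the question.
        try:
            sample_path.read_text(encoding="utf-8")
        except UnicodeDecodeError as exc:
            print("utf-8 failed:", exc.reason)

        # Decoding with an explicit Chinese codec recovers the text.
        print(read_mail(sample_path))
```

In the question's script this would correspond to passing an explicit `encoding` to `open(spam_path)` instead of relying on the default UTF-8 decoding; whether every file in the corpus decodes cleanly this way has not been checked here.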
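The From and Subject headers in the sample output are not raw GB2312 text but MIME encoded-words (`=?GB2312?B?...?=`), so they stay unreadable even once the file itself is decoded correctly. A minimal sketch of decoding them with the standard library's `email.header.decode_header` follows; `decode_mime_header` and its `fallback` parameter are hypothetical names introduced for illustration, and using `gb18030` as the fallback codec is an assumption.

```python
from email.header import decode_header


def decode_mime_header(value, fallback="gb18030"):
    """Decode a header value that may contain MIME encoded-words.

    Hypothetical helper: parts without a declared charset are decoded
    with `fallback`, which is an assumption made for this corpus.
    """
    parts = []
    for text, charset in decode_header(value):
        if isinstance(text, bytes):
            # Encoded-words (and unencoded fragments next to them) come back as bytes.
            parts.append(text.decode(charset or fallback, errors="replace"))
        else:
            # A header with no encoded-words comes back as a plain str.
            parts.append(text)
    return "".join(parts)


if __name__ == "__main__":
    # Header values copied from the sample message in the question.
    print(decode_mime_header("=?GB2312?B?t7bQob3j?= <yana@sina.com>"))
    print(decode_mime_header("=?gb2312?B?tq+7rbPH0fvH68T6ss6806Oh?="))
```

For a spam filter, decoding these headers before tokenizing would let the sender name and subject contribute readable words rather than base64 fragments.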