
Python: Error loading & reading a custom newsgroups corpus with NLTK


I am trying to load the 20 Newsgroups corpus with the NLTK corpus reader, then extract the words from every document and label each document with its category. But when I try to build the list of extracted words and labels, it raises an error.

Here is the code:

import nltk
import random

from nltk.tokenize import word_tokenize

newsgroups = nltk.corpus.reader.CategorizedPlaintextCorpusReader(
    r"C:\nltk_data\corpora\20newsgroups",       # corpus root
    r'(?!\.).*\.txt',                           # pattern for the file ids
    cat_pattern=r'(not_sports|sports)/.*',      # category = top-level folder
    encoding="utf8")

documents = [(list(newsgroups.words(fileid)), category)
             for category in newsgroups.categories()
             for fileid in newsgroups.fileids(category)]

random.shuffle(documents)
The corresponding error is:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-10-de2a1a6859ea> in <module>()
      1 documents = [(list(newsgroups.words(fileid)), category)
----> 2              for category in newsgroups.categories()
      3              for fileid in newsgroups.fileids(category)]
      4 
      5 random.shuffle(documents)

<ipython-input-10-de2a1a6859ea> in <listcomp>(.0)
      1 documents = [(list(newsgroups.words(fileid)), category)
      2              for category in newsgroups.categories()
----> 3              for fileid in newsgroups.fileids(category)]
      4 
      5 random.shuffle(documents)

C:\ProgramData\Anaconda3\lib\site-packages\nltk\corpus\reader\util.py in __len__(self)
    231             # iterate_from() sets self._len when it reaches the end
    232             # of the file:
--> 233             for tok in self.iterate_from(self._toknum[-1]): pass
    234         return self._len
    235 

C:\ProgramData\Anaconda3\lib\site-packages\nltk\corpus\reader\util.py in iterate_from(self, start_tok)
    294             self._current_toknum = toknum
    295             self._current_blocknum = block_index
--> 296             tokens = self.read_block(self._stream)
    297             assert isinstance(tokens, (tuple, list, AbstractLazySequence)), (
    298                 'block reader %s() should return list or tuple.' %

C:\ProgramData\Anaconda3\lib\site-packages\nltk\corpus\reader\plaintext.py in _read_word_block(self, stream)
    120         words = []
    121         for i in range(20): # Read 20 lines at a time.
--> 122             words.extend(self._word_tokenizer.tokenize(stream.readline()))
    123         return words
    124 

C:\ProgramData\Anaconda3\lib\site-packages\nltk\data.py in readline(self, size)
   1166         while True:
   1167             startpos = self.stream.tell() - len(self.bytebuffer)
-> 1168             new_chars = self._read(readsize)
   1169 
   1170             # If we're at a '\r', then read one extra character, since

C:\ProgramData\Anaconda3\lib\site-packages\nltk\data.py in _read(self, size)
   1398 
   1399         # Decode the bytes into unicode characters
-> 1400         chars, bytes_decoded = self._incr_decode(bytes)
   1401 
   1402         # If we got bytes but couldn't decode any, then read further.

C:\ProgramData\Anaconda3\lib\site-packages\nltk\data.py in _incr_decode(self, bytes)
   1429         while True:
   1430             try:
-> 1431                 return self.decode(bytes, 'strict')
   1432             except UnicodeDecodeError as exc:
   1433                 # If the exception occurs at the end of the string,

C:\ProgramData\Anaconda3\lib\encodings\utf_8.py in decode(input, errors)
     14 
     15 def decode(input, errors='strict'):
---> 16     return codecs.utf_8_decode(input, errors, True)
     17 
     18 class IncrementalEncoder(codecs.IncrementalEncoder):

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 6: invalid start byte
---------------------------------------------------------------------------
UnicodeDecodeError回溯(最近一次呼叫最后一次)
在()
1文档=[(列表(新闻组.单词(文件ID)),类别)
---->2用于新闻组中的类别。类别()
3用于新闻组中的文件ID。文件ID(类别)]
4.
5.随机洗牌(文件)
英寸(.0)
1文档=[(列表(新闻组.单词(文件ID)),类别)
2用于新闻组中的类别。类别()
---->3用于新闻组中的文件ID。文件ID(类别)]
4.
5.随机洗牌(文件)
C:\ProgramData\Anaconda3\lib\site packages\nltk\corpus\reader\util.py in\uuuuu len\uuuu(self)
231#iterate_from()在到达末尾时设置self._len
文件的232#部分:
-->233对于self中的tok。从(self.\u toknum[-1])迭代:通过
234返回自我。\u len
235
C:\ProgramData\Anaconda3\lib\site packages\nltk\corpus\reader\util.py in iterate\u from(self,start\u tok)
294自身。_电流_toknum=toknum
295 self.\u current\u blocknum=块索引
-->296令牌=自读块(自读流)
297断言isinstance(标记,(元组、列表、抽象序列))(
298“块读取器%s()应返回列表或元组”。%
C:\ProgramData\Anaconda3\lib\site packages\nltk\corpus\reader\plaintext.py在\u read\u word\u块中(self,stream)
120字=[]
121表示范围内的i(20):#一次读20行。
-->122 words.extend(self.\u word\u tokenizer.tokenize(stream.readline()))
123返回单词
124
C:\ProgramData\Anaconda3\lib\site packages\nltk\data.py在readline中(self,size)
1166虽然正确:
1167 startpos=self.stream.tell()-len(self.bytebuffer)
->1168新字符=自读(readsize)
1169
1170#如果我们在一个'\r',那么多读一个字符,因为
C:\ProgramData\Anaconda3\lib\site packages\nltk\data.py in\u read(self,size)
1398
1399#将字节解码为unicode字符
->1400个字符,已解码字节=自身。已解码字节(字节)
1401
1402#如果我们得到了字节,但无法解码任何字节,那么请进一步阅读。
C:\ProgramData\Anaconda3\lib\site packages\nltk\data.py in\u incr\u decode(self,字节)
1429虽然正确:
1430尝试:
->1431返回自解码(字节,“严格”)
1432除UNICEDECODEDEERROR作为exc外:
1433#如果异常发生在字符串末尾,
解码中的C:\ProgramData\Anaconda3\lib\encodings\utf_8.py(输入,错误)
14
15 def解码(输入,错误='strict'):
--->16返回编解码器。utf_8_解码(输入,错误,真)
17
18类递增编码器(编解码器.递增编码器):
UnicodeDecodeError:“utf-8”编解码器无法解码位置6中的字节0xa0:无效的开始字节
I have tried changing the encoding in the corpus reader to ascii and to utf16; that did not work either. I am also not sure whether the regular expression I supplied is correct. The file names in the 20 newsgroups corpus are two numbers separated by a hyphen (-), for example (see the sketch after these examples):

5-53286

102-53553

8642-104983
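If the file names really look like the examples above, the fileid pattern r'(?!\.).*\.txt' may not match them at all, since it requires a .txt extension; a pattern built around the digits-hyphen-digits form may be closer to what is needed. As for the UnicodeDecodeError itself, ascii and utf16 are both wrong guesses for these files, while latin-1 maps every possible byte to a character and therefore cannot fail to decode. A minimal sketch of both changes (the exact pattern and the latin-1 choice are assumptions, not verified against this copy of the corpus):

import nltk

# Sketch only: fileids such as "sports/5-53286" (category folder, then
# digits-hyphen-digits, no extension). latin-1 decodes any byte sequence,
# so the stray 0xa0 byte no longer raises UnicodeDecodeError. Both the
# pattern and the encoding are assumptions about this corpus layout.
newsgroups = nltk.corpus.reader.CategorizedPlaintextCorpusReader(
    r"C:\nltk_data\corpora\20newsgroups",
    r'(?!\.).*\d+-\d+',
    cat_pattern=r'(not_sports|sports)/.*',
    encoding="latin-1")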

My second concern is whether the error could be produced by the contents of the documents themselves when they are read for feature extraction. Here is what a document in the 20 newsgroups corpus looks like:

From: bil@okcforum.osrhe.edu (Bill Conner) Subject: Re: free moral agency

Dean Kaflowitz (decay@cbnewsj.cb.att.com) wrote: : > : > I think you have got the atheist mythology

: off to a good start. I saw right away that you were not interested in : discussion and that you were going to rant at me. I would : much rather Ms. Healy answered, since she seems to have a reasonable : and rational way of going about things. Say, aren't you the creationist : who made a lot of silly statements about evolution a while back?

: Oh boy, then we must be talking about Christian mythology now. I had : hoped to discuss something with a reasonable, logical : person, but all you seem to have on your side is a repetition : of the same tired mythology I have seen a thousand times before. : I am skipping the rest of your words unless I find something : that comes close to an answer, because they are just a repetition of : some uninteresting doctrine or other and contain no thought : at all.

: Bill, I have to congratulate you. You would not recognize a logical : argument if it bit you in the balls. Such a persistent lack of : learning in the face of repeated attempts to help you : learn (which I have seen in this forum and others in the past) : speaks of a talent far beyond my own meager abilities. : I simply don't know how you manage that apparent ability to ignore : outside influence.

: Dean Kaflowitz

Dean,

Reading your comments again: do you think that merely characterizing an argument is the same as refuting it? Do you think an ad hominem attack is enough to demonstrate anything other than your disapproval of me? Do you have anything to contribute?

Bill

From: cmk@athena.mit.edu (Charles M Kozierok) Subject: Re: Jack Morris

In article <1993Apr19.024222.11181@newshub.ariel.yorku.ca> cs902043@ariel.yorku.ca (SHAWN LUDDINGTON) writes: } In article <1993Apr18.032345.5178@cs.cornell.edu> tedward@cs.cornell.edu (Edward [Ted] Fischer) writes: } >In article <1993Apr18.030412.1210@mnemosyne.cs.du.edu> gspira@nyx.cs.du.edu (Greg Spira) writes: } >>Howard_Wong@mindlink.bc.ca (Howard Wong) writes: }
>> } >>>Has Jack lost a bit of his edge? What is the worst start Jack Morris has had? } >> } >>Uh, Jack lost his edge about 5 years ago, and has had only one above } >>average year in the last 5. } > } >Again goes to prove that it is better to be good than lucky.  You can }
>count on good tomorrow.  Lucky seems to be prone to bad starts (and a } >bad finish last year :-). } > } >(Yes, I am enjoying every last run he gives up.  Who was it who said } >Morris was a better signing than Viola?) }  } Hey Valentine, I don't see Boston with any world series rings on their } fingers.

oooooo. cheap shot. :^)

} Damn, Morris now has three and probably the Hall of Fame in his  } future.

who cares? he had two of them before he came to Toronto; and if the Jays had signed Viola instead of Morris, it would have been Frank who won 20 and got the ring. and he would be on his way to 20 this year, too.

} Therefore, I would have to say Toronto easily made the best  } signing.

your logic is curious, and spurious.

there is no reason to believe that Viola wouldn't have won as many games had *he* signed with Toronto. when you compare their stupid W-L records, be sure to compare their team's offensive averages too.


now, looking at anything like the Morris-Viola sweepstakes a year later is basically hindsight. but there were plenty of reasons why it should have been apparent that Viola was the better pitcher, based on previous recent years and also based on age (Frank is almost 5 years younger! how many knew that?). people got caught up in the '91 World Series, and then on Morris' 21 wins last year. wins are the stupidest, most misleading statistic in baseball, far worse than RBI or R. that he won 21 just means that the Jays got him a lot of runs.

the only really valid retort to Valentine is: weren't the Red Sox trying to get Morris too? oh, sure, they *said* Viola was their first choice afterwards, but what should we have expected they would say?

} And don't tell me Boston will win this year.  They won't  } even be in the top 4 in the division, more like 6th.

if this is true, it won't be for lack of contribution by Viola, so who cares?

-*- charles
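An alternative that avoids the decoding problem altogether is scikit-learn's built-in loader for this corpus: fetch_20newsgroups downloads the data and returns already-decoded strings together with their category labels: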
from sklearn.datasets import fetch_20newsgroups
cats = ['alt.atheism', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)
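From the returned bunch, the same (token list, category) pairs as in the original code can be rebuilt. This is a minimal sketch, assuming NLTK's punkt tokenizer data is installed; newsgroups_train.data holds the raw texts, and newsgroups_train.target / target_names hold the labels:

import random
from nltk.tokenize import word_tokenize

# Pair each document's tokens with its category name.
documents = [(word_tokenize(text), newsgroups_train.target_names[label])
             for text, label in zip(newsgroups_train.data, newsgroups_train.target)]

random.shuffle(documents)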