
Python: Error loading & reading a custom newsgroups corpus with NLTK


I am trying to load the 20 Newsgroups corpus with the NLTK corpus reader, then extract the words from every document and label each document with its category. But when I try to build the list of extracted words and labels, it raises an error.

Here is the code:

import nltk
import random

from nltk.tokenize import word_tokenize

newsgroups = nltk.corpus.reader.CategorizedPlaintextCorpusReader(
    r"C:\nltk_data\corpora\20newsgroups",       # corpus root
    r'(?!\.).*\.txt',                           # pattern for the file ids
    cat_pattern=r'(not_sports|sports)/.*',      # category = top-level folder
    encoding="utf8")

documents = [(list(newsgroups.words(fileid)), category)
             for category in newsgroups.categories()
             for fileid in newsgroups.fileids(category)]

random.shuffle(documents)
The corresponding error is:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-10-de2a1a6859ea> in <module>()
      1 documents = [(list(newsgroups.words(fileid)), category)
----> 2              for category in newsgroups.categories()
      3              for fileid in newsgroups.fileids(category)]
      4 
      5 random.shuffle(documents)

<ipython-input-10-de2a1a6859ea> in <listcomp>(.0)
      1 documents = [(list(newsgroups.words(fileid)), category)
      2              for category in newsgroups.categories()
----> 3              for fileid in newsgroups.fileids(category)]
      4 
      5 random.shuffle(documents)

C:\ProgramData\Anaconda3\lib\site-packages\nltk\corpus\reader\util.py in __len__(self)
    231             # iterate_from() sets self._len when it reaches the end
    232             # of the file:
--> 233             for tok in self.iterate_from(self._toknum[-1]): pass
    234         return self._len
    235 

C:\ProgramData\Anaconda3\lib\site-packages\nltk\corpus\reader\util.py in iterate_from(self, start_tok)
    294             self._current_toknum = toknum
    295             self._current_blocknum = block_index
--> 296             tokens = self.read_block(self._stream)
    297             assert isinstance(tokens, (tuple, list, AbstractLazySequence)), (
    298                 'block reader %s() should return list or tuple.' %

C:\ProgramData\Anaconda3\lib\site-packages\nltk\corpus\reader\plaintext.py in _read_word_block(self, stream)
    120         words = []
    121         for i in range(20): # Read 20 lines at a time.
--> 122             words.extend(self._word_tokenizer.tokenize(stream.readline()))
    123         return words
    124 

C:\ProgramData\Anaconda3\lib\site-packages\nltk\data.py in readline(self, size)
   1166         while True:
   1167             startpos = self.stream.tell() - len(self.bytebuffer)
-> 1168             new_chars = self._read(readsize)
   1169 
   1170             # If we're at a '\r', then read one extra character, since

C:\ProgramData\Anaconda3\lib\site-packages\nltk\data.py in _read(self, size)
   1398 
   1399         # Decode the bytes into unicode characters
-> 1400         chars, bytes_decoded = self._incr_decode(bytes)
   1401 
   1402         # If we got bytes but couldn't decode any, then read further.

C:\ProgramData\Anaconda3\lib\site-packages\nltk\data.py in _incr_decode(self, bytes)
   1429         while True:
   1430             try:
-> 1431                 return self.decode(bytes, 'strict')
   1432             except UnicodeDecodeError as exc:
   1433                 # If the exception occurs at the end of the string,

C:\ProgramData\Anaconda3\lib\encodings\utf_8.py in decode(input, errors)
     14 
     15 def decode(input, errors='strict'):
---> 16     return codecs.utf_8_decode(input, errors, True)
     17 
     18 class IncrementalEncoder(codecs.IncrementalEncoder):

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 6: invalid start byte
---------------------------------------------------------------------------
UnicodeDecodeError回溯(最近一次呼叫最后一次)
在()
1文档=[(列表(新闻组.单词(文件ID)),类别)
---->2用于新闻组中的类别。类别()
3用于新闻组中的文件ID。文件ID(类别)]
4.
5.随机洗牌(文件)
英寸(.0)
1文档=[(列表(新闻组.单词(文件ID)),类别)
2用于新闻组中的类别。类别()
---->3用于新闻组中的文件ID。文件ID(类别)]
4.
5.随机洗牌(文件)
C:\ProgramData\Anaconda3\lib\site packages\nltk\corpus\reader\util.py in\uuuuu len\uuuu(self)
231#iterate_from()在到达末尾时设置self._len
文件的232#部分:
-->233对于self中的tok。从(self.\u toknum[-1])迭代:通过
234返回自我。\u len
235
C:\ProgramData\Anaconda3\lib\site packages\nltk\corpus\reader\util.py in iterate\u from(self,start\u tok)
294自身。_电流_toknum=toknum
295 self.\u current\u blocknum=块索引
-->296令牌=自读块(自读流)
297断言isinstance(标记,(元组、列表、抽象序列))(
298“块读取器%s()应返回列表或元组”。%
C:\ProgramData\Anaconda3\lib\site packages\nltk\corpus\reader\plaintext.py在\u read\u word\u块中(self,stream)
120字=[]
121表示范围内的i(20):#一次读20行。
-->122 words.extend(self.\u word\u tokenizer.tokenize(stream.readline()))
123返回单词
124
C:\ProgramData\Anaconda3\lib\site packages\nltk\data.py在readline中(self,size)
1166虽然正确:
1167 startpos=self.stream.tell()-len(self.bytebuffer)
->1168新字符=自读(readsize)
1169
1170#如果我们在一个'\r',那么多读一个字符,因为
C:\ProgramData\Anaconda3\lib\site packages\nltk\data.py in\u read(self,size)
1398
1399#将字节解码为unicode字符
->1400个字符,已解码字节=自身。已解码字节(字节)
1401
1402#如果我们得到了字节,但无法解码任何字节,那么请进一步阅读。
C:\ProgramData\Anaconda3\lib\site packages\nltk\data.py in\u incr\u decode(self,字节)
1429虽然正确:
1430尝试:
->1431返回自解码(字节,“严格”)
1432除UNICEDECODEDEERROR作为exc外:
1433#如果异常发生在字符串末尾,
解码中的C:\ProgramData\Anaconda3\lib\encodings\utf_8.py(输入,错误)
14
15 def解码(输入,错误='strict'):
--->16返回编解码器。utf_8_解码(输入,错误,真)
17
18类递增编码器(编解码器.递增编码器):
UnicodeDecodeError:“utf-8”编解码器无法解码位置6中的字节0xa0:无效的开始字节
I have tried changing the encoding in the corpus reader to ascii and to utf16; that did not work either. I am also not sure whether the regular expression I supplied is correct. The file names in the 20 newsgroups corpus are two numbers separated by a hyphen (-), for example (see the sketch after these examples):

5-53286

102-53553

8642-104983
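If the file names really look like the examples above, the fileid pattern r'(?!\.).*\.txt' may not match them at all, since it requires a .txt extension; a pattern built around the digits-hyphen-digits form may be closer to what is needed. As for the UnicodeDecodeError itself, ascii and utf16 are both wrong guesses for these files, while latin-1 maps every possible byte to a character and therefore cannot fail to decode. A minimal sketch of both changes (the exact pattern and the latin-1 choice are assumptions, not verified against this copy of the corpus):

import nltk

# Sketch only: fileids such as "sports/5-53286" (category folder, then
# digits-hyphen-digits, no extension). latin-1 decodes any byte sequence,
# so the stray 0xa0 byte no longer raises UnicodeDecodeError. Both the
# pattern and the encoding are assumptions about this corpus layout.
newsgroups = nltk.corpus.reader.CategorizedPlaintextCorpusReader(
    r"C:\nltk_data\corpora\20newsgroups",
    r'(?!\.).*\d+-\d+',
    cat_pattern=r'(not_sports|sports)/.*',
    encoding="latin-1")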

My second concern is whether the error could be produced by the contents of the documents themselves when they are read for feature extraction. Here is what a document in the 20 newsgroups corpus looks like:

From: bil@okcforum.osrhe.edu (Bill Conner) Subject: Re: free moral agency

Dean Kaflowitz (decay@cbnewsj.cb.att.com) wrote: : > : > I think you have got the atheist mythology

: off to a good start. I saw right away that you were not interested in : discussion and that you were going to rant at me. I would : much rather Ms. Healy answered, since she seems to have a reasonable : and rational way of going about things. Say, aren't you the creationist : who made a lot of silly statements about evolution a while back?

: Oh boy, then we must be talking about Christian mythology now. I had : hoped to discuss something with a reasonable, logical : person, but all you seem to have on your side is a repetition : of the same tired mythology I have seen a thousand times before. : I am skipping the rest of your words unless I find something : that comes close to an answer, because they are just a repetition of : some uninteresting doctrine or other and contain no thought : at all.

: Bill, I have to congratulate you. You would not recognize a logical : argument if it bit you in the balls. Such a persistent lack of : learning in the face of repeated attempts to help you : learn (which I have seen in this forum and others in the past) : speaks of a talent far beyond my own meager abilities. : I simply don't know how you manage that apparent ability to ignore : outside influence.

: Dean Kaflowitz

Dean,

Reading your comments again: do you think that merely characterizing an argument is the same as refuting it? Do you think an ad hominem attack is enough to demonstrate anything other than your disapproval of me? Do you have anything to contribute?

Bill

From: cmk@athena.mit.edu (Charles M Kozierok) Subject: Re: Jack Morris

In article <1993Apr19.024222.11181@newshub.ariel.yorku.ca> cs902043@ariel.yorku.ca (SHAWN LUDDINGTON) writes: } In article <1993Apr18.032345.5178@cs.cornell.edu> tedward@cs.cornell.edu (Edward [Ted] Fischer) writes: } >In article <1993Apr18.030412.1210@mnemosyne.cs.du.edu> gspira@nyx.cs.du.edu (Greg Spira) writes: } >>Howard_Wong@mindlink.bc.ca (Howard Wong) writes: }
>> } >>>Has Jack lost a bit of his edge? What is the worst start Jack Morris has had? } >> } >>Uh, Jack lost his edge about 5 years ago, and has had only one above } >>average year in the last 5. } > } >Again goes to prove that it is better to be good than lucky.  You can }
>count on good tomorrow.  Lucky seems to be prone to bad starts (and a } >bad finish last year :-). } > } >(Yes, I am enjoying every last run he gives up.  Who was it who said } >Morris was a better signing than Viola?) }  } Hey Valentine, I don't see Boston with any world series rings on their } fingers.

oooooo. cheap shot. :^)

} Damn, Morris now has three and probably the Hall of Fame in his  } future.

who cares? he had two of them before he came to Toronto; and if the Jays had signed Viola instead of Morris, it would have been Frank who won 20 and got the ring. and he would be on his way to 20 this year, too.

} Therefore, I would have to say Toronto easily made the best  } signing.

your logic is curious, and spurious.

there is no reason to believe that Viola wouldn't have won as many games had *he* signed with Toronto. when you compare their stupid W-L records, be sure to compare their team's offensive averages too.


now, looking at anything like the Morris-Viola sweepstakes a year later is basically hindsight. but there were plenty of reasons why it should have been apparent that Viola was the better pitcher, based on previous recent years and also based on age (Frank is almost 5 years younger! how many knew that?). people got caught up in the '91 World Series, and then on Morris' 21 wins last year. wins are the stupidest, most misleading statistic in baseball, far worse than RBI or R. that he won 21 just means that the Jays got him a lot of runs.

the only really valid retort to Valentine is: weren't the Red Sox trying to get Morris too? oh, sure, they *said* Viola was their first choice afterwards, but what should we have expected they would say?

} And don't tell me Boston will win this year.  They won't  } even be in the top 4 in the division, more like 6th.

if this is true, it won't be for lack of contribution by Viola, so who cares?

-*- charles
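An alternative that avoids the decoding problem altogether is scikit-learn's built-in loader for this corpus: fetch_20newsgroups downloads the data and returns already-decoded strings together with their category labels: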
from sklearn.datasets import fetch_20newsgroups
cats = ['alt.atheism', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)
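From the returned bunch, the same (token list, category) pairs as in the original code can be rebuilt. This is a minimal sketch, assuming NLTK's punkt tokenizer data is installed; newsgroups_train.data holds the raw texts, and newsgroups_train.target / target_names hold the labels:

import random
from nltk.tokenize import word_tokenize

# Pair each document's tokens with its category name.
documents = [(word_tokenize(text), newsgroups_train.target_names[label])
             for text, label in zip(newsgroups_train.data, newsgroups_train.target)]

random.shuffle(documents)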