Some questions about Python NLTK


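I'm very new to Python and NLTK, but I have a question. I'm writing something that extracts only the words longer than 7 characters from a self-made corpus. But it turns out that it extracts every word... Does anyone know what I'm doing wrong?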

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

loc = r"C:\Users\Dell\Desktop\CORPUS"
Corpus = CategorizedPlaintextCorpusReader(loc, r'(?!\.svn).*\.txt', cat_pattern=r'(Shakespeare|Milton)/.*')

def long_words(corpus):
    for cat in corpus.categories():
        fileids = corpus.fileids(categories=cat)
        words = corpus.words(fileids)
        long_tokens = []
        words2 = set(words)
        if len(words2) >= 7:
            long_tokens.append(words2)
        print(long_tokens)

Thanks, everyone.

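Replace

if len(words2) >= 7:
    long_tokens.append(words2)

with:

long_tokens += [w for w in set(words) if len(w) >= 7]

Explanation: what you were doing was adding all the words (tokens) produced by corpus.words(fileids) whenever the number of words was at least 7 (which I assume is always the case for your corpus). What you really want to do is filter the words shorter than 7 characters out of the token set and append the remaining long words to long_tokens.

Your function should return the result: the tokens with 7 or more characters. I'm assuming that the way you create and work with the CategorizedPlaintextCorpusReader is correct:
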
from nltk.corpus.reader import CategorizedPlaintextCorpusReader

loc = r"C:\Users\Dell\Desktop\CORPUS"
Corpus = CategorizedPlaintextCorpusReader(loc, r'(?!\.svn).*\.txt', cat_pattern=r'(Shakespeare|Milton)/.*')

def long_words(corpus=Corpus):
    long_tokens = []
    for cat in corpus.categories():
        fileids = corpus.fileids(categories=cat)
        words = corpus.words(fileids)
        # Keep only the distinct tokens with 7 or more characters.
        long_tokens += [w for w in set(words) if len(w) >= 7]
    return set(long_tokens)

print("\n".join(long_words()))
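
To make the difference between the two length checks concrete, here is a minimal sketch that needs no corpus on disk. The token list is made up and only stands in for what corpus.words(fileids) would return.

```python
# Made-up tokens standing in for corpus.words(fileids).
words = ["the", "tragedy", "of", "Hamlet", "prince", "Denmark", "paradise", "lost"]
distinct = set(words)

# Check from the question: len() of the whole set counts distinct tokens,
# not characters, so it is True for any corpus with 7 or more distinct words.
whole_set_check = len(distinct) >= 7

# Intended check: test the length of each individual word.
long_tokens = sorted(w for w in distinct if len(w) >= 7)
print(whole_set_check, long_tokens)
```

With the first check the whole set is kept or dropped as a unit, which is why every word showed up in the output.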

Here is the answer to the question you asked in the comments:

for loc in ['cat1', 'cat2']:
    corpus = CategorizedPlaintextCorpusReader(loc, r'(?!\.svn).*\.txt', cat_pattern=r'(Shakespeare|Milton)/.*')
    print(len(long_words(corpus=corpus)), 'words over 7 in', loc)
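
If the goal is one count per category within a single reader, rather than one reader per directory, the counting step might look like the sketch below. The helper count_long_words and the sample data are hypothetical and not part of NLTK; with a real reader, each value could presumably be built from corpus.words(corpus.fileids(categories=cat)).

```python
def count_long_words(words_by_category, min_len=7):
    """Map each category to its number of distinct tokens of at least min_len characters."""
    return {
        cat: len({w for w in words if len(w) >= min_len})
        for cat, words in words_by_category.items()
    }

# Made-up tokens per category, used here instead of a real corpus.
sample = {
    "Shakespeare": ["tragedy", "of", "Hamlet", "Denmark", "tragedy"],
    "Milton": ["paradise", "lost", "of", "man"],
}

for cat, n in count_long_words(sample).items():
    print(n, "words over 7 in", cat)
```

Counting inside a set comprehension means repeated tokens are counted once, which matches the set-based filtering in long_words.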


The result I get is: "". What am I doing wrong? One more thing: I also need to write something that counts the number of long words in each corpus, but len(list) doesn't work because it gives me one overall list. Any ideas?

Your program doesn't seem right to begin with. I'll edit the answer so you can see what it should look like. I can't test it right now, sorry, but hopefully you get the idea.

Strangely, I get the error "TypeError: unhashable type: 'list'", which I have never run into before.

I think append was incorrect; I've just replaced it with the += operator in the answer, check it out.

Yes, that works much better. I had to make a few small changes, but it works now. The only thing I still don't understand is how I can count them separately. The code has to give a result like "100 words over 7 in category 1" and "200 words over 7 in category 2". Many thanks already; I would upvote your answer, but I don't have the required reputation yet.
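
The "unhashable type: 'list'" error mentioned in the comments can be reproduced without NLTK. The sketch below shows one plausible way it arises, assuming an earlier version of the function appended a whole list and then called set() on the result; the tokens are made up.

```python
words = ["tragedy", "paradise", "lost"]

# append adds the filtered list as ONE element, giving a list of lists.
long_tokens = []
long_tokens.append([w for w in words if len(w) >= 7])
try:
    set(long_tokens)
    error = None
except TypeError as exc:
    # Lists are mutable, so they cannot be hashed into a set.
    error = str(exc)

# += extends the list with the individual strings, which are hashable.
flat_tokens = []
flat_tokens += [w for w in words if len(w) >= 7]
unique_tokens = set(flat_tokens)
print(error, unique_tokens)
```

Using long_tokens.extend(...) would behave the same way as += here.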