
How to pass the Reuters-21578 dataset as an input argument to a tokenize function in Python


I am trying to pass the Reuters-21578 dataset as the input argument to my tokenize function
def tokenize(text):
which should remove stop words, tokenize, stem, and lowercase the text.

#!/usr/bin/python3
import nltk
import pandas as pd
import numpy as np
import string
from nltk.corpus import reuters
from nltk import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
import re
cachedStopWords = stopwords.words("english")


for index, i in enumerate(reuters.fileids()):
    text = reuters.raw(fileids=[i])

    # output to a txt file
    #print(text, file=open("output.txt", "a"))


def tokenize(text):
    min_length = 3
    words = map(lambda word: word.lower(), word_tokenize(text))
    words = [word for word in words
             if word not in cachedStopWords]
    tokens = list(map(lambda token: PorterStemmer().stem(token),
                      words))
    p = re.compile('[a-zA-Z]+')
    filtered_tokens = list(filter(lambda token:
                                  p.match(token) and len(token) >= min_length,
                                  tokens))
    return filtered_tokens

result=tokenize(text)
print(result)
As a result, I only get the following output:

['a.h.a', 'automot', 'technolog', 'corp', 'year', 'net', 'shr', 'shr', 'dilut', 'net', 'rev', 'mln', 'mln']

How can that be, if I am passing the whole dataset to the tokenize function?

You are overwriting text on every iteration of the for loop, which is why you only get the output belonging to the last record in the Reuters dataset. Just make a small change to the code:

text = ''
for index, i in enumerate(reuters.fileids()):
    # accumulate the raw text of every document instead of overwriting it
    text += reuters.raw(fileids=[i])
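
If the goal is a list of tokens per article rather than one combined list over the whole corpus, another option is to call tokenize on each document and collect the results keyed by fileid. This is only a minimal sketch, assuming the tokenize function defined above; tokens_by_file is just an illustrative name:

#!/usr/bin/python3
from nltk.corpus import reuters

# Tokenize each Reuters document separately and keep the results per fileid,
# so no single string has to hold the raw text of the whole corpus.
tokens_by_file = {}
for fileid in reuters.fileids():
    raw_text = reuters.raw(fileids=[fileid])
    tokens_by_file[fileid] = tokenize(raw_text)  # tokenize() as defined above

# Example: inspect the first few tokens of one document
first_id = reuters.fileids()[0]
print(first_id, tokens_by_file[first_id][:10])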