
Python stopwords not removing words in NLTK - output is the same as the original text


After removing special characters and the like, I tokenize the sentences. The stopword removal step returns the text without removing any of the filler words.

import nltk
import re
import string
from nltk.corpus import stopwords

""" Function to remove special characters etc."""

def remove_characters_before_tokenization(sentence, keep_apostrophes=False):
    sentence = sentence.strip()
    if keep_apostrophes:
        PATTERN = r'[?|$|&|*|%|@|(|)|~]'
        filtered_sentence = re.sub(PATTERN, r'', sentence)
    else:
        PATTERN = r'[^a-zA-Z0-9 ]'
        filtered_sentence = re.sub(PATTERN, r'', sentence)
    return filtered_sentence

""" Generic function to word tokenize"""

def tokenize_text(text):
    sentences = nltk.sent_tokenize(text)
    word_tokens = [nltk.word_tokenize(sentence) for sentence in sentences]
    return word_tokens

Sample= open("Sample.txt", "r") # open a text file 

cleaned_text= remove_characters_before_tokenization(Sample.read())

words=tokenize_text(cleaned_text)  # tokenised word without special characters

""" Function to remove stopwords"""

def remove_stopwords(tokens):
    stopword_list = nltk.corpus.stopwords.words('english')
    for token in tokens:
        if  not token in stopword_list:
             filtered_tokens= token
    return filtered_tokens


stop_removed = remove_stopwords(words)
print(stop_removed)



The output stop_removed is the same as words. I think the mistake is in the for loop over the tokens, but I am not sure how to correct it.

filtered_tokens = token
only stores a single token; you need a data structure that stores a collection of items (for example, a nested list):

stop = set(stopwords.words('english'))  # set membership tests are O(1)

def remove_stopwords(text):
    # 'text' is the nested list returned by tokenize_text: one token list per sentence
    filtered_text = [[tok for tok in sent if tok not in stop] for sent in text]
    return filtered_text
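
For completeness, here is a minimal sketch of how this fixed function slots into the rest of your pipeline; the file name and the example sentence are only illustrative placeholders:

# Assume Sample.txt contains e.g. "This is a sample sentence and it has some stopwords."
cleaned_text = remove_characters_before_tokenization(open("Sample.txt", "r").read())
words = tokenize_text(cleaned_text)      # nested list: one token list per sentence
stop_removed = remove_stopwords(words)   # same nesting, with stopwords filtered out
print(stop_removed)
# Expected output for the example sentence:
# [['This', 'sample', 'sentence', 'stopwords']]

Note that the NLTK stopword list is lowercase, so capitalized words such as "This" are kept unless you lowercase the tokens first.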