Removing common words from a list of strings (Python, NLTK)


I am trying to remove a list of common words from a set of strings (text) in a pandas DataFrame. The DataFrame has the following columns:

 ['Item', 'Label', 'Comment']

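I have already removed the stop words, but after building a word cloud I can see some other very common words that I would also like to remove, to get a better picture of the problem.

This is my current working code. It does a decent job, but not quite good enough: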

import re
import nltk
from nltk.corpus import wordnet

# This receives one sentence at a time.
# Use a loop or a lambda if you want to process a whole dataset.
def nlp_preprocess(text, stopwords, lemmatizer, wordnet_map):
    # Remove tags first; once punctuation is stripped the angle brackets are gone
    text = re.sub(r"</?.*?>", " ", text)
    # Remove punctuation (anything that is not a letter)
    text = re.sub(r"[^a-zA-Z]", " ", text)
    # Remove special characters and digits, collapsing runs into one space
    text = re.sub(r"(\d|\W)+", " ", text)
    # Remove stop words such as "and", "is", "a" and "the"
    text = " ".join([word for word in text.split() if word not in stopwords])
    # Find the base form (lemma) of every word in the sentence
    pos_tagged_text = nltk.pos_tag(text.split())
    # wordnet_map maps the first letter of a POS tag to a WordNet POS; default to noun
    text = " ".join([lemmatizer.lemmatize(word, wordnet_map.get(pos[0], wordnet.NOUN)) for word, pos in pos_tagged_text])
    return text

def full_nlp_text_process(df, pandas_params, stopwords, lemmatizer, wordnet_map):
    # preprocess_dataframe is my own helper; it is not shown in this post
    data = preprocess_dataframe(df, pandas_params)
    nlp_data = data.copy()
    # Clean every comment and store the result in a new column
    nlp_data["ProComment"] = nlp_data["Comment"].apply(lambda x: nlp_preprocess(x, stopwords, lemmatizer, wordnet_map))
    return data, nlp_data

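I know I want something like the FreqDist snippet at the end of this post, but I don't know how I should put it in there to remove the words, or where I should put it (i.e. in the text processing or in the DataFrame processing).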

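If you know how to remove stopwords, then use the same approach: create a list with the words you want to remove and use it exactly the way you use stopwords. In Python it is better to create a new list with the words you want to keep than to remove words from the list you are iterating over in a for-loop.

Just like when you removed the stop words: you created a new list with the words you want to keep.

My question is where I should put it. Because my cleaning code works one row at a time, it probably cannot find the most common words of the whole document. I am still not 100% sure that this is right, though, because it is only an assumption.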
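One possible way to handle the "where do I put it" question is to work in two passes: first clean every comment as before, then count words over the whole cleaned column, and finally filter each comment against that corpus-wide list. The sketch below uses only the standard library so it runs on its own; collections.Counter stands in for nltk.FreqDist (which is a Counter subclass), and the helper names find_common_words and drop_words as well as the sample comments are made up for illustration.

```python
from collections import Counter


def find_common_words(cleaned_comments, top_n=10):
    """Return the top_n most frequent words across ALL comments."""
    counts = Counter()
    for comment in cleaned_comments:
        counts.update(comment.split())
    # most_common yields (word, count) pairs; keep only the words
    return {word for word, _ in counts.most_common(top_n)}


def drop_words(comment, unwanted):
    """Build a new string from the words we want to keep."""
    return " ".join(w for w in comment.split() if w not in unwanted)


# Hypothetical, already-cleaned comments (what the ProComment column might hold)
cleaned = [
    "product arrive late product break",
    "love product fast delivery",
    "product ok delivery slow",
]

common = find_common_words(cleaned, top_n=2)
filtered = [drop_words(c, common) for c in cleaned]
print(common)
print(filtered)
```

With the DataFrame from the question, the same idea would presumably be applied after full_nlp_text_process: pass nlp_data["ProComment"] to find_common_words, then call nlp_data["ProComment"].apply(lambda x: drop_words(x, common)). That keeps the per-row cleaning function unchanged and puts the corpus-level step in the DataFrame processing.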

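# text must be a list of tokens here; FreqDist on a raw string would count characters
fdist2 = nltk.FreqDist(text)
# most_common returns (word, count) pairs, so keep only the words
most_list = [word for word, count in fdist2.most_common(10)]
# Somewhere else
# Build a new list instead of calling remove() while iterating, which skips items
text = [t for t in text if t not in most_list]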
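Two things tend to go wrong in a loop of this shape, and both can be seen on a toy list. The sketch below is only an illustration with made-up data: it uses collections.Counter, which offers the same most_common method as nltk.FreqDist, to show that a membership test against (word, count) pairs never matches a bare word, and that calling remove() on a list while iterating over it leaves some matches behind.

```python
from collections import Counter

words = ["the", "the", "cat", "the", "sat"]

# Pitfall 1: most_common returns (word, count) tuples, not bare words
pairs = Counter(words).most_common(1)
word_in_pairs = "the" in pairs            # a string never equals a tuple
common = {word for word, _ in pairs}      # unpack to get the words themselves

# Pitfall 2: removing from a list while looping over it skips elements
buggy = list(words)
for w in buggy:
    if w in common:
        buggy.remove(w)

# Safer: build a new list containing only the words to keep
kept = [w for w in words if w not in common]

print(pairs, word_in_pairs)
print(buggy)
print(kept)
```

On this input the in-place loop still leaves one "the" in the list, while the list comprehension removes all of them.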