Removing the most common words from a set of sentences in Python

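I have 5 sentences in an np.array and I want to find the n most common words that appear in them. For example, if n=5, I want the 5 most common words. Here is an example: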

0    rt my mother be on school amp race
1    rt i am a red hair down and its a great
2    rt my for your every day and my chocolate
3    rt i am that red human being a man
4    rt my mother be on school and wear

Here is the code I use to get the n most common words:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

A = np.array(["rt my mother be on school amp race", 
              "rt i am a red hair down and its a great", 
              "rt my for your every day and my chocolate",
              "rt i am that red human being a man",
              "rt my mother be on school and wear"])

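n = 5
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(A)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vocabulary = vectorizer.get_feature_names_out()
# Sum the counts over all sentences and keep the indices of the n largest totals
ind = np.argsort(X.toarray().sum(axis=0))[-n:]

top_n_words = [str(vocabulary[a]) for a in ind]

print(top_n_words)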

The result is:

['school', 'am', 'and', 'my', 'rt']

However, I want to ignore stop words such as "and", "am", and "my" among these most common words. How can I do that?

You just need to pass the parameter stop_words='english' to CountVectorizer.

You should now get:

['wear', 'mother', 'red', 'school', 'rt']

See the scikit-learn documentation for CountVectorizer.

Thanks. But I still want it to print 5 words, ignoring the stop words.

Updated, please check.

import numpy as np
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

# Requires the NLTK stop word corpus: run nltk.download('stopwords') once beforehand
stop_words = set(stopwords.words('english'))

A = np.array(["rt my mother be on school amp race",
              "rt i am a red hair down and its a great",
              "rt my for your every day and my chocolate",
              "rt i am that red human being a man",
              "rt my mother be on school and wear"])
# Drop stop words from each sentence before vectorizing
data = []
for sentence in A:
    kept = [w for w in sentence.split() if w not in stop_words]
    data.append(" ".join(kept))

vect = CountVectorizer()
x = vect.fit_transform(data)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
keyword = vect.get_feature_names_out()
# Total count of each word across all sentences
totals = x.toarray().sum(axis=0)
# Stable ascending sort, then reverse so the most frequent words come first
order = np.argsort(totals, kind="stable")[::-1]
print(keyword[order][:5].tolist())
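
For comparison, the same filter-then-count idea can be sketched with only the standard library. This is not the approach from either answer above: the names sentences, custom_stop_words, and top_n_words are hypothetical, and the stop word set is a small hand-picked one for illustration rather than the NLTK or scikit-learn list.

```python
from collections import Counter

sentences = ["rt my mother be on school amp race",
             "rt i am a red hair down and its a great",
             "rt my for your every day and my chocolate",
             "rt i am that red human being a man",
             "rt my mother be on school and wear"]

# Hypothetical, hand-picked stop word set (an assumption for this sketch)
custom_stop_words = {"i", "a", "am", "and", "my", "be", "on", "for",
                     "your", "its", "that", "being", "every", "down"}


def top_n_words(sentences, n, stop_words):
    """Return the n most frequent (word, count) pairs, skipping stop words."""
    counts = Counter(
        word
        for sentence in sentences
        for word in sentence.split()
        if word not in stop_words
    )
    return counts.most_common(n)


top_words = top_n_words(sentences, 5, custom_stop_words)
print(top_words)
```

Unlike CountVectorizer, whose default tokenizer discards single-character tokens, str.split() keeps words such as "i" and "a", which is why they appear in the stop word set here.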