Python CountVectorizer: building a vocabulary with extra stop words removed


I have a list of sentences in a column:

sentence
I am writing on Stackoverflow because I cannot find a solution to my problem.
I am writing on Stackoverflow. 
I need to show some code. 
Please see the code below
I want to run some text mining and analysis on them, such as getting word frequencies. For this I take the following approach:

from sklearn.feature_extraction.text import CountVectorizer
# list of text documents
text = ["I am writing on Stackoverflow because I cannot find a solution to my problem."]
vectorizer = CountVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
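Fitting on this single sentence already pulls stop words into the vocabulary, which is exactly what I want to filter out (the default tokenizer drops one-character tokens such as "I"):

print(sorted(vectorizer.vocabulary_))
# ['am', 'because', 'cannot', 'find', 'my', 'on', 'problem',
#  'solution', 'stackoverflow', 'to', 'writing']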

How can I apply this to my column and remove the extra stop words after building the vocabulary?

You can use the stop_words parameter of CountVectorizer, which takes care of removing the stop words:

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

# run nltk.download("stopwords") once if the corpus is not yet installed
text = ["I am writing on Stackoverflow because I cannot find a solution to my problem."]
stopwords = stopwords.words("english")  # you may add or define your own stopwords here
vectorizer = CountVectorizer(stop_words=stopwords)
vectorizer.fit_transform(text)
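After fitting, you can verify which tokens survived the stop-word filter (get_feature_names_out requires scikit-learn >= 1.0; older versions use get_feature_names):

print(vectorizer.get_feature_names_out())
# ['cannot' 'find' 'problem' 'solution' 'stackoverflow' 'writing']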
If you want to do all the preprocessing in the dataframe:

import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

text = ["I am writing on Stackoverflow because I cannot find a solution to my problem.", "I am writing on Stackoverflow."]
df = pd.DataFrame({"text": text})
stopwords = stopwords.words("english")  # you may add or define your own stopwords here
vectorizer = CountVectorizer(stop_words=stopwords)
df["counts"] = vectorizer.fit_transform(df["text"]).todense().tolist()
df
                                                text              counts
0  I am writing on Stackoverflow because I cannot...  [1, 1, 1, 1, 1, 1]
1                     I am writing on Stackoverflow.  [0, 0, 0, 0, 1, 1]
In both cases, you end up with a vocab from which the stopwords have been removed:

print(vectorizer.vocabulary_)
{'writing': 5, 'stackoverflow': 4, 'cannot': 0, 'find': 1, 'solution': 3, 'problem': 2}
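To get the word frequencies the question asks about, you can sum the count matrix over all rows and pair the totals with the feature names; a minimal sketch, assuming numpy and the df from above:

import numpy as np
X = vectorizer.fit_transform(df["text"])       # sparse document-term matrix
freq = pd.Series(
    np.asarray(X.sum(axis=0)).ravel(),         # total count of each word over all rows
    index=vectorizer.get_feature_names_out(),
).sort_values(ascending=False)
print(freq)
# stackoverflow    2
# writing          2
# cannot           1
# find             1
# problem          1
# solution         1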


Thank you for the very nice and clear explanation, Sergey!
You're welcome! Note that the counts in the CountVectorizer matrix follow the alphabetically sorted vocab (i.e., the index values given in the vocabulary_ dict).
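To make that ordering concrete: a word's column in the count matrix is exactly the index stored for it in vocabulary_, so you can recover the column order like this:

for word, col in sorted(vectorizer.vocabulary_.items(), key=lambda kv: kv[1]):
    print(col, word)
# 0 cannot
# 1 find
# 2 problem
# 3 solution
# 4 stackoverflow
# 5 writing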