Python: Using a chi-squared test to list all words in a corpus that reject the null hypothesis


python scikit-learn nlp


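I have a script that lists the top n words (those with the highest chi-squared values). However, instead of extracting a fixed number of words, I want to extract all words whose p-value is below 0.05, i.e. the words that reject the null hypothesis.

Here is my code: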

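from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2

# vectorize the top 100000 n-grams (unigrams to trigrams)
tfidf = TfidfVectorizer(max_features=100000, ngram_range=(1, 3))
X_tfidf = tfidf.fit_transform(df.review_text)
y = df.label
chi2score = chi2(X_tfidf, y)[0]
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
scores = list(zip(tfidf.get_feature_names_out(), chi2score))
# sort ascending by score (renamed so the imported chi2 function is not shadowed)
sorted_scores = sorted(scores, key=lambda x: x[1])
allchi2 = list(zip(*sorted_scores))

# list the top 20 words
allchi2 = allchi2[0][-20:]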

So in this case, rather than listing the top 20 words, I want all the words that reject the null hypothesis, i.e. all words in the reviews that depend on the sentiment class (positive or negative).

The question is unrelated to keras; please do not spam irrelevant tags (removed).

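from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2

# vectorize the top 100000 n-grams (unigrams to trigrams)
tfidf = TfidfVectorizer(max_features=100000, ngram_range=(1, 3))
X_tfidf = tfidf.fit_transform(df.review_text)
y = df.label
# chi2 returns two arrays: the chi-squared statistics and their p-values
chi2_score, pval_score = chi2(X_tfidf, y)
# keep only the (feature, p-value) pairs with p < 0.05
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
feature_pval_items = filter(lambda x: x[1] < 0.05, zip(tfidf.get_feature_names_out(), pval_score))
# sort by p-value, most significant first
you_want_feature_pval_items = sorted(feature_pval_items, key=lambda x: x[1])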