How can I perform vector operations on the 20 newsgroups vectorized dataset?


When I load the dataset with

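import pandas as pd
from sklearn.datasets import fetch_20newsgroups_vectorized

newsgroups = fetch_20newsgroups_vectorized(subset='all')
labels = newsgroups.target_names
target = newsgroups.target
# Map each integer label to its newsgroup name
target = pd.DataFrame([labels[i] for i in target], columns=['label'])
data = newsgroups.data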
the data matrix has shape (18846, 130107).


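How can I subset the data by target name (for example, extract only "rec.sport.baseball") and apply vector operations to those sparse row vectors (for example, compute the mean vector or distances)?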

Unfortunately, fetch_20newsgroups_vectorized has no option for selecting data by target name, but fetch_20newsgroups does (its categories parameter). You just need to vectorize the data yourself.

Here is how you can do it:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
newsgroups_train = fetch_20newsgroups(subset='all',
                                      categories=['rec.sport.baseball'])
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(newsgroups_train.data)
print(vectors.shape)
# (994, 13986)
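
The question also asks about the mean vector and distances, so here is a minimal sketch of those operations on sparse rows. It assumes the rows are held in a SciPy CSR matrix, as both loaders return. The helper names rows_for_category and mean_vector are hypothetical, and the tiny matrix is only a stand-in for the real data so the example runs without downloading anything. The same boolean-mask selection should also work on the pre-vectorized matrix from the question, using its target and target_names.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_distances


def rows_for_category(data, target, target_names, category):
    """Hypothetical helper: keep only the rows labelled `category`."""
    label_id = list(target_names).index(category)
    mask = np.asarray(target) == label_id  # boolean mask over the rows
    return data[mask]


def mean_vector(rows):
    """Hypothetical helper: column-wise mean as a dense 1-D array."""
    return np.asarray(rows.mean(axis=0)).ravel()


# Toy stand-in for newsgroups.data, newsgroups.target and
# newsgroups.target_names (assumption: not the real dataset).
data = sparse.csr_matrix(np.array([
    [1.0, 0.0, 2.0],
    [0.0, 3.0, 0.0],
    [3.0, 0.0, 0.0],
]))
target = np.array([0, 1, 0])
target_names = ["rec.sport.baseball", "sci.space"]

baseball = rows_for_category(data, target, target_names, "rec.sport.baseball")
centroid = mean_vector(baseball)

# Cosine distance from every selected document to the category centroid
distances = cosine_distances(baseball, centroid.reshape(1, -1)).ravel()

print(baseball.shape)
print(centroid)
print(distances)
```

With the answer's code, vectors could take the place of baseball directly, since it already contains only one category. The mean is returned dense because averaging many sparse rows usually fills in most columns anyway.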