Python: How do I correctly use sparse vector features and numeric features in pandas to train an sklearn model?
I created a bag-of-words feature:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(ngram_range=(1,5))
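# Series.append was removed in pandas 2.0; pd.concat (with pandas imported as pd) is the replacement
all_corpus = pd.concat([X_train["excerpt"], X_val["excerpt"], df_test["excerpt"]])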
vectorizer.fit(all_corpus)
bag_of_word_feature = vectorizer.transform(X_train["excerpt"])
X_train["count_bag_of_word_feature"] = bag_of_word_feature
I also created numeric features (each feature is a single number).
This is my dataframe.
When I try to fit the model, it does not work:
regressor = KNeighborsRegressor(10, weights='distance')
regressor.fit(X_train_feature, y_train.to_numpy())
It works if I use only the numeric features:
regressor1.fit(X_train_feature[["avg_word_length", "avg_sent_length"]], y_train.to_numpy())
or only the bag-of-words feature:
regressor2.fit(bag_of_word_feature , y_train.to_numpy())
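How do I correctly combine the three features above?

The output of vectorizer.transform() is a sparse matrix, and you cannot simply force it into a single dataframe column. You can use
bag_of_word_feature.toarray()
to convert it to dense format and concatenate it with the dataframe, but this is probably not advisable if your data is large, because the dense array stores every zero count explicitly.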
Below I use some example data:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from scipy.sparse import csr_matrix, hstack
twenty_train = fetch_20newsgroups(subset='train',categories=['sci.med','comp.graphics'])
vectorizer = CountVectorizer()
vectorizer.fit(twenty_train.data)
bag_of_word_feature = vectorizer.transform(twenty_train.data)
Suppose you also have two other numeric features in a dataframe:
avg_train = pd.DataFrame(np.random.randint(0,100,(len(twenty_train.data),2)),
columns=["avg_word_length", "avg_sent_length"])
We just need to convert that to sparse and hstack it with the bag-of-words matrix:
X_train = hstack([csr_matrix(avg_train.values), bag_of_word_feature])
y_train = twenty_train.target
regressor = KNeighborsRegressor(10, weights='distance')
regressor.fit(X_train, y_train)
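The same stacking has to be repeated, in the same column order, for any data you later predict on. A minimal sketch of that, assuming the column names from the question; the helper `build_features` and the toy `train`/`val` frames are hypothetical, and the vectorizer is fitted on the training text only for brevity:

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsRegressor

NUMERIC_COLS = ["avg_word_length", "avg_sent_length"]

def build_features(df, vectorizer):
    """Stack numeric columns and bag-of-words counts into one CSR matrix."""
    numeric = csr_matrix(df[NUMERIC_COLS].to_numpy(dtype=float))
    bow = vectorizer.transform(df["excerpt"])
    # Numeric columns first, then the counts; keep this order everywhere.
    return hstack([numeric, bow], format="csr")

# Hypothetical toy data standing in for the question's train/validation split.
train = pd.DataFrame({
    "excerpt": ["the cat sat", "the dog barked", "a bird sang", "the fish swam"],
    "avg_word_length": [3.0, 4.0, 3.3, 3.7],
    "avg_sent_length": [3.0, 3.0, 3.0, 3.0],
})
val = pd.DataFrame({
    "excerpt": ["the cat sang", "a dog swam"],
    "avg_word_length": [3.3, 3.0],
    "avg_sent_length": [3.0, 3.0],
})
y_train = np.array([0.1, 0.4, 0.2, 0.9])

# Fit the vocabulary once, then reuse the fitted vectorizer for every split.
vectorizer = CountVectorizer().fit(train["excerpt"])
X_train = build_features(train, vectorizer)
X_val = build_features(val, vectorizer)

regressor = KNeighborsRegressor(n_neighbors=2, weights="distance")
regressor.fit(X_train, y_train)
val_pred = regressor.predict(X_val)
```

Because a nearest-neighbours model works on raw distances, numeric columns with a much larger range than the word counts can dominate the result, so scaling them before stacking may be worth trying.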
Thanks man, I'll get back to you as soon as I can.