Python: how do I correctly use sparse vector features together with numeric features in pandas to train an sklearn model?


I created a bag-of-words feature:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(ngram_range=(1,5))
all_corpus = X_train["excerpt"].append(X_val["excerpt"]).append(df_test["excerpt"])
vectorizer.fit(all_corpus)
bag_of_word_feature = vectorizer.transform(X_train["excerpt"])
X_train["count_bag_of_word_feature"] = bag_of_word_feature 
I also created numeric features (each feature is a single number).

This is my dataframe:

When I try to fit the model, it does not work:

regressor = KNeighborsRegressor(10, weights='distance')
regressor.fit(X_train_feature, y_train.to_numpy())
If I use only the numeric features, it works:

regressor1.fit(X_train_feature[["avg_word_length", "avg_sent_length"]], y_train.to_numpy())
It also works with only the bag-of-words feature:

regressor2.fit(bag_of_word_feature , y_train.to_numpy())

How do I correctly concatenate the three features above?

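The output of vectorizer.transform() is a sparse matrix, and you cannot simply force it into a single column of a dataframe. You can use bag_of_word_feature.toarray() to convert it to dense format and concatenate that to the dataframe, but this is probably not advisable if your data is large, because the dense array stores every zero explicitly.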

Below I use some example data:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from scipy.sparse import csr_matrix, hstack

twenty_train = fetch_20newsgroups(subset='train',categories=['sci.med','comp.graphics'])

vectorizer = CountVectorizer()
vectorizer.fit(twenty_train.data)
bag_of_word_feature = vectorizer.transform(twenty_train.data)
Suppose the dataframe also contains two other numeric features:

avg_train = pd.DataFrame(np.random.randint(0,100,(len(twenty_train.data),2)),
             columns=["avg_word_length", "avg_sent_length"])
We only need to convert the numeric features to a sparse matrix and then hstack them with the bag-of-words matrix:

X_train = hstack([csr_matrix(avg_train.values),bag_of_word_feature])
y_train = twenty_train.target
regressor = KNeighborsRegressor(10, weights='distance')
regressor.fit(X_train, y_train)
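
As an alternative to stacking by hand (this is not part of the answer above), scikit-learn's ColumnTransformer can vectorize the text column and pass the numeric columns through in one step, so the same transformation is reapplied automatically at prediction time. A minimal sketch on made-up data, with column names taken from the question:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; only the column names come from the question.
df = pd.DataFrame({
    "excerpt": ["the cat sat", "the dog ran fast", "a cat ran", "dogs run fast"],
    "avg_word_length": [3.0, 3.25, 2.33, 3.67],
    "avg_sent_length": [3, 4, 3, 3],
})
y = [1.0, 2.0, 1.5, 2.5]

features = ColumnTransformer([
    # A bare string (not a list) hands the vectorizer a 1-D column of text.
    ("bow", CountVectorizer(), "excerpt"),
    # Numeric columns are passed through unchanged.
    ("num", "passthrough", ["avg_word_length", "avg_sent_length"]),
])

# n_neighbors=2 only because the toy data has four rows.
model = make_pipeline(features, KNeighborsRegressor(2, weights="distance"))
model.fit(df, y)
pred = model.predict(df.iloc[:1])
```

Because KNN is distance based, unscaled numeric columns and raw word counts can weigh very differently; replacing "passthrough" with a scaler such as StandardScaler may be worth trying.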

Thanks, mate. I'll get back to you as soon as I can.