Python sklearn: convert a text Series to a sparse matrix, scale the numeric values, then combine them into a single X

python pandas scikit-learn

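If I have both text values and numeric values, and I want to:

1. Convert the text to numbers (I use CountVectorizer as a generic example)
2. Scale the numeric data to a common scale
3. Combine 1 and 2 into a single X matrix to pass to an estimator

How do I combine a sparse matrix and a numpy array into a single X, while staying mindful of memory limits when working with large sparse matrices?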

Here is a sample DataFrame:

import pandas as pd

df = pd.DataFrame({
    'Term': [ 'johns company', 'johns company home', 'home repair',
            'home remodeling', 'johns company home repair system',
            'home repair systems', 'home systems', 'repair a home',
            'home remodeling ideas', 'home repair system'],
    'Metric1': [ 319434, 21644, 113185, 73210, 8907, 23016, 36789, 48025, 29624,
               6944],
    'Metric2': [13270, 5015, 4301, 3722, 2502, 2190, 1934, 2468, 2706, 904],
    'Metric3': [ 24170.83, 11034.36, 24137.57, 16548.53, 4777.27, 9565.45,
               8014.29, 9041.97, 7612.31, 4045.37],
    'Metric4': [1.0, 1.1, 2.9, 2.7, 1.1, 2.0, 3.0, 1.9, 1.6, 1.5],
    'y': [712, 406, 297, 215, 190, 0, 125, 100, 94, 93]
    }, columns=['Term', 'Metric1', 'Metric2', 'Metric3', 'Metric4', 'y'])

## df looks like this
                               Term  Metric1  Metric2   Metric3  Metric4    y
0                     johns company   319434    13270  24170.83      1.0  712
1                johns company home    21644     5015  11034.36      1.1  406
2                       home repair   113185     4301  24137.57      2.9  297
3                   home remodeling    73210     3722  16548.53      2.7  215
4  johns company home repair system     8907     2502   4777.27      1.1  190
5               home repair systems    23016     2190   9565.45      2.0    0
6                      home systems    36789     1934   8014.29      3.0  125
7                     repair a home    48025     2468   9041.97      1.9  100
8             home remodeling ideas    29624     2706   7612.31      1.6   94
9                home repair system     6944      904   4045.37      1.5   93

My intent here is to convert the text to numbers:

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
text_features = cv.fit_transform(df['Term'])
text_features
<10x8 sparse matrix of type '<class 'numpy.int64'>'
    with 27 stored elements in Compressed Sparse Row format>
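The num_features array used below is never defined in the text as it appears here. A minimal sketch of the scaling step (step 2), assuming StandardScaler is the scaler and that all four Metric columns are scaled together; neither detail is stated in the question, and only the first few rows of the sample data are repeated to keep the sketch short:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# First four rows of the numeric columns from the sample DataFrame
df = pd.DataFrame({
    'Metric1': [319434, 21644, 113185, 73210],
    'Metric2': [13270, 5015, 4301, 3722],
    'Metric3': [24170.83, 11034.36, 24137.57, 16548.53],
    'Metric4': [1.0, 1.1, 2.9, 2.7],
})

# Assumption: StandardScaler (zero mean, unit variance per column) is the
# scaler in use; any other scikit-learn scaler would slot in the same way.
ss = StandardScaler()
num_features = ss.fit_transform(df[['Metric1', 'Metric2', 'Metric3', 'Metric4']])

# fit_transform returns a dense numpy array with one column per metric
print(type(num_features).__name__, num_features.shape)
```

The result is a dense ndarray, which is why combining it with the sparse CountVectorizer output needs some care.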

My intent here is to join text_features and num_features into a single X to pass to an estimator:

from sklearn.pipeline import FeatureUnion
fu = FeatureUnion([('text', text_features), ('num', num_features)])
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(fu, df['y'])
Traceback (most recent call last):
  File "<pyshell#230>", line 1, in <module>
    lr.fit(fu, df['y'])
  File "C:\Python34\lib\site-packages\sklearn\linear_model\base.py", line 427, in fit
    y_numeric=True, multi_output=True)
  File "C:\Python34\lib\site-packages\sklearn\utils\validation.py", line 510, in check_X_y
    ensure_min_features, warn_on_dtype, estimator)
  File "C:\Python34\lib\site-packages\sklearn\utils\validation.py", line 393, in check_array
    array = array.astype(np.float64)
TypeError: float() argument must be a string or a number, not 'FeatureUnion'


Should I be using FeatureUnion here to join the text and numeric data into a single X matrix?

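I think you are misunderstanding how FeatureUnion works. FeatureUnion applies multiple feature extractors/preprocessors and combines the resulting features into a single matrix. Because you do not have multiple preprocessors but rather multiple matrices that are already computed, you should probably use hstack. numpy.hstack() requires both matrices to be dense. If you need to stay sparse, use scipy.sparse.hstack().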
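The sparse route can be sketched as follows. This is an illustration rather than code from the thread: it assumes StandardScaler for the numeric columns, uses only a few rows and two metrics of the sample data, and wraps the dense block in csr_matrix before stacking.

```python
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# A reduced version of the sample DataFrame (5 rows, 2 metrics)
df = pd.DataFrame({
    'Term': ['johns company', 'johns company home', 'home repair',
             'home remodeling', 'johns company home repair system'],
    'Metric1': [319434, 21644, 113185, 73210, 8907],
    'Metric2': [13270, 5015, 4301, 3722, 2502],
    'y': [712, 406, 297, 215, 190],
})

# Sparse bag-of-words counts for the text column
text_features = CountVectorizer().fit_transform(df['Term'])

# Dense, scaled numeric block (StandardScaler is an assumption)
num_features = StandardScaler().fit_transform(df[['Metric1', 'Metric2']])

# Stack column-wise without densifying the text block
X = sparse.hstack([text_features, sparse.csr_matrix(num_features)],
                  format='csr')

# LinearRegression accepts a sparse X directly
lr = LinearRegression()
lr.fit(X, df['y'])
print(X.shape, sparse.issparse(X))
```

The scaled numeric columns contain almost no zeros, so they gain nothing from sparse storage, but the (typically much wider) text block is never expanded to a dense array, which is where the memory saving comes from.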

Thanks for clearing up my confusion. I now see that np.hstack((text_features.todense(), num_features)) does create a single X. It returns a matrix. Alternatively, np.hstack((text_features.toarray(), num_features)) returns an ndarray.
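The difference between the two dense variants can be checked on a toy example. The small arrays below are made up for illustration, and text_features is built explicitly as a scipy csr_matrix (the type shown in the question's output); both variants materialize the full dense matrix in memory.

```python
import numpy as np
from scipy import sparse

# Toy stand-ins for the real feature blocks (hypothetical values)
text_features = sparse.csr_matrix(np.array([[1, 0, 2], [0, 3, 0]]))
num_features = np.array([[0.5], [-0.5]])

# todense() yields a numpy.matrix, and hstack keeps that subclass
X_matrix = np.hstack((text_features.todense(), num_features))

# toarray() yields a plain ndarray, so the stacked result is one too
X_array = np.hstack((text_features.toarray(), num_features))

print(type(X_matrix).__name__, type(X_array).__name__)
```

The two results hold the same values; the ndarray form is generally the more convenient one to pass on to other code.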
