Python: how to combine TfidfVectorizer with the remaining columns
I am trying to run TF-IDF on one column in Python and want to combine the output with the remaining columns of the DataFrame so that I can feed it to a classifier. I used FeatureUnion for the heterogeneous data, but for some reason I keep getting an error. I am using the following code:
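from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline

# dftrf is the input DataFrame; 'Name_x' is its text column
pipecols1 = [col for col in dftrf.columns if col != 'Name_x']
pipecols2 = ['Name_x']

class MySelector(BaseEstimator, TransformerMixin):
    """Return the column(s) of the input selected by `key`."""
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]

# Pass the non-text columns through unchanged
var = Pipeline([
    ('var', MySelector(key=pipecols1))])

# Select the text column and vectorize it
text = Pipeline([
    ('text', MySelector(key=pipecols2)),
    ('tfidf', TfidfVectorizer())])

feats = FeatureUnion(transformer_list=[('var', var),
                                       ('text', text)],
                     transformer_weights={'var': 1, 'text': 1})
feature_processing = Pipeline([('feats', feats)])
feature_processing.fit(x, y)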
I keep getting the following error:
ValueError                                Traceback (most recent call last)
<ipython-input-61-b17725dbe418> in <module>
----> 1 feature_processing.fit_transform(dftrf)

~/.conda/envs/test_py3/lib/python3.6/site-packages/sklearn/pipeline.py in fit_transform(self, X, y, **fit_params)
    298         Xt, fit_params = self._fit(X, y, **fit_params)
    299         if hasattr(last_step, 'fit_transform'):
--> 300             return last_step.fit_transform(Xt, y, **fit_params)
    301         elif last_step is None:
    302             return Xt

~/.conda/envs/test_py3/lib/python3.6/site-packages/sklearn/pipeline.py in fit_transform(self, X, y, **fit_params)
    799             self._update_transformer_list(transformers)
    800         if any(sparse.issparse(f) for f in Xs):
--> 801             Xs = sparse.hstack(Xs).tocsr()
    802         else:
    803             Xs = np.hstack(Xs)

~/.local/lib/python3.6/site-packages/scipy/sparse/construct.py in hstack(blocks, format, dtype)
    463
    464     """
--> 465     return bmat([blocks], format=format, dtype=dtype)
    466
    467

~/.local/lib/python3.6/site-packages/scipy/sparse/construct.py in bmat(blocks, format, dtype)
    584                                                     exp=brow_lengths[i],
    585                                                     got=A.shape[0]))
--> 586                     raise ValueError(msg)
    587
    588                 if bcol_lengths[j] == 0:

ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,1].shape[0] == 1, expected 999000.
pipecols2 is my text column.
pipecols1 holds the columns I want to combine without any transformation.
Any ideas?
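One likely cause, sketched here under assumptions rather than confirmed against the original data: because pipecols2 is a list, the selector returns a one-column DataFrame, and iterating a DataFrame yields its column labels, not its rows. TfidfVectorizer would then see a single "document" (the label 'Name_x'), which matches the `shape[0] == 1` in the error. Selecting the text column with a plain string key returns a 1-D Series of documents instead. The small DataFrame and its columns `a` and `b` below are hypothetical stand-ins for dftrf.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline


class MySelector(BaseEstimator, TransformerMixin):
    """Select columns by key: a str key gives a Series, a list key a DataFrame."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.key]


# Hypothetical stand-in for dftrf: one text column plus two numeric columns.
dftrf = pd.DataFrame({
    "Name_x": ["red apple", "green apple", "yellow banana"],
    "a": [1.0, 2.0, 3.0],
    "b": [0.5, 0.1, 0.9],
})

# Reproduce the suspected problem: a one-column DataFrame iterates over its
# column labels, so the vectorizer receives one document instead of three.
bad = TfidfVectorizer().fit_transform(dftrf[["Name_x"]])
print(bad.shape[0])  # number of rows the vectorizer produced

# Sketch of a fix: use a string key so the text branch yields a 1-D Series.
var = Pipeline([
    ("select", MySelector(key=[c for c in dftrf.columns if c != "Name_x"])),
])
text = Pipeline([
    ("select", MySelector(key="Name_x")),
    ("tfidf", TfidfVectorizer()),
])
feats = FeatureUnion(transformer_list=[("var", var), ("text", text)])

# Both branches now have one row per input row, so hstack succeeds.
X = feats.fit_transform(dftrf)
print(X.shape)
```

If the numeric columns also need preprocessing, sklearn.compose.ColumnTransformer may be a simpler alternative to a hand-written selector. It too expects the text column to be given as a string rather than a one-element list when it feeds a vectorizer.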