Python: Feature extraction from multiple text columns for a classification problem


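What is the correct way to extract features from multiple text columns and apply a classification algorithm to them? Please advise if I am doing something wrong.

Example dataset

Independent variables: Description1, Description2, State, NumericCol1, NumericCol2

Dependent variable: TargetCategory

Code: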

########### Feature Extraction for Text Data #####################
from sklearn.feature_extraction.text import TfidfVectorizer

######### Description1 (any text representation works here: CountVectorizer, TF-IDF, word2vec, BERT, etc.)
# Use a separate vectorizer per column so each fitted vocabulary is kept.
tfidf_desc1 = TfidfVectorizer(max_features = 500,
                              ngram_range = (1,3),
                              stop_words = "english")
X_Description1 = tfidf_desc1.fit_transform(df["Description1"].tolist())

######### Description2 (any text representation works here: CountVectorizer, TF-IDF, word2vec, BERT, etc.)
tfidf_desc2 = TfidfVectorizer(max_features = 500,
                              ngram_range = (1,3),
                              stop_words = "english")
X_Description2 = tfidf_desc2.fit_transform(df["Description2"].tolist())


######### State (has 100 unique entries, which is why BinaryEncoder is used)
import category_encoders as ce
# The column is named "State" in the dataset, so the encoder must use the same name.
binary_encoder = ce.BinaryEncoder(cols=['State'], return_df=True)
X_state = binary_encoder.fit_transform(df[["State"]])


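import scipy.sparse  # import the submodule explicitly; a bare "import scipy" may not load it
# Stack the sparse text features with the dense encoded and numeric columns.
X = scipy.sparse.hstack((X_Description1,
                         X_Description2,
                         X_state.to_numpy(),
                         df[["NumericCol1", "NumericCol2"]].to_numpy())).tocsr()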

y = df['TargetCategory']


##### Train/Test Split ########
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=111)

##### Create Model ######
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score, classification_report, cohen_kappa_score
from sklearn import metrics

# Baseline random-forest model
# max_features='auto' was removed in recent scikit-learn releases; 'sqrt' is the equivalent for classifiers.
rfc = RandomForestClassifier(criterion = 'gini', n_estimators=1000, verbose=1, n_jobs = -1,
                             class_weight = 'balanced', max_features = 'sqrt')
rfcg = rfc.fit(X_train,y_train) # fit on training data


####### Prediction ##########
predictions = rfcg.predict(X_test)
print('Baseline: Accuracy: ', round(accuracy_score(y_test, predictions)*100, 2))
print('\n Classification Report:\n', classification_report(y_test,predictions))

The way to use multiple columns as input in scikit-learn is to use

is an example of how to use it with heterogeneous data.
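
One scikit-learn tool that fits this description is `ColumnTransformer`, which applies a separate transformer to each column and concatenates the results. The sketch below is an illustration under assumptions, not necessarily the example the answer links to. The toy DataFrame is hypothetical and only reuses the column names from the question, `OneHotEncoder` is swapped in for the question's `BinaryEncoder` to avoid the extra dependency, and `n_estimators` is reduced to keep the run short. Wrapping everything in a `Pipeline` means the vectorizers are fitted on the training split only, so no vocabulary or IDF statistics leak in from the test rows.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; only the column names come from the question.
df = pd.DataFrame({
    "Description1": ["red apple pie", "green apple tart",
                     "fast sports car", "slow family car"] * 5,
    "Description2": ["sweet dessert", "sour dessert",
                     "small coupe", "large van"] * 5,
    "State": ["NY", "CA", "TX", "NY"] * 5,
    "NumericCol1": [1.0, 2.0, 30.0, 40.0] * 5,
    "NumericCol2": [0.5, 0.7, 9.0, 8.0] * 5,
    "TargetCategory": ["food", "food", "vehicle", "vehicle"] * 5,
})

preprocess = ColumnTransformer(
    transformers=[
        # Text vectorizers expect 1-D input, so the column is given as a string.
        ("desc1", TfidfVectorizer(max_features=500, ngram_range=(1, 3),
                                  stop_words="english"), "Description1"),
        ("desc2", TfidfVectorizer(max_features=500, ngram_range=(1, 3),
                                  stop_words="english"), "Description2"),
        # Encoders expect 2-D input, so the column is given as a list.
        # Assumption: OneHotEncoder stands in for the question's BinaryEncoder.
        ("state", OneHotEncoder(handle_unknown="ignore"), ["State"]),
        # Numeric columns are passed through unchanged.
        ("numeric", "passthrough", ["NumericCol1", "NumericCol2"]),
    ]
)

model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", RandomForestClassifier(n_estimators=100,
                                          class_weight="balanced",
                                          random_state=111)),
])

X = df.drop(columns="TargetCategory")
y = df["TargetCategory"]

# Split the raw DataFrame first; the pipeline fits its transformers on X_train only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=111, stratify=y)

model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

Because the split happens on the raw DataFrame, the same fitted `model` can later be applied to new rows with `model.predict(new_df)`, as long as `new_df` has the same five input columns.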