Addressing sample bias in text classification - Python - ADASYN
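Tags: python, sampling, text-classification

I am currently working on a project in which I scraped reviews of a retailer from a review website. The aim is to use a random forest classifier to classify each review in the dataset by topic: "Delivery" or "Customer Service".

After looking at the dataset, I found that over 90% of the reviews (training and test data) relate to "Delivery". My lecturer told me that we need to account for sample bias. I researched this and tried to implement a correction in Python using ADASYN (near the bottom of the code below).

I ran the code overnight and it did not finish (without the ADASYN call, the code runs quickly). I am working with about 32,000 reviews. The purpose of this step is to create synthetic entries for the under-represented class in my sample ("Customer Service") so that the random forest classifier is trained better. At the moment I could blindly predict "Delivery" for every review in the test data and be right more than 90% of the time.

If anyone can point out what I am doing wrong, or whether there is a better option in Python, I would be grateful.

Thanks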
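To put a number on the "blind prediction" point above, here is a small sketch with made-up labels (90 "Delivery", 10 "Customer Service"; these counts are illustrative assumptions, not the scraped data). It shows why plain accuracy looks good for an always-"Delivery" predictor while recall on the minority class is zero. The helper names `accuracy` and `recall` are hypothetical and are not part of the question's code.

```python
# Toy labels only: 1 = "Delivery" (majority), 0 = "Customer Service" (minority).
# The 90/10 split is an assumption chosen to mirror the imbalance described above.
y_true = [1] * 90 + [0] * 10

# A "classifier" that blindly predicts the majority class for every review.
y_blind = [1] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, label):
    """Fraction of the items whose true class is `label` that were predicted as `label`."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in relevant) / len(relevant)


blind_accuracy = accuracy(y_true, y_blind)
delivery_recall = recall(y_true, y_blind, label=1)
service_recall = recall(y_true, y_blind, label=0)
# Balanced accuracy is the mean of the per-class recalls.
balanced_accuracy = (delivery_recall + service_recall) / 2

print(blind_accuracy, service_recall, balanced_accuracy)
```

On the real data, per-class recall or a confusion matrix (for example scikit-learn's `classification_report`) would probably be a more informative check than accuracy alone.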
import pandas as pd
chunksize = 10
TextFileReader = pd.read_csv('TestToSentimentAnalyse.csv', chunksize=chunksize, header=None)
dataset = pd.concat(TextFileReader, ignore_index=False)
dataset.columns = ['Reviews', 'Delivery', 'Customer_Service', 'Purchase_Date', 'Likelihood_to_Recommend',
                   'Overall_Satisfaction', 'Location', 'Date_Published', 'Sentiment']
dataset = dataset.iloc[1:]
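# Cleaning the texts
import re
corpus = []
for i in range(1, 29779):
    # NOTE: the cleaning step is missing from the post, so `review` was undefined.
    # The line below is an assumed minimal cleanup (letters only, lower-cased),
    # not necessarily what the original code did.
    review = re.sub('[^a-zA-Z]', ' ', str(dataset['Reviews'][i])).lower()
    corpus.append(review)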
# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()
# Set up dependent variable - as coded below, delivery is 1, customer service is 0, ambiguous is 0.5
y = []
for i in range(1, 29779):
    if dataset['Delivery'][i] == '2':
        y.append(1)
    elif dataset['Customer_Service'][i] == '2':
        y.append(0)
    elif dataset['Delivery'][i] == '0' and dataset['Customer_Service'][i] == '0':
        y.append(0.5)  ## flaw in this as we had to choose one
    elif dataset['Delivery'][i] == '1' and dataset['Customer_Service'][i] == '1':
        y.append(0.5)  ## flaw in this as we had to choose one
    elif dataset['Delivery'][i] == '0' and dataset['Customer_Service'][i] == '1':
        y.append(0)
    elif dataset['Delivery'][i] == '1' and dataset['Customer_Service'][i] == '0':
        y.append(1)
    elif dataset['Delivery'][i] == 2:
        y.append(1)
    elif dataset['Customer_Service'][i] == 2:
        y.append(0)
    elif dataset['Delivery'][i] == 0 and dataset['Customer_Service'][i] == 0:
        y.append(0.5)  ## flaw in this as we had to choose one
    elif dataset['Delivery'][i] == 1 and dataset['Customer_Service'][i] == 1:
        y.append(0.5)  ## flaw in this as we had to choose one
    elif dataset['Delivery'][i] == 0 and dataset['Customer_Service'][i] == 1:
        y.append(0)
    elif dataset['Delivery'][i] == 1 and dataset['Customer_Service'][i] == 0:
        y.append(1)
    else:
        y.append('Needs Review')
# Drop every row labelled 'Needs Review' from both X and y.
# A boolean mask removes them in one pass instead of copying X once per deleted row.
import numpy as np
keep = np.array([label != 'Needs Review' for label in y])
X = X[keep]
y = np.array(y)[keep]
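from imblearn.over_sampling import ADASYN
ada = ADASYN(random_state=42)
# fit_sample was renamed to fit_resample in newer imbalanced-learn releases
X_ada, y_ada = ada.fit_resample(X, y)
# sklearn.cross_validation was replaced by sklearn.model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_ada, y_ada, test_size=0.25, random_state=0)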
from sklearn.ensemble import RandomForestClassifier
classifier_10E = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
classifier_10E.fit(X_train, y_train)
y_pred_10E = classifier_10E.predict(X_test)
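One thing worth checking in the code above is that the resampling happens before the train/test split, so synthetic points can end up in the test set and make the evaluation look better than it is. A minimal sketch of the alternative ordering follows: split first, then oversample only the training part. To stay dependency-free it uses plain random duplication of minority rows rather than ADASYN, and both the toy data and the helper `oversample_minority` are hypothetical, not part of the question's code.

```python
import random
from collections import Counter


def oversample_minority(rows, labels, seed=0):
    """Randomly duplicate rows of smaller classes until all classes are the same size.

    Hypothetical helper: this is plain random oversampling, not ADASYN.
    It copies existing rows instead of synthesising new ones.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        pool = [row for row, lab in zip(rows, labels) if lab == label]
        extra = [rng.choice(pool) for _ in range(target - count)]
        out_rows.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_rows, out_labels


# Toy data (assumed): 90 majority rows labelled 1, 10 minority rows labelled 0.
rows = [[i] for i in range(100)]
labels = [1] * 90 + [0] * 10

# 1) Split first, so the test set contains only real, untouched rows.
rng = random.Random(0)
indices = list(range(len(rows)))
rng.shuffle(indices)
cut = int(0.75 * len(indices))
train_idx, test_idx = indices[:cut], indices[cut:]
X_train = [rows[i] for i in train_idx]
y_train = [labels[i] for i in train_idx]
X_test = [rows[i] for i in test_idx]
y_test = [labels[i] for i in test_idx]

# 2) Oversample the training split only; the test split is left as it is.
X_train_bal, y_train_bal = oversample_minority(X_train, y_train, seed=0)

print(Counter(y_train), Counter(y_train_bal), Counter(y_test))
```

The same ordering should carry over to ADASYN: call `train_test_split` on `X` and `y` first, then resample only `X_train` and `y_train`. As a possibly cheaper alternative to resampling, `RandomForestClassifier` also accepts `class_weight='balanced'`, though whether that is enough for this dataset would need to be tested.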