Addressing sample bias in text classification - Python - ADASYN

I am currently working on a project where I have scraped a retailer's reviews from a review website. The aim is to classify each review in the dataset by topic, "Delivery" or "Customer Service", using a random forest classifier.

After looking at the dataset, over 90% of the reviews (training and test data) relate to "Delivery". My lecturer told me that we need to account for sample bias. I researched this and tried to implement a correction in Python using ADASYN (near the bottom of the code below):

I ran the code overnight (without the ADASYN step the code runs quickly), but it did not finish. I am working with roughly 32,000 reviews. The point of running this is to create synthetic entries for the under-represented class in my sample ("Customer Service") so that the random forest classifier is better trained. At the moment I could blindly predict "Delivery" for every review in the test data and be right more than 90% of the time.
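To quantify the imbalance before doing anything else, a quick check of the label distribution helps (a minimal sketch; it assumes the y list built in the code further down):

from collections import Counter

# Tally how many reviews carry each label; with over 90% of reviews relating
# to "Delivery", one value should dominate this count heavily.
print(Counter(y))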

I would appreciate it if anyone could point out what I am doing wrong, or whether there are better options in Python.
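One alternative I have read about, instead of oversampling, is to make the random forest itself weight the rare class more heavily. A minimal sketch, assuming the same X and y built in the code below (class_weight='balanced' is a standard scikit-learn parameter, not something I have tried yet):

from sklearn.ensemble import RandomForestClassifier

# 'balanced' re-weights each class inversely to its frequency, so the scarce
# "Customer Service" reviews count more during training instead of being oversampled.
weighted_rf = RandomForestClassifier(n_estimators=10, criterion='entropy',
                                     class_weight='balanced', random_state=0)
weighted_rf.fit(X_train, y_train)  # here X_train/y_train would come from a split of the original, un-oversampled X and y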

Thanks

import pandas as pd

# Read the CSV in chunks and stitch the pieces back into a single DataFrame
chunksize = 10
TextFileReader = pd.read_csv('TestToSentimentAnalyse.csv', chunksize=chunksize, header=None)
dataset = pd.concat(TextFileReader, ignore_index=False)
dataset.columns = ['Reviews', 'Delivery', 'Customer_Service', 'Purchase_Date', 'Likelihood_to_Recommend',
                   'Overall_Satisfaction', 'Location', 'Date_Published', 'Sentiment']
# Drop the first row, which holds the original header because header=None was used
dataset = dataset.iloc[1:]

# Cleaning the texts
import re

corpus = []
for i in range(1, 29779):
    # Strip non-letters and lower-case each review before vectorising
    review = re.sub('[^a-zA-Z]', ' ', str(dataset['Reviews'][i])).lower()
    corpus.append(review)

# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
# .toarray() materialises a dense reviews-by-vocabulary matrix, which is very large for ~30,000 reviews
X = cv.fit_transform(corpus).toarray()

# Set up dependent variable from the two topic columns
# (as written below, 1 means the review is about delivery and 0 means customer service)
delivery = pd.to_numeric(dataset['Delivery'], errors='coerce')
customer_service = pd.to_numeric(dataset['Customer_Service'], errors='coerce')

y = []
for i in range(1, 29779):
    if delivery[i] == 2:
        y.append(1)
    elif customer_service[i] == 2:
        y.append(0)
    elif delivery[i] == 0 and customer_service[i] == 0:
        y.append(0.5)  ## flaw in this as we had to choose one
    elif delivery[i] == 1 and customer_service[i] == 1:
        y.append(0.5)  ## flaw in this as we had to choose one
    elif delivery[i] == 0 and customer_service[i] == 1:
        y.append(0)
    elif delivery[i] == 1 and customer_service[i] == 0:
        y.append(1)
    else:
        y.append('Needs Review')

import numpy as np

# Drop the rows that could not be labelled, using a single boolean mask
# (deleting rows one at a time with np.delete copies the large X array repeatedly)
y = np.array(y)
keep = y != 'Needs Review'
X = X[keep]
y = y[keep]

from imblearn.over_sampling import ADASYN

# ADASYN generates synthetic samples for the minority class(es);
# fit_resample is the current imbalanced-learn method name (fit_sample has been removed)
ada = ADASYN(random_state=42)
X_ada, y_ada = ada.fit_resample(X, y)

# sklearn.cross_validation was removed; train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split

# Note: splitting after resampling means synthetic rows can end up in the test set
X_train, X_test, y_train, y_test = train_test_split(X_ada, y_ada, test_size=0.25, random_state=0)

from sklearn.ensemble import RandomForestClassifier

classifier_10E = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
classifier_10E.fit(X_train, y_train)

y_pred_10E = classifier_10E.predict(X_test)
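To check whether the model actually beats the "always predict Delivery" baseline, I would look at per-class metrics rather than plain accuracy. A minimal sketch using scikit-learn's standard evaluation helpers (assuming y_test and y_pred_10E from above):

from sklearn.metrics import confusion_matrix, classification_report

# Accuracy alone is misleading at >90% class imbalance; the confusion matrix and
# per-class precision/recall show how the minority class is actually handled.
print(confusion_matrix(y_test, y_pred_10E))
print(classification_report(y_test, y_pred_10E))

Because the resampling happens before train_test_split, some synthetic rows will sit in the test set, so these numbers are probably optimistic.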