
Inconsistent results between predict() and predict_proba() in scikit-learn multiclass text classification (Python)

Tags: python, machine-learning, scikit-learn, nlp

I am working on a multiclass text classification problem that must return the top 5 matches rather than just the single best match. "Success" is therefore defined as at least one of the top 5 matches being the correct classification, and given that definition the algorithm must achieve a success rate of at least 95%. Naturally, we train the model on a subset of the data and test on the remaining subset to validate its success.

I have been using scikit-learn's predict_proba() to select the top 5 matches and computing the success rate with the custom script below, which seemed to work fine on my sample data. However, on my own custom data I noticed that the top-5 success rate was lower than the top-1 success rate obtained with .predict(), which is mathematically impossible: the top-1 result is automatically included among the top-5 results, so the top-5 success rate must be at least equal to the top-1 rate, if not higher. To troubleshoot, I compare the top-1 success rates from predict() and predict_proba() to make sure they are equal, and check that the top-5 success rate is greater than the top-1 rate.
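To make that claim concrete, here is a minimal, self-contained sketch of the top-k bookkeeping described above (the toy probabilities, class names, and true labels are made up for illustration; classifier.classes_ would supply the real labels): predict_proba() yields one probability per class, np.argsort() orders those probabilities ascending, and the last k columns are the k most likely classes.

import numpy as np

# toy probabilities for 3 samples over 4 classes (each row sums to 1)
probas = np.array([[0.10, 0.20, 0.60, 0.10],
                   [0.50, 0.30, 0.10, 0.10],
                   [0.25, 0.25, 0.25, 0.25]])
classes = np.array(['A', 'B', 'C', 'D'])   # stand-in for classifier.classes_
y_true = np.array(['C', 'B', 'D'])

k = 2
top_k = classes[np.argsort(probas, axis=1)[:, -k:]]   # ascending sort, keep last k columns
top_k_hits = np.array([y in row for y, row in zip(y_true, top_k)])
top_1_hits = classes[probas.argmax(axis=1)] == y_true

# top-k accuracy can never fall below top-1 accuracy
print(top_k_hits.mean() >= top_1_hits.mean())   # True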

I have set up the script below so you can follow my logic and see whether I have made a faulty assumption somewhere, or whether there is a problem with my data that needs fixing. I am testing many classifiers and feature sets, but for simplicity you will see that I just use count vectors as features and logistic regression as the classifier, since (to my knowledge) neither is part of the problem. I would greatly appreciate anyone who can explain why I am seeing this discrepancy.

Code: (the full script is shown below under "Updated code:")

Sample output using scikit-learn's built-in twenty newsgroups dataset (this is my target). Note: I ran this exact code on another dataset and was able to produce these results, which tells me that the function and its dependencies work, so the problem must be in the data.

Are Top 5 Results greater than Top 1 Result?:  True 
Are Top 1 Results equal from predict() and predict_proba()?:  True  
Details:
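(The per-run accuracy figures for the newsgroups benchmark were not preserved above. For context, the following is a minimal sketch, under assumed settings, of how a comparable twenty-newsgroups check could be run with the same CountVectorizer + LogisticRegression pipeline; it is not the poster's exact script.)

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import numpy as np

# load the bundled dataset: raw text plus integer labels
bunch = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
train_x, valid_x, train_y, valid_y = train_test_split(
    bunch.data, bunch.target, test_size=0.33, random_state=6, stratify=bunch.target)

vect = CountVectorizer()
clf = LogisticRegression(max_iter=1000)
clf.fit(vect.fit_transform(train_x), train_y)

X_valid = vect.transform(valid_x)
probas = clf.predict_proba(X_valid)
top5 = clf.classes_[np.argsort(probas, axis=1)[:, -5:]]

top5_acc = np.mean([y in row for y, row in zip(valid_y, top5)])
top1_acc = np.mean(clf.predict(X_valid) == valid_y)
print(top5_acc >= top1_acc)   # expect True on a healthy pipeline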

Now running it on my data:

TV_model(LogisticRegression(), X_train_counts, train_y_npar, X_test_counts, valid_y_df, valid_x_df)
Output:

Are Top 5 Results greater than Top 1 Result?:  False
Are Top 1 Results equal from predict() and predict_proba()?:  False

Details:

Top 5 Accuracy Rate (predict_proba) = 0.6581632653061225
Top 1 Accuracy Rate (predict_proba) = 0.201020481632653
Top 1 Accuracy Rate (predict) = 0.8091187478734263
Update: found the solution! Apparently the index gets thrown off at some point, so all I needed to do was reset the validation dataset's index after the train/test split.
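To see why the reset matters, here is a tiny hypothetical example (made-up values): train_test_split keeps the original row labels, while the dataframe built from the predict_proba output is freshly indexed 0..n-1, so an index-based merge pairs rows by label rather than by position and silently drops everything that doesn't match.

import pandas as pd

valid = pd.DataFrame({'label': ['x', 'y', 'z']}, index=[7, 2, 9])  # index left over from a split
top = pd.DataFrame({'pred': ['x', 'y', 'z']})                      # fresh 0..n-1 index

print(pd.merge(valid, top, left_index=True, right_index=True))
#   label pred
# 2     y    z    <- only index 2 exists in both frames; the row is mispaired and the rest vanish

print(pd.merge(valid.reset_index(drop=True), top, left_index=True, right_index=True))
#   label pred
# 0     x    x
# 1     y    y
# 2     z    z    <- after reset_index(drop=True) the rows line up again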

Updated code:

# Set up environment
from sklearn.datasets import fetch_20newsgroups
from sklearn.linear_model import LogisticRegression
from sklearn import metrics, model_selection
from sklearn.feature_extraction.text import CountVectorizer

import pandas as pd
import numpy as np

#Read in data and do just a bit of preprocessing

# User's Location of git repository
Git_Location = 'C:/Documents'

# Set Data Location:
data = Git_Location + '/Data.csv'

# load the data
df = pd.read_csv(data,low_memory=False,thousands=',', encoding='latin-1')
df = df[['CODE','Description']] #select only these columns
df = df.rename(index=float, columns={"CODE": "label", "Description": "text"})

#Convert label to float so you don't need to encode for processing later on
df['label'] = df['label'].str.replace('-', '', regex=True).str.strip()
df['label'] = df['label'].astype('float64')

# drop any labels with count LT 500 to build a strong model and make our testing run faster -- we will get more data later
df = df.groupby('label').filter(lambda x : len(x)>500)

#split data into testing and training
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(df.text, df.label,test_size=0.33, random_state=6,stratify=df.label)

#reset the index so the validation rows are numbered 0..n-1, matching predict_proba's row order
valid_y = valid_y.reset_index(drop=True)
valid_x = valid_x.reset_index(drop=True)

# materialize the text as plain lists for the vectorizer and the training labels as a numpy array
train_x_list = train_x.tolist()
valid_x_list = valid_x.tolist()
train_y_npar = train_y.to_numpy()

# cast validation datasets to dataframes to allow merging later on
valid_x_df = pd.DataFrame(valid_x)
valid_y_df = pd.DataFrame(valid_y)


# Extracting features from data
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(train_x_list)
X_test_counts = count_vect.transform(valid_x_list)

# Define the model training and validation function
def TV_model(classifier, feature_vector_train, label, feature_vector_valid, valid_y, valid_x, is_neural_net=False):

    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)

    # predict the top n labels on validation dataset
    n = 5
    #classifier.probability = True
    probas = classifier.predict_proba(feature_vector_valid)
    predictions = classifier.predict(feature_vector_valid)

    #Identify the indexes of the top predictions: argsort sorts ascending, so the last n columns are the n most probable classes (best one last)
    top_n_predictions = np.argsort(probas, axis = 1)[:,-n:]

    #then find the associated SOC code for each prediction
    top_class = classifier.classes_[top_n_predictions]

    #cast to a new dataframe
    top_class_df = pd.DataFrame(data=top_class)

    #merge it up with the validation labels and descriptions
    results = pd.merge(valid_y, valid_x, left_index=True, right_index=True)
    results = pd.merge(results, top_class_df, left_index=True, right_index=True)


    top5_conditions = [
        (results.iloc[:,0] == results[0]),
        (results.iloc[:,0] == results[1]),
        (results.iloc[:,0] == results[2]),
        (results.iloc[:,0] == results[3]),
        (results.iloc[:,0] == results[4])]
    top5_choices = [1, 1, 1, 1, 1]

    #Top 1 Result (column 4 holds the most probable class, since argsort puts the best class last)
    #top1_conditions = [(results['0_x'] == results[4])]
    top1_conditions = [(results.iloc[:,0] == results[4])]
    top1_choices = [1]

    # Create the success columns
    results['Top 5 Successes'] = np.select(top5_conditions, top5_choices, default=0)
    results['Top 1 Successes'] = np.select(top1_conditions, top1_choices, default=0)

    print("Are Top 5 Results greater than Top 1 Result?: ", (sum(results['Top 5 Successes'])/results.shape[0])>(metrics.accuracy_score(valid_y, predictions)))
   print("Are Top 1 Results equal from predict() and predict_proba()?: ", (sum(results['Top 1 Successes'])/results.shape[0])==(metrics.accuracy_score(valid_y, predictions)))

    print(" ")
    print("Details: ")
    print("Top 5 Accuracy Rate (predict_proba)= ", sum(results['Top 5 Successes'])/results.shape[0])
    print("Top 1 Accuracy Rate (predict_proba)= ", sum(results['Top 1 Successes'])/results.shape[0])
    print("Top 1 Accuracy Rate = (predict)=", metrics.accuracy_score(valid_y, predictions)) 

Comments:

Welcome to SO; please see ..., and why. Thanks! Just shortened it a bit... anyway, I'll do my best.

Hi 罗摩阿尔德. Good idea, but since the top_n_predictions array is sorted in ascending order, the last value has the highest probability. I did re-run the script with that change just to be sure, and the "Are Top 1 Results equal from predict() and predict_proba()?" check flipped to False on the data that had been working.

Ouch, sorry for the mistake; I'm deleting my comment.

Note: if your validation dataset has a conventional 0-to-n index, the original code listed in the question actually works. It only breaks when the dataset has been filtered or reordered, e.g. when it is derived from the train/test split function used above or when nulls have been filtered out. That is why we need to reset the index to start from 0.
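A quick numeric illustration of the ordering point made in that comment (toy values, not from the post): np.argsort orders ascending, so keeping the last columns keeps the most probable classes, with the single best class in the final column.

import numpy as np

p = np.array([[0.1, 0.6, 0.3]])   # probabilities for one sample over three classes
order = np.argsort(p, axis=1)     # ascending: [[0, 2, 1]]
print(order[:, -2:])              # [[2 1]] -- the final column (1) is the argmax
print(p.argmax(axis=1))           # [1], matching the final column above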