Python: How do I get the accuracy of the top N predictions with sklearn's SGDClassifier?
I am trying to modify the result from this post (How to get the top 3 or top N predictions using sklearn's SGDClassifier) so that it returns an accuracy rate, but the accuracy I get is zero and I cannot figure out why. Any ideas? Any thoughts or edits would be much appreciated. Thank you all!
Tags: python, scikit-learn, tf-idf
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from sklearn import linear_model
arr=['dogs cats lions','apple pineapple orange','water fire earth air', 'sodium potassium calcium']
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(arr)
feature_names = vectorizer.get_feature_names_out()  # get_feature_names() in scikit-learn < 1.0
Y = ['animals', 'fruits', 'elements','chemicals']
T=["eating apple roasted in fire and enjoying fresh air"]
test = vectorizer.transform(T)
clf = linear_model.SGDClassifier(loss='log_loss')  # spelled loss='log' in scikit-learn < 1.1
clf.fit(X,Y)
x=clf.predict(test)
def top_n_accuracy(probs, test, n):
    best_n = np.argsort(probs, axis=1)[:,-n:]
    ts = np.argmax(test, axis=1)
    successes = 0
    for i in range(ts.shape[0]):
        if ts[i] in best_n[i,:]:
            successes += 1
    return float(successes)/ts.shape[0]
n=2
probs = clf.predict_proba(test)
top_n_accuracy(probs, test, n)
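Here I introduce a vector of ground-truth labels. These are numeric indices, so you need to map ['elements', etc.] to [0, 1, 2, etc.]. I am assuming here that your test example belongs to elements.
y_true = np.array([2,1])
With that, you can compute your accuracy:
# topn is not defined in this answer; assumed to be the top-n column indices, as best_n in the question
topn = np.argsort(probs, axis=1)[:,-n:]
np.mean(np.array([1 if y_true[k] in topn[k] else 0 for k in range(len(topn))]))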
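The steps in this answer can be put together as a small runnable sketch. It is only an illustration under stated assumptions: `classes` stands in for `clf.classes_` (scikit-learn keeps class labels in sorted order, which is why 'elements' maps to index 2), `probs` is a made-up matrix standing in for `clf.predict_proba(test)`, and `labels_true` is a hypothetical pair of ground-truth labels. None of these values come from the original post.

```python
import numpy as np

# Stand-in for clf.classes_ (assumption: sorted class labels).
classes = np.array(['animals', 'chemicals', 'elements', 'fruits'])

# Stand-in for clf.predict_proba(test): one row per sample, one column
# per class. The numbers are invented for illustration.
probs = np.array([
    [0.10, 0.05, 0.50, 0.35],
    [0.60, 0.05, 0.10, 0.25],
])

# Hypothetical ground-truth labels for the two samples.
labels_true = ['elements', 'chemicals']

# Map each string label to its column index in probs.
class_to_index = {label: i for i, label in enumerate(classes)}
y_true = np.array([class_to_index[label] for label in labels_true])

n = 2
# Column indices of the n largest probabilities in each row.
topn = np.argsort(probs, axis=1)[:, -n:]

# Fraction of samples whose true index appears among the top n.
accuracy = np.mean(np.array([1 if y_true[k] in topn[k] else 0
                             for k in range(len(topn))]))
print(accuracy)
```

The key point is that the comparison uses the true class of each sample, not `np.argmax(test, axis=1)`, which only returns the position of the largest TF-IDF feature.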
I eventually figured this out, although my approach is a little different from the one above.
import numpy as np
import pandas as pd
from sklearn import metrics, model_selection
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Set data location
data = 'top10000.csv'
# Load the data
df = pd.read_csv(data, low_memory=False, thousands=',', encoding='latin-1')
df = df.dropna()
df = df[['CODE', 'DUTIES']]  # select only these columns
df = df.rename(columns={"CODE": "label", "DUTIES": "text"})
# Convert the label to an integer so it does not need encoding later on
df['label'] = df['label'].str.replace('-', '', regex=True, case=False).str.strip()
# regex=False: with regex=True, '.' is a wildcard and would match every character
df['label'] = df['label'].str.replace('.', '', regex=False)
df['label'] = df['label'].str[1:].astype(int)
# Split data into training and validation sets
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(
    df.text, df.label, test_size=0.33, random_state=6)
# Reset the index so the merges below line up row by row
valid_y = valid_y.reset_index(drop=True)
valid_x = valid_x.reset_index(drop=True)
# Also copy the validation data into dataframes so they can be merged later on
valid_x_df = pd.DataFrame(valid_x)
valid_y_df = pd.DataFrame(valid_y)
# Extract features
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(train_x)
X_test_counts = count_vect.transform(valid_x)

# Define the model training and validation function
def TV_model(classifier, feature_vector_train, label, feature_vector_valid, valid_y, valid_x, is_neural_net=False):
    # Fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)
    # Predict the top n labels on the validation dataset
    n = 5
    probas = classifier.predict_proba(feature_vector_valid)
    predictions = classifier.predict(feature_vector_valid)
    # Identify the column indexes of the top n predictions (ascending, so the best is last)
    top_n_predictions = np.argsort(probas, axis=1)[:, -n:]
    # Then find the associated SOC code for each prediction
    top_class = classifier.classes_[top_n_predictions]
    # Cast to a new dataframe (columns 0..4)
    top_class_df = pd.DataFrame(data=top_class)
    # Merge it with the validation labels and descriptions
    results = pd.merge(valid_y, valid_x, left_index=True, right_index=True)
    results = pd.merge(results, top_class_df, left_index=True, right_index=True)
    # Top 5 conditions and choices: the true label matches any of the five columns
    top5_conditions = [
        (results.iloc[:, 0] == results[0]),
        (results.iloc[:, 0] == results[1]),
        (results.iloc[:, 0] == results[2]),
        (results.iloc[:, 0] == results[3]),
        (results.iloc[:, 0] == results[4])]
    top5_choices = [1, 1, 1, 1, 1]
    # Top 1 result: column 4 holds the highest-probability class
    top1_conditions = [(results.iloc[:, 0] == results[4])]
    top1_choices = [1]
    # Create the success columns
    results['Top 5 Successes'] = np.select(top5_conditions, top5_choices, default=0)
    results['Top 1 Successes'] = np.select(top1_conditions, top1_choices, default=0)
    # Print the QA checks
    print("Are Top 5 Results greater than Top 1 Result? (answer must be True): ", (sum(results['Top 5 Successes'])/results.shape[0])>(metrics.accuracy_score(valid_y, predictions)))
    print("Are Top 1 Results equal from predict() and predict_proba()? (answer must be True): ", (sum(results['Top 1 Successes'])/results.shape[0])==(metrics.accuracy_score(valid_y, predictions)))
    print(" ")
    print("Details: ")
    print("Top 5 Accuracy Rate (predict_proba)= ", sum(results['Top 5 Successes'])/results.shape[0])
    #print("Top 5 Accuracy Rate (np.mean)= ", np.mean(np.array([1 if valid_y[k] in top_class[k] else 0 for k in range(len(top_class))])))
    print("Top 1 Accuracy Rate (predict_proba)= ", sum(results['Top 1 Successes'])/results.shape[0])
    print("Top 1 Accuracy Rate = (predict)", metrics.accuracy_score(valid_y, predictions))

# Train and validate the model on the example data using the function defined above
TV_model(LogisticRegression(), X_train_counts, train_y, X_test_counts, valid_y_df, valid_x_df)
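I am sure the one-liner would be more computationally efficient, so I would be grateful if you could suggest how to convert the accuracy calculation into the single line suggested in the comments above!

Comments:

I am struggling to understand what you are trying to do here. What does "accuracy" mean to you? Why are the predicted probabilities for each class not enough?

I need to be able to report my classifier's accuracy by saying, for example: if my test dataset has 100 cases and my classifier gets 80 of them right (meaning each of those cases has at least one match in the top n results), then the accuracy is 80%.

I guess my question is more: how do you decide what counts as a successful classification in your setup? Why do you need this top n?

Ah, I see. A classification is successful if at least one of the top n values matches. Does that answer your question? I am building a recommender, and my client only cares whether at least one of my suggestions fits the classification problem at hand.

OK, that makes sense. One more thing: to evaluate a model you need some ground-truth labels. Do you have them? In my view you should not use ts; you should have the information about which class T belongs to, and check whether this "true" label appears in the top n predictions.

Thanks for your answer! I do not quite follow the logic, though. Why do you compute the mean?

The mean counts the number of successful predictions (by your definition) divided by the sample size. So if 8 of your 10 predictions are correct, you get 80% accuracy.

Ha! Thank you @maximeKan! Much appreciated.

Hi @maximeKan, unfortunately this calculation gives me a result of 0. But the df option gives me the correct result…

@Statmonger, the code above works; I do not understand why it would return n... Is there something you did not expect?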
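The one-line calculation requested above can be sketched with NumPy broadcasting. This is a minimal illustration, not the method used by either poster: the helper `top_n_accuracy_from_labels` and the class codes 10, 20, 30 are hypothetical, and `probas` is invented data standing in for `classifier.predict_proba(...)`. It compares label values with label values. One possible reason for a result of 0 is comparing true labels against column indices instead, though the thread does not confirm that this is what happened.

```python
import numpy as np

def top_n_accuracy_from_labels(probas, classes, y_true, n):
    """Share of rows whose true label is among the n most probable classes."""
    # Column indices of the n largest probabilities in each row.
    top_idx = np.argsort(probas, axis=1)[:, -n:]
    # Translate column indices into class labels (as classifier.classes_ would).
    top_labels = np.asarray(classes)[top_idx]
    # np.asarray drops any pandas index, so rows are compared by position.
    y_true = np.asarray(y_true)
    # Broadcast each true label against its row of top-n labels.
    return (top_labels == y_true[:, None]).any(axis=1).mean()

# Hypothetical class codes and made-up probabilities for four samples.
classes = np.array([10, 20, 30])
probas = np.array([
    [0.7, 0.2, 0.1],
    [0.5, 0.4, 0.1],
    [0.1, 0.3, 0.6],
    [0.6, 0.1, 0.3],
])
y_true = [10, 30, 20, 20]

top1 = top_n_accuracy_from_labels(probas, classes, y_true, n=1)
top2 = top_n_accuracy_from_labels(probas, classes, y_true, n=2)
print(top1, top2)
```

Recent scikit-learn releases (0.24 and later) also ship `sklearn.metrics.top_k_accuracy_score`, which may be worth checking before maintaining a hand-written version.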