Python: What causes overfitting in my algorithm?

python machine-learning


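I am trying to figure out what is causing the overfitting. So far I have:

Tried changing the length of the training dataset
Removed some unneeded fields
Tried adding a validation block

Dataset: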

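There can be several reasons why the model overfits, and several ways to debug and fix it. It is hard to tell from the code alone, because it also depends on the data, but here are some common causes and fixes:

The dataset is too small: adding more data is a common fix for overfitting.
The model is too complex: if you have many features, or complex polynomial features, try feature selection to reduce complexity.
Add regularization: I don't see any regularization in your code, so try adding it.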
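To make the regularization suggestion concrete, here is a minimal NumPy sketch of logistic regression trained by gradient descent with an optional L2 penalty. It is not the asker's TensorFlow model: the function name `train_logistic`, the synthetic data, and the penalty strength `l2=0.1` are all assumptions chosen for illustration. The data is deliberately small (60 rows, 20 features, only one of which carries signal), a setting in which an unpenalized model tends to grow large weights.

```python
import numpy as np

def train_logistic(X, y, l2=0.0, lr=0.1, epochs=500):
    """Fit logistic regression by gradient descent with an optional L2 penalty."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        z = X @ w + b
        p = 1.0 / (1.0 + np.exp(-z))                 # sigmoid probabilities
        grad_w = X.T @ (p - y) / n_samples + l2 * w  # the L2 term pulls weights toward zero
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Synthetic data (an assumption for illustration): few rows, many features.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))
y = (X[:, 0] > 0).astype(float)   # only feature 0 is informative

w_plain, _ = train_logistic(X, y, l2=0.0)
w_reg, _ = train_logistic(X, y, l2=0.1)
print("weight norm without L2:", np.linalg.norm(w_plain))
print("weight norm with L2:   ", np.linalg.norm(w_reg))
```

The penalized run should end with a smaller weight norm. Whether that improves held-out accuracy on a real dataset has to be checked on a validation split.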

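Welcome. I have edited your post to make it more readable (formatting) and removed material that does not belong here. Please take a look, then click the "edited…ago" link above my avatar (or that of whoever last edited your post) to view the edit history, so you can see what was removed or changed (and hopefully learn from it). I don't know your topic, but a better post always increases the chance that someone will answer.

Thanks @Anthon. Thank you.

Your network seems to be very small? A learning rate of 0.05 may be on the high side. Have you tried plotting the training/validation loss to look at the curves? Why a batch size of 43?

We need you to provide proper details. Show us the evidence of overfitting and the history of the training statistics (loss, accuracy, etc.). Asking us to desk-check your code is not the way to attract replies. :-)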
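The comments ask for evidence in the form of training and validation loss histories. As a rough sketch of what to look for, the snippet below compares two loss lists epoch by epoch. The helper names `best_epoch` and `overfitting_gap` are hypothetical, and the loss values are made up for illustration; they are not results from the posted code.

```python
def best_epoch(val_losses):
    """Return the 0-based index of the epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=lambda i: val_losses[i])

def overfitting_gap(train_losses, val_losses):
    """Return the per-epoch gap: validation loss minus training loss."""
    return [v - t for t, v in zip(train_losses, val_losses)]

# Hypothetical loss history, invented for illustration only.
train_losses = [0.70, 0.52, 0.40, 0.31, 0.24, 0.18]
val_losses = [0.72, 0.58, 0.50, 0.49, 0.53, 0.60]

gaps = overfitting_gap(train_losses, val_losses)
stop_at = best_epoch(val_losses)
print("best validation epoch:", stop_at)
print("gap at last epoch:", round(gaps[-1], 2))
```

A training loss that keeps falling while the validation loss turns upward, with the gap widening, is the usual signature of overfitting. Plotting both lists with `plt.plot` shows the same pattern.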

import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split


#reproducible random seed
seed = 1
np.random.seed(seed)

#Import and normalize the data
df = pd.read_csv('creditcard.csv')


#Exploring the data

# print(df.head())
# print(df.describe())
# print(df.isnull().sum())


# count_class = pd.value_counts(df['Class'])
# count_class.plot(kind = 'bar')
# plt.title('Fraud class histogram')
# plt.xlabel('class')
# plt.ylabel('Frequency')
# plt.show()

# print('Clearly the data is totally unbalanced!')

#to normalize the amount column
# data['normAmount'] = StandardScaler().fit_transform(data['Amount'].reshape(-1, 1))
df['normAmount'] = StandardScaler().fit_transform(df['Amount'].values.reshape(-1, 1))
df = df.drop(['Time','V28','V27','V26','V25','V24','V23','V22','V20','V15','V13','V8','Amount'], axis =1)
X = df.iloc[:,df.columns!='Class']
Y = df.iloc[:,df.columns=='Class']

# number of records in the minority class
number_record_fraud = len(df[df.Class==1])
fraud_indices = np.array(df[df.Class==1].index)

#picking normal class
normal_indices = np.array(df[df.Class==0].index)

#select random x(number_record_fraud) numbers from normal_indices
random_normal_indices = np.random.choice(normal_indices,number_record_fraud,replace=False)
random_normal_indices = np.array(random_normal_indices)

#under sample data
under_sample_indices = np.concatenate([fraud_indices,random_normal_indices])
under_sample_data = df.iloc[under_sample_indices,:]

X_undersample = under_sample_data.iloc[:,under_sample_data.columns!='Class']
Y_undersample = under_sample_data.iloc[:,under_sample_data.columns=='Class']

# split data into train and test dataset
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size = 0.3)
X_train_undersample,X_test_undersample,Y_train_undersample,Y_test_undersample = train_test_split(X_undersample,Y_undersample,test_size=0.3)

#parameters
learning_rate = 0.05
training_epoch = 10
batch_size = 43
display_step = 1

#tf graph input
x = tf.placeholder(tf.float32,[None,18])
y = tf.placeholder(tf.float32,[None,1])

#set model weights
w = tf.Variable(tf.zeros([18,1]))
b = tf.Variable(tf.zeros([1]))

#construct model
pred = tf.nn.softmax(tf.matmul(x,w) + b) #softmax activation

#minimize error using cross entropy
cost = tf.reduce_mean(-tf.reduce_sum(y*tf.log(pred),reduction_indices=1))
#Adam optimizer
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)

#initializing variables
init = tf.global_variables_initializer()

#launch the graph
with tf.Session() as sess:
    sess.run(init)

    #training cycle
    for epoch in range(training_epoch):
        total_batch = len(X_train_undersample) // batch_size  # integer division so range() accepts it
        avg_cost = 0
        #loop over all the batches
        for batch in range(total_batch):
            batch_xs = X_train.iloc[(batch)*batch_size:(batch+1) *batch_size]
            batch_ys = Y_train.iloc[(batch)*batch_size:(batch+1) *batch_size]
            # run optimizer and cost operation
            _,c= sess.run([optimizer,cost],feed_dict={x:batch_xs,y:batch_ys})
            avg_cost += c/total_batch


        correct_prediction = tf.equal(tf.argmax(pred,1),tf.argmax(y,1))
        accuracy = tf.reduce_mean(tf.cast(correct_prediction,tf.float32))

        #display log per epoch step
        if (epoch+1) % display_step == 0:
            train_accuracy, newCost = sess.run([accuracy, cost], feed_dict={x: X_test,y: Y_test})
            print("test_set_accuracy:", accuracy.eval({x:X_test_undersample,y:Y_test_undersample})*100)
            print("whole_set_accuracy:", accuracy.eval({x:X,y:Y})*100)
            # print(train_accuracy)
            # print("cost", newCost)
            print()

    print('optimization finished.')
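One detail in the listing above may be worth checking before looking for overfitting: `pred` applies `tf.nn.softmax` to a single output column, and the accuracy compares `tf.argmax` over that one column for both `pred` and `y`. The plain-Python sketch below (the helper names are illustrative, not from the original code) shows that softmax of a single logit is always 1 and that the argmax of a one-element row is always 0. If that applies here, the reported accuracy would be 100% regardless of the weights. This is an observation about the posted code, not a confirmed diagnosis of the asker's results.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logit):
    """Logistic function, the usual activation for a single binary output."""
    return 1.0 / (1.0 + math.exp(-logit))

# With one output unit, softmax normalizes a single value against itself.
for logit in (-5.0, 0.0, 5.0):
    print(logit, softmax([logit]), sigmoid(logit))

# argmax over a one-element row can only ever return index 0.
single_column_outputs = [softmax([v]) for v in (-5.0, 0.0, 5.0)]
argmaxes = [row.index(max(row)) for row in single_column_outputs]
print("argmax per row:", argmaxes)
```

Two common alternatives for a binary label are a sigmoid output with a binary cross-entropy loss, or a two-column one-hot label with softmax. Which one suits this dataset is left to the asker.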