How to read a CSV file and train on its data with softmax regression in TensorFlow
I have just started learning TensorFlow and have run into a problem with training on my data. My task is to read a CSV file and then use softmax classification to estimate a student's grade (A, B, or C) from the student's study time and class attendance. I define the CSV columns and load the file as follows:
import pandas as pd
import tensorflow as tf

COLUMNS = ["studytime", "attendance", "A", "B", "C"]
FEATURES = ["studytime", "attendance"]
LABEL = ["A", "B", "C"]
training_set = pd.read_csv("hw1.csv", skipinitialspace=True,
                           skiprows=1, names=COLUMNS)
After that, I defined the columns for the features and the labels as follows:
feature_cols = [tf.contrib.layers.real_valued_column(k) for k in FEATURES]
labels = [tf.contrib.layers.real_valued_column(k) for k in LABEL]
Then I followed the approach for training softmax on the MNIST data described here:
But I don't know how to define batch_xs and batch_ys for training in this loop:
for _ in range(1000):
    batch_xs=????
    batch_ys=????
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
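How can I define a function that estimates the grades of three students whose study time and attendance are, for example, [11, 7], [3, 4], and [1, 0]?
Could you help me solve this problem?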
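Thanks in advance.
It looks like you are reading the CSV into a DataFrame? You can certainly implement the batching by hand that way, but TF has an efficient built-in way to build queues and batches. It is a bit involved, but it serves rows either sequentially or randomly shuffled, which is very convenient. Just make sure all rows have the same length, so that you can easily specify which cells represent the Xs and which represent the Ys. The two functions you need for this are tf.decode_csv and tf.train.shuffle_batch (or tf.train.batch, if you don't need random shuffling).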
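We discussed this in detail in this post, which includes a complete working code example:
It looks like your data is all numeric and the Ys are in one-hot format, so the MNIST example should be useful for implementing the estimation function.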
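For the manual route mentioned at the start of this answer, here is a minimal sketch of a helper that plays the role of mnist.train.next_batch() for a small CSV. It is plain Python rather than TensorFlow, and the names load_rows, next_batch, and CSV_TEXT are made up for illustration. It also assumes a comma-separated file with only the five numeric columns from the question.

```python
import csv
import io
import random

# Hypothetical stand-in for hw1.csv: studytime, attendance, then one-hot A, B, C.
CSV_TEXT = """studytime,attendance,A,B,C
2,1,0,1,0
3,2,1,0,0
4,3,0,0,1
3,5,0,0,1
4,4,0,1,0
2,1,1,0,0
"""

def load_rows(text):
    """Parse CSV text into (features, one_hot_label) pairs of floats."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header line
    rows = []
    for cells in reader:
        values = [float(c) for c in cells]
        rows.append((values[:2], values[2:]))  # 2 features, 3 label cells
    return rows

def next_batch(rows, batch_size, rng=random):
    """Return batch_size randomly chosen rows as (batch_xs, batch_ys)."""
    picked = rng.sample(rows, batch_size)
    batch_xs = [features for features, _ in picked]
    batch_ys = [label for _, label in picked]
    return batch_xs, batch_ys

rows = load_rows(CSV_TEXT)
batch_xs, batch_ys = next_batch(rows, 3)
```

The two returned lists have the shapes the placeholders in the question expect ([batch, 2] and [batch, 3]), so they could be passed in feed_dict as x and y_.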
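*** UPDATE:
This is roughly the order of operations:
1. Define the two functions shown in the linked example: one that reads the CSV file line by line, and another that packs those lines into batches of size N (shuffled or sequential).
2. Start the reading loop with while not coord.should_stop():
This loop runs until the contents of all the CSV files you fed to the queue are exhausted.
3. On each iteration of the loop, calling sess.run on these variables gives you a batch of Xs and Ys, plus any extra metadata you may need from each row of the CSV file, such as the date label in this example (in your case it might be the student name or something else):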
dateLbl_batch, feature_batch, label_batch = sess.run([dateLbl, features, labels])
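When TF reaches the end of the file it throws an exception, which is why all of the code above sits inside a try/except block: by catching that exception, you know you are done.
The functions above give you cell-by-cell access to the CSV file and let you batch it into batches of size N, for the desired number of epochs, and so on.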
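The "loop until an exception tells you the data is exhausted" pattern from the steps above can be sketched without TensorFlow. This is only an illustration: BatchQueue and OutOfRange are hypothetical stand-ins for the TF queue and tf.errors.OutOfRangeError, not real TF classes.

```python
class OutOfRange(Exception):
    """Hypothetical stand-in for tf.errors.OutOfRangeError."""

class BatchQueue:
    """Hypothetical queue that serves rows in batches for a fixed number of epochs."""

    def __init__(self, rows, batch_size, num_epochs):
        self._pending = list(rows) * num_epochs  # every row, once per epoch
        self._batch_size = batch_size

    def dequeue(self):
        if not self._pending:
            raise OutOfRange("no more rows")  # end of data is signalled by an exception
        batch = self._pending[:self._batch_size]
        self._pending = self._pending[self._batch_size:]
        return batch

queue = BatchQueue(rows=["S1", "S2", "S3", "S4", "S5", "S6"],
                   batch_size=3, num_epochs=1)
batches = []
finished = False
try:
    while True:
        batches.append(queue.dequeue())  # in TF this would be sess.run([...])
except OutOfRange:
    finished = True  # reaching this handler means the data is exhausted
```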
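*** UPDATE 2:
Here is the complete code that reads a CSV file in the format you have, in batches. It simply prints the contents of each batch. From here you can easily connect this code to the code that actually does the training, etc.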
import tensorflow as tf

fileName = 'data/study.csv'
try_epochs = 1
batch_size = 3
S = 1  # number of cells holding the student label
F = 2  # number of cells holding the features
L = 3  # number of cells holding the one-hot label vector

# set defaults to something (TF requires a default for every cell you are going to read)
rDefaults = [['a'] for row in range(S + F + L)]
# function that reads the input file, line by line
def read_from_csv(filename_queue):
    reader = tf.TextLineReader(skip_header_lines=1)  # skip the header line
    _, csv_row = reader.read(filename_queue)  # read one line
    data = tf.decode_csv(csv_row, record_defaults=rDefaults)  # use defaults for this line (in case of missing data)
    studentLbl = tf.slice(data, [0], [S])  # the first cell is the student label
    features = tf.string_to_number(tf.slice(data, [S], [F]), tf.float32)  # the next F cells are the features
    label = tf.string_to_number(tf.slice(data, [S+F], [L]), tf.float32)  # the remaining L cells are the one-hot label
    return studentLbl, features, label
# function that packs each read line into batches of the specified size
def input_pipeline(fName, batch_size, num_epochs=None):
    filename_queue = tf.train.string_input_producer(
        [fName],
        num_epochs=num_epochs,
        shuffle=True)  # this refers to multiple files, not line items within files
    dateLbl, features, label = read_from_csv(filename_queue)
    min_after_dequeue = 10000  # min of where to start loading into memory
    capacity = min_after_dequeue + 3 * batch_size  # max of how much to load into memory
    # this packs the above lines into a batch of the size you specify:
    dateLbl_batch, feature_batch, label_batch = tf.train.shuffle_batch(
        [dateLbl, features, label],
        batch_size=batch_size,
        capacity=capacity,
        min_after_dequeue=min_after_dequeue)
    return dateLbl_batch, feature_batch, label_batch
# these are the student label, features, and label:
studentLbl, features, labels = input_pipeline(fileName, batch_size, try_epochs)

with tf.Session() as sess:
    gInit = tf.global_variables_initializer().run()
    lInit = tf.local_variables_initializer().run()
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    try:
        while not coord.should_stop():
            # load student-label, features, and label as a batch:
            studentLbl_batch, feature_batch, label_batch = sess.run([studentLbl, features, labels])
            print(studentLbl_batch)
            print(feature_batch)
            print(label_batch)
            print('----------')
    except tf.errors.OutOfRangeError:
        print("Done looping through the file")
    finally:
        coord.request_stop()
    coord.join(threads)
Suppose your CSV file looks like this:
name studytime attendance A B C
S1 2 1 0 1 0
S2 3 2 1 0 0
S3 4 3 0 0 1
S4 3 5 0 0 1
S5 4 4 0 1 0
S6 2 1 1 0 0
The code above should print the following output:
[[b'S5']
[b'S6']
[b'S3']]
[[ 4. 4.]
[ 2. 1.]
[ 4. 3.]]
[[ 0. 1. 0.]
[ 1. 0. 0.]
[ 0. 0. 1.]]
----------
[[b'S2']
[b'S1']
[b'S4']]
[[ 3. 2.]
[ 2. 1.]
[ 3. 5.]]
[[ 1. 0. 0.]
[ 0. 1. 0.]
[ 0. 0. 1.]]
----------
Done looping through the file
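To make the slicing in read_from_csv concrete, here is a plain-Python sketch of what happens to a single row. It assumes the cells are comma-separated and reuses the S, F, L cell counts from the code above. split_row is a hypothetical helper, not part of TensorFlow.

```python
S, F, L = 1, 2, 3  # cell counts: student label, features, one-hot label

def split_row(line):
    """Split one CSV line into (student label, features, one-hot label)."""
    cells = line.split(",")
    student = cells[:S]                                  # like tf.slice(data, [0], [S])
    features = [float(c) for c in cells[S:S + F]]        # like tf.slice(data, [S], [F])
    label = [float(c) for c in cells[S + F:S + F + L]]   # like tf.slice(data, [S+F], [L])
    return student, features, label

student, features, label = split_row("S5,4,4,0,1,0")
```

For the S5 row this gives the same three pieces that appear for S5 in the printed batch above.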
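So, instead of printing the contents of the batches, you can simply use them as the X and Y for training in the feed_dict.
Here is my attempt. But the accuracy is not as high as I expected.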
import tensorflow as tf

fileName = 'hw1.csv'
try_epochs = 1
batch_size = 8
S = 1  # number of cells holding the student label
F = 2  # number of cells holding the features
L = 3  # number of cells holding the one-hot label vector

# set defaults to something (TF requires a default for every cell you are going to read)
rDefaults = [['a'] for row in range(S + F + L)]
# function that reads the input file, line by line
def read_from_csv(filename_queue):
    reader = tf.TextLineReader(skip_header_lines=1)  # skip the header line
    _, csv_row = reader.read(filename_queue)  # read one line
    data = tf.decode_csv(csv_row, record_defaults=rDefaults)  # use defaults for this line (in case of missing data)
    studentLbl = tf.slice(data, [0], [S])  # the first cell is the student label
    features = tf.string_to_number(tf.slice(data, [S], [F]), tf.float32)  # the next F cells are the features
    label = tf.string_to_number(tf.slice(data, [S+F], [L]), tf.float32)  # the remaining L cells are the one-hot label
    return studentLbl, features, label
# function that packs each read line into batches of the specified size
def input_pipeline(fName, batch_size, num_epochs=None):
    filename_queue = tf.train.string_input_producer(
        [fName],
        num_epochs=num_epochs,
        shuffle=True)  # this refers to multiple files, not line items within files
    dateLbl, features, label = read_from_csv(filename_queue)
    min_after_dequeue = 10000  # min of where to start loading into memory
    capacity = min_after_dequeue + 3 * batch_size  # max of how much to load into memory
    # this packs the above lines into a batch of the size you specify:
    dateLbl_batch, feature_batch, label_batch = tf.train.shuffle_batch(
        [dateLbl, features, label],
        batch_size=batch_size,
        capacity=capacity,
        min_after_dequeue=min_after_dequeue)
    return dateLbl_batch, feature_batch, label_batch

# these are the student label, features, and label:
studentLbl, features, labels = input_pipeline(fileName, batch_size,
                                              try_epochs)
x = tf.placeholder(tf.float32, [None, 2])
W = tf.Variable(tf.zeros([2, 3]))
b = tf.Variable(tf.zeros([3]))
y = tf.nn.softmax(tf.matmul(x, W) + b)
y_ = tf.placeholder(tf.float32, [None, 3])
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_,logits=y))
train_step = tf.train.GradientDescentOptimizer(0.1).minimize(cross_entropy)
sess = tf.InteractiveSession()
tf.global_variables_initializer().run()
with tf.Session() as sess:
    gInit = tf.global_variables_initializer().run()
    lInit = tf.local_variables_initializer().run()
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    try:
        while not coord.should_stop():
            # load student-label, features, and label as a batch:
            studentLbl_batch, feature_batch, label_batch = sess.run([studentLbl, features, labels])
            print(studentLbl_batch)
            print(feature_batch)
            print(label_batch)
            print('----------')
            batch_xs = feature_batch
            batch_ys = label_batch
            sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})  # feeding data
    except tf.errors.OutOfRangeError:
        print("Done looping through the file")
    finally:
        coord.request_stop()
    coord.join(threads)
    correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    print(sess.run(accuracy, feed_dict={x: feature_batch, y_: label_batch}))
    print(sess.run(W))
    print(sess.run(b))
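One thing worth checking in this attempt: tf.nn.softmax_cross_entropy_with_logits applies softmax internally and expects unscaled logits, but the code passes y, which has already been through tf.nn.softmax. Whether this explains the accuracy here is not certain, since with try_epochs = 1 the model also sees each row only once. The plain-Python sketch below, with made-up logits, only shows how applying softmax twice flattens the probabilities and inflates the loss.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(one_hot, probs):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

logits = [4.0, 0.0, 0.0]   # made-up scores that strongly favour class A
target = [1.0, 0.0, 0.0]   # the true class is A

# softmax applied once, as the loss function expects
loss_single = cross_entropy(target, softmax(logits))
# softmax applied twice, as happens when y is passed as logits
loss_double = cross_entropy(target, softmax(softmax(logits)))
```

Here loss_double is much larger than loss_single even though the prediction is confidently correct. In the TF code, the usual way to avoid this is to keep tf.matmul(x, W) + b in its own variable and pass that as logits.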
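Accuracy:
0.375
W, b:
[[ 0.00555556 0.00972222 -0.01527778] [ 0.00555556 0.01388889 -0.01944444]]
[-0.00277778 0.00138889 0.00138889]
Thank you for the suggestion, Mr. VS_FF. I just read the post you mentioned and understood some of the points, but it actually seems complicated to me. You said I need to use tf.train.batch, but it is still unclear to me how that applies to defining batch_xs and batch_ys. Could you explain it more clearly? In the MNIST example they use mnist.train.next_batch(), but I don't know how to adapt that to my case.
I updated the original answer to give you an overview of what the code does. The MNIST example does this differently, but I find this approach particularly suitable for reading CSV files, especially for random batches, and especially if you have several CSV files that you want to shuffle together.
Sir, thank you for the update. I just read the code you mentioned. I ran it directly, changing only the file name and setting TS=2. However, your code could not read the CSV file in my case. I found that it runs fine until it reaches the last line: coord.join(threads). It produces errors such as: "StringToNumberOp could not correctly convert string". Could you help me figure it out?
One thing mentioned in that post is that in my case each row contains both strings and numbers. Because you need to provide a default value for every cell in a row so that TF can read the row, it is easier to provide the defaults as strings and then convert the necessary cells to numbers. It does not work the other way around (i.e. if you provide the defaults as floats but have some string cells, TF throws an error). So if all your data is numeric, you can skip this whole logic and read it as floats or ints. As for your error, I suspect you get it because you hit some kind of EOF, or you have characters that cannot be recognized, or possibly an encoding issue. As I said, this process is a bit crude in that it signals EOF as an exception you have to handle. Maybe it has to do with that? It's hard to say without knowing the contents of your CSV.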
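For the estimation function asked about in the question, here is a minimal plain-Python sketch that applies softmax(xW + b) to the three example students, using the W and b values reported in this thread. It assumes the columns of W follow the A, B, C order of the one-hot labels. predict_grade is a hypothetical helper. In TF the equivalent would be to run y (or tf.argmax(y, 1)) with the students fed in as x.

```python
import math

GRADES = ["A", "B", "C"]  # assumed to match the one-hot column order

# W and b as printed in the thread; rows of W are studytime and attendance.
W = [[0.00555556, 0.00972222, -0.01527778],
     [0.00555556, 0.01388889, -0.01944444]]
b = [-0.00277778, 0.00138889, 0.00138889]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_grade(features):
    """Return (grade, probabilities) for one [studytime, attendance] pair."""
    logits = [sum(x * W[i][k] for i, x in enumerate(features)) + b[k]
              for k in range(len(GRADES))]
    probs = softmax(logits)
    best = max(range(len(GRADES)), key=lambda k: probs[k])
    return GRADES[best], probs

students = [[11, 7], [3, 4], [1, 0]]
predictions = [predict_grade(s)[0] for s in students]
```

With these weights all three students come out as B, which is consistent with a model that has taken only a few small gradient steps.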