Multivariate linear regression with TensorFlow in Python


python machine-learning tensorflow


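I am reusing some TensorFlow code for multivariate linear regression and trying to reduce the cost. The problem is that after a few iterations the cost, as well as the values of W and b, become inf and shortly afterwards nan. Can someone tell me where the problem is? I have about 100,000 values, which I have trimmed to 10,000 for testing. The dataset is [link missing].

Here is the code: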

import numpy as np
import tensorflow as tf



def computeX():

    all_xs = np.loadtxt("test.csv", delimiter=',', skiprows=1, usecols=range(4,260)) # reads columns 4..259 (256 feature columns)


    timestamps = np.loadtxt("test.csv", delimiter=',', skiprows=1, usecols=(0),dtype =str)
    symbols = np.loadtxt("test.csv", delimiter=',', skiprows=1, usecols=(1),dtype =float)
    categories = np.loadtxt("test.csv", delimiter=',', skiprows=1, usecols=(2),dtype =str)

    tempList = []
    BOW = {"M1": 1.0, "M5": 2.0, "M15": 3.0, "M30": 4.0, "H1": 5.0, "H4": 6.0, "D1": 7.0}

    #explode dates and make them features.. 2016/11/1 01:54 becomes [2016, 11, 1, 01, 54]
    for i, v in enumerate(timestamps):
        splitted = v.split()
        dateVal = splitted[0]
        timeVal = splitted[1]
        ar = dateVal.split("/")
        splittedTime = timeVal.split(":")

        ar = ar + splittedTime

        Features = np.asarray(ar)
        Features = Features.astype(float)

        # append symbols

        Features = np.append(Features,symbols[i])

        #append categories from BOW

        Features = np.append(Features, BOW[categories[i]] )
        row = np.append(Features,all_xs[i])
        row = row.tolist()
        tempList.append(row)

    all_xs = np.array(tempList)
    del tempList[:]
    return all_xs


if __name__ == "__main__":
    print ("Starting....")


    learn_rate = 0.5

    # reads column index 3 (the fourth column) as the target
    all_ys = np.loadtxt("test.csv", delimiter=',', skiprows=1, usecols=3)

    all_xs = computeX()

    datapoint_size= int(all_xs.shape[0])

    print(datapoint_size)
    x = tf.placeholder(tf.float32, [None, 263], name="x")
    W = tf.Variable(tf.ones([263,1]), name="W")
    b = tf.Variable(tf.ones([1]), name="b")

    product = tf.matmul(x,W)
    y = product + b

    y_ = tf.placeholder(tf.float32, [datapoint_size])

    cost = tf.reduce_mean(tf.square(y_-y))/ (2*datapoint_size)

    train_step = tf.train.GradientDescentOptimizer(learn_rate).minimize(cost)

    sess = tf.Session()


    init = tf.global_variables_initializer()
    sess.run(init)

    batch_size = 10000
    steps =10
    for i in range(steps):
      print("Entering Loop")
      if datapoint_size == batch_size:
         batch_start_idx = 0
      elif datapoint_size < batch_size:
         raise ValueError("datapoint_size: %d, must be greater than batch_size: %d" % (datapoint_size, batch_size))
      else:
         batch_start_idx = (i * batch_size) % (datapoint_size - batch_size)
      batch_end_idx = batch_start_idx + batch_size
      batch_xs = all_xs[batch_start_idx:batch_end_idx]
      batch_ys = all_ys[batch_start_idx:batch_end_idx]
      xs = np.array(batch_xs)
      ys = np.array(batch_ys)

      feed = { x: xs, y_: ys }

      sess.run(train_step, feed_dict=feed)  
      print("W: %s" % sess.run(W))
      print("b: %f" % sess.run(b))
      print("cost: %f" % sess.run(cost, feed_dict=feed))


Look at your data:

id8         id9         id10    id11    id12
1451865600  1451865600  -19.8   87.1    0.5701
1451865600  1451865600  -1.6    3.6     0.57192
1451865600  1451865600  -5.3    23.9    0.57155

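You also initialize the weights to ones. If you multiply every input by one and sum the results, the "heavy" columns (id8, id9 and so on, the ones with large numbers) crowd out the data from the smaller columns. There are also columns filled with zeros: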

id236   id237   id238   id239   id240
0       0       0       0       0
0       0       0       0       0
0       0       0       0       0

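None of this plays well together. The large values lead to very large predictions, which make the loss explode and overflow. Even reducing your learning rate by a factor of a billion has almost no effect.

So the suggestions are:

Check your data and remove everything that carries no information (columns filled with zeros)
Normalize your input data
Check the magnitude of the loss function at that point, and only then experiment with the learning rate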
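The overflow can be reproduced on a much smaller problem. The following is only a sketch, not the code from the question: it assumes a single feature, plain mean squared error, hand-written gradient descent in pure Python, and made-up values (`raw_xs`, `ys`) of roughly timestamp size. The helper names `gradient_descent` and `standardize` are hypothetical.

```python
import math

def gradient_descent(xs, ys, learn_rate, steps):
    """Batch gradient descent on mean squared error for y = w * x + b."""
    w, b = 1.0, 1.0  # all-ones start, as in the question
    n = len(xs)
    costs = []
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        costs.append(sum(e * e for e in errors) / n)
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        w -= learn_rate * grad_w
        b -= learn_rate * grad_b
    return w, b, costs

def standardize(xs):
    """Shift values to zero mean and scale them to unit variance."""
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) * (x - mean) for x in xs) / len(xs))
    return [(x - mean) / std for x in xs]

# Hypothetical data: timestamp-sized inputs and small targets.
raw_xs = [1451865600.0, 1451865700.0, 1451865800.0]
ys = [0.5, 0.7, 0.6]

# Raw inputs with the learning rate from the question.
w_raw, _, costs_raw = gradient_descent(raw_xs, ys, learn_rate=0.5, steps=100)
# Raw inputs with a learning rate one billion times smaller.
w_tiny, _, _ = gradient_descent(raw_xs, ys, learn_rate=0.5e-9, steps=100)
# Standardized inputs with the original learning rate.
w_std, _, costs_std = gradient_descent(standardize(raw_xs), ys, learn_rate=0.5, steps=100)

print("raw inputs, lr=0.5:    w =", w_raw)
print("raw inputs, lr=0.5e-9: w =", w_tiny)
print("standardized, lr=0.5:  w =", w_std, "final cost =", costs_std[-1])
```

In this toy setting each update multiplies `w` by roughly the learning rate times the mean of `x * x`, which is on the order of 1e18 for raw timestamps, so shrinking the learning rate by 1e9 still leaves a growth factor far above one.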
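The first two suggestions could be sketched with NumPy roughly as follows. This is an illustration under assumptions, not the asker's pipeline: the small `all_xs` array is a made-up stand-in for the real feature matrix, and the helper names `drop_constant_columns` and `standardize` are hypothetical.

```python
import numpy as np

def drop_constant_columns(xs):
    """Remove columns whose value never changes (for example all zeros)."""
    keep = xs.max(axis=0) != xs.min(axis=0)
    return xs[:, keep], keep

def standardize(xs):
    """Scale every column to zero mean and unit variance."""
    mean = xs.mean(axis=0)
    std = xs.std(axis=0)
    return (xs - mean) / std, mean, std

# Hypothetical stand-in for all_xs: one timestamp-sized column,
# two small columns and one column filled with zeros.
all_xs = np.array([
    [1451865600.0, -19.8, 87.1, 0.0],
    [1451865700.0, -1.6, 3.6, 0.0],
    [1451865800.0, -5.3, 23.9, 0.0],
])

xs_kept, keep_mask = drop_constant_columns(all_xs)
xs_scaled, col_mean, col_std = standardize(xs_kept)

print("columns kept:", keep_mask)
print("column means after scaling:", xs_scaled.mean(axis=0))
print("column stds after scaling:", xs_scaled.std(axis=0))
```

Constant columns are dropped first because their standard deviation is zero and would cause a division by zero during scaling. After dropping columns, the `[None, 263]` placeholder and the `[263, 1]` shape of `W` in the question would have to match the new column count, and the stored `col_mean` and `col_std` would need to be applied to any data used for prediction later.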
