Python: least-squares regression on a 2D array


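The numpy.linalg.lstsq(a, b) function accepts an array a of a given size and a one-dimensional array b, which is the dependent variable.

How would I perform a least-squares regression when the data points are given as a 2D array generated from an image file? The array looks like this: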

[[0, 0, 0, 0, e]
 [0, 0, c, d, 0]
 [b, a, f, 0, 0]]

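where a, b, c, d, e, f are positive integer values.

I want to fit a line to these points. Can I use np.linalg.lstsq (and if so, how), or is there a more sensible approach (and if so, how)?

Thanks very much.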
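One possible way to use np.linalg.lstsq here, as a minimal sketch: pull the coordinates of the nonzero pixels out of the array and regress the row index on the column index. The 3x5 image and its pixel values below are made up for illustration, and treating every nonzero pixel as one equally weighted data point is an assumption, not something the question states.

```python
import numpy as np

# Hypothetical image; the nonzero entries stand in for a..f in the question.
image = np.array([[0, 0, 0, 0, 9],
                  [0, 0, 4, 7, 0],
                  [3, 5, 2, 0, 0]])

# Row and column indices of every nonzero pixel.
rows, cols = np.nonzero(image)

# Assumption: the column index plays the role of x and the row index of y.
x = cols.astype(float)
y = rows.astype(float)

# Design matrix with columns [x, 1], so that A @ [slope, intercept] ~ y.
A = np.column_stack((x, np.ones_like(x)))
(slope, intercept), residual_sum, rank, _ = np.linalg.lstsq(A, y, rcond=None)

print("slope:", slope, "intercept:", intercept)
```

Note that row indices grow downward in an image, so a line that rises on screen comes out with a negative slope in these coordinates.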

I once saw a similar Python program:

# Prac 2 for Monte Carlo methods in a nutshell
# Richard Chopping, ANU RSES and Geoscience Australia, October 2012
# Usage
# python prac_q2.py [number of bootstrap runs]
# e.g. python prac_q2.py 10000
# would execute this and perform 10 000 bootstrap runs.
# Default is 100 runs.

# sys cause I need to access the arguments the script was called with
import sys
# math cause it's handy for scalar maths
import math
# time cause I want to benchmark how long things take
import time
# numpy cause it gives us awesome array / matrix manipulation stuff
import numpy
# scipy just in case
import scipy
# scipy.stats to make life simpler statistically speaking
import scipy.stats as stats

def main():
    print("Prac 2 solution: no graphs")
    true_model = numpy.array([17.0, 10.0, 1.96])

    # Here's a nifty way to write out numpy arrays.
    # Unlike the data table in the prac handouts, I've got time first
    # and height second.
    # You can mix up the order but you need to change a lot of calculations
    # to deal with this change.
    data = numpy.array([[1.0, 26.94],
                        [2.0, 33.45],
                        [3.0, 40.72],
                        [4.0, 42.32],
                        [5.0, 44.30],
                        [6.0, 47.19],
                        [7.0, 43.33],
                        [8.0, 40.13]])
    # Perform the least squares regression to find the best fit solution
    best_fit = regression(data)
    # Nifty way to get out elements from an array
    m1, m2, m3 = best_fit
    print("Best fit solution:")
    print("m1 is", m1, "and m2 is", m2, "and m3 is", m3)

    # Calculate residuals from the best fit solution
    best_fit_resid = residuals(data, best_fit)

    print("The residuals from the best fit solution are:")
    print(best_fit_resid)
    print("")

    # Bootstrap part
    # --------------
    # Number of bootstraps to run. 100 is a minimum and our default number.
    num_booties = 100
    # If we have an argument to the python script, use this as the
    # number of bootstrap runs
    if len(sys.argv) > 1:
        num_booties = int(sys.argv[1])

    # preallocate an array to store the results.
    ensemble = numpy.zeros((num_booties, 3))

    print("Starting up the bootstrap routine")

    # How to do timing within a Python script - here I start a stopwatch running
    # (time.clock() no longer exists in Python 3.8+, so use time.perf_counter())
    start_time = time.perf_counter()
    for index in range(num_booties):
        # Print every 10 % so we know where we're up to in long runs
        if print_progress(index, num_booties):
            percent = (float(index) / float(num_booties)) * 100.0
            print("Have completed", percent, "percent")

        # For each iteration of the bootstrap algorithm,
        # first calculate mixed up residuals...
        resamp_resid = resamp_with_replace(best_fit_resid)
        # ... then generate new data...
        new_data = calc_new_data(data, best_fit, resamp_resid)
        # ... then perform another regression to generate a new set of m1, m2, m3
        bootstrap_model = regression(new_data)
        ensemble[index] = (bootstrap_model[0], bootstrap_model[1], bootstrap_model[2])
        # Done with the loop
    # Calculate the time the run took - what's the current time, minus when we started.
    loop_time = time.perf_counter() - start_time

    print("")

    print("Ensemble calculated based on", num_booties, "bootstrap runs.")
    print("Bootstrap runs took", loop_time, "seconds.")
    print("")

    # Stats on the ensemble time
    # --------------------------
    B = num_booties

    # Mean is pretty simple, 1.0/B to force it to use floating points
    # This gives us an array of the means of the 3 model parameters
    mean = 1.0/B * numpy.sum(ensemble, axis=0)
    print("Mean is ([m1 m2 m3]):", mean)

    # Variance (sigma squared, hence the name var2)
    var2 = 1.0/B * numpy.sum(((ensemble - mean)**2), axis=0)
    print("Variance is ([m1 m2 m3]):", var2)
    # Bias
    bias = mean - best_fit
    print("Bias is ([m1 m2 m3]):", bias)
    bias_corr = best_fit - bias
    print("Bias corrected solution is ([m1 m2 m3]):", bias_corr)
    print("The original solution was ([m1 m2 m3]):", best_fit)
    print("And the true solution is ([m1 m2 m3]):", true_model)

    print("")

    # Confidence intervals
    # ---------------------
    # Take column 0 of the ensemble: every bootstrap value of m1.
    # stats is my name for scipy.stats
    # This has a wonderful function that calculates percentiles, including performing interpolation
    # (important for low numbers of bootstrap runs).
    # It sorts the values itself, so the ensemble does not need sorting first,
    # and passing a single 1D column means it returns a plain number.
    m1_values = ensemble[:, 0]
    m1_perc0p5 = stats.scoreatpercentile(m1_values, 0.5)
    m1_perc2p5 = stats.scoreatpercentile(m1_values, 2.5)
    m1_perc16 = stats.scoreatpercentile(m1_values, 16)
    m1_perc84 = stats.scoreatpercentile(m1_values, 84)
    m1_perc97p5 = stats.scoreatpercentile(m1_values, 97.5)
    m1_perc99p5 = stats.scoreatpercentile(m1_values, 99.5)
    print("m1 68% confidence interval is from", m1_perc16, "to", m1_perc84)
    print("m1 95% confidence interval is from", m1_perc2p5, "to", m1_perc97p5)
    print("m1 99% confidence interval is from", m1_perc0p5, "to", m1_perc99p5)
    print("")

    # Now column 1 (m2)...
    m2_values = ensemble[:, 1]
    # ... and do stats.
    m2_perc0p5 = stats.scoreatpercentile(m2_values, 0.5)
    m2_perc2p5 = stats.scoreatpercentile(m2_values, 2.5)
    m2_perc16 = stats.scoreatpercentile(m2_values, 16)
    m2_perc84 = stats.scoreatpercentile(m2_values, 84)
    m2_perc97p5 = stats.scoreatpercentile(m2_values, 97.5)
    m2_perc99p5 = stats.scoreatpercentile(m2_values, 99.5)
    print("m2 68% confidence interval is from", m2_perc16, "to", m2_perc84)
    print("m2 95% confidence interval is from", m2_perc2p5, "to", m2_perc97p5)
    print("m2 99% confidence interval is from", m2_perc0p5, "to", m2_perc99p5)
    print("")

    # and finally column 2 (m3).
    m3_values = ensemble[:, 2]
    # ... and do stats.
    m3_perc0p5 = stats.scoreatpercentile(m3_values, 0.5)
    m3_perc2p5 = stats.scoreatpercentile(m3_values, 2.5)
    m3_perc16 = stats.scoreatpercentile(m3_values, 16)
    m3_perc84 = stats.scoreatpercentile(m3_values, 84)
    m3_perc97p5 = stats.scoreatpercentile(m3_values, 97.5)
    m3_perc99p5 = stats.scoreatpercentile(m3_values, 99.5)
    print("m3 68% confidence interval is from", m3_perc16, "to", m3_perc84)
    print("m3 95% confidence interval is from", m3_perc2p5, "to", m3_perc97p5)
    print("m3 99% confidence interval is from", m3_perc0p5, "to", m3_perc99p5)
    print("")
    # End of the main function


#
#   
# Helper functions go down here
#   
#   


# regression
# This takes a 2D numpy array and performs a least-squares regression
# using the formula on the practical sheet, page 3
# Stored in the top are the real values
# Returns an array of m1, m2 and m3.
def regression(data):
    # While testing, just return the real values
    # real_values = numpy.array([17.0, 10.0, 1.96])

    # Creating the G matrix
    # ---------------------
    # Because I'm using numpy arrays here, we need
    # to learn some notation.
    # data[:,0] is the FIRST column
    # Length of this = number of time samples in data
    N = len(data[:,0])

    # numpy.sum adds up all data in a row or column.
    # Axis = 0 implies add up each column. [0] at end
    # returns the sum of the first column
    # This is the sum of Ti for i = 1..N
    sum_Ti = numpy.sum(data, axis=0)[0]

    # numpy.power takes each element of an array and raises them to a given power
    # In this one call we also take the sum of the columns (as above) after they have
    # been squared, and then just take the t column
    sum_Ti2 = numpy.sum(numpy.power(data, 2), axis=0)[0]

    # Now we need to get the cube of Ti, then sum that result
    sum_Ti3 = numpy.sum(numpy.power(data, 3), axis=0)[0]

    # Finally we need the quartic of Ti, then sum that result
    sum_Ti4 = numpy.sum(numpy.power(data, 4), axis=0)[0]

    # Now we can construct the G matrix
    G = numpy.array([[N, sum_Ti, -0.5 * sum_Ti2],
                        [sum_Ti, sum_Ti2, -0.5 * sum_Ti3],
                        [-0.5 * sum_Ti2, -0.5 * sum_Ti3, 0.25 * sum_Ti4]])
    # We also need to take the inverse of the G matrix
    G_inv = numpy.linalg.inv(G)


    # Creating the d matrix
    # ---------------------
    # Hello numpy.sum, my old friend...
    sum_Yi = numpy.sum(data, axis=0)[1]

    # numpy.prod multiplies the values in an array.
    # We need to do the products along axis 1 (i.e. row by row)
    # Then sum all the elements
    sum_TiYi = numpy.sum(numpy.prod(data, axis=1))

    # The final element we need is a bit tricky.
    # We need the product as above
    TiYi = numpy.prod(data, axis=1)
    # Then we get tricky. * works how we need it here,
    # remember that the Ti column is referenced by data[:,0] as above
    Ti2Yi = TiYi * data[:,0]
    # Then we sum
    sum_Ti2Yi = numpy.sum(Ti2Yi)

    #With all the elements, we make the d matrix
    d = numpy.array([sum_Yi,
                    sum_TiYi,
                    -0.5 * sum_Ti2Yi])

    # Do the linear algebra stuff
    # To multiply numpy arrays in a matrix style,
    # we need to use numpy.dot()
    # Not the most useful notation, but there you go.
    # To help out the Matlab users: http://www.scipy.org/NumPy_for_Matlab_Users
    result = G_inv.dot(d)

    #Return this result
    return result

# residuals:
# Takes in a data array, and an array of best fit parameters
# calculates the difference between the observed and predicted data
# and returns an array
def residuals(data, best_fit):
    # Extract ti from the data array
    ti = data[:,0]
    # We also need an array of the square of ti
    ti2 = numpy.power(ti, 2)

    # Extract yi
    yi = data[:,1]

    # Calculate residual (data minus predicted)
    result = yi - best_fit[0] - (best_fit[1] * ti) + (0.5 * best_fit[2] * ti2)

    return result

# resamp_with_replace:
# Perform a dataset resampling with replacement on a set of values.
# Uses numpy.random to generate the random numbers to pick the indices to look up.
# So for item 0, ..., N-1, we look up a random index from the values and put that in
# our resampled data.
def resamp_with_replace(values):
    # How many things do we need to do this for?
    N = len(values)

    # Preallocate our result array
    result = numpy.zeros(N)

    # Generate N random integers between 0 and N-1.
    # randint's upper bound is exclusive, so pass N to be able to pick index N-1.
    indices = numpy.random.randint(0, N, N)

    # For i from the set 0...N-1 (that's what the range() command gives us),
    # our result for that i is given by the index we randomly generated above
    for i in range(N):
        result[i] = values[indices[i]]

    return result

# calc_new_data:
# Given a set of resampled residuals, use the model parameters to derive
# new data. This is used for bootstrapping the residuals.
# true_data is a numpy array of rows of ti, yi. We only need the ti column though.
# model is an array of three parameters, corresponding to m1, m2, m3.
# residuals are an array of our residuals
def calc_new_data(true_data, model, residuals):
    # Extract the time information from the new data array
    ti = true_data[:,0]

    # Calculate new data using array maths
    # This goes through and does the sums etc for each element of the array
    # Nice and compact way to represent it.
    y_new = residuals + model[0] + (model[1] * ti) - (0.5 * model[2] * ti**2)

    # Our result needs to be an array of ti, y_new, so we need to combine them using
    # the numpy.column_stack routine
    result = numpy.column_stack((ti, y_new))

    # Return this combined array
    return result

# print_progress:
# Just a quick thing that returns true if we want to print for this index
# and false otherwise
def print_progress(index, total):
    index = float(index)
    total = float(total)

    result = False

    # Floating point maths is irritating
    # We want to print at the start, every 10%, and at the end.
    # This works up to index = 100,000
    # Would also be lovely if Python had a switch statement
    if (((index / total) * 100) <= 0.00001):
        result = True
    elif (((index / total) * 100) >= 9.99999) and (((index / total) * 100) <= 10.00001):
        result = True
    elif (((index / total) * 100) >= 19.99999) and (((index / total) * 100) <= 20.00001):
        result = True
    elif (((index / total) * 100) >= 29.99999) and (((index / total) * 100) <= 30.00001):
        result = True
    elif (((index / total) * 100) >= 39.99999) and (((index / total) * 100) <= 40.00001):
        result = True
    elif (((index / total) * 100) >= 49.99999) and (((index / total) * 100) <= 50.00001):
        result = True
    elif (((index / total) * 100) >= 59.99999) and (((index / total) * 100) <= 60.00001):
        result = True
    elif (((index / total) * 100) >= 69.99999) and (((index / total) * 100) <= 70.00001):
        result = True
    elif (((index / total) * 100) >= 79.99999) and (((index / total) * 100) <= 80.00001):
        result = True
    elif (((index / total) * 100) >= 89.99999) and (((index / total) * 100) <= 90.00001):
        result = True
    elif ((((index+1) / total) * 100) > 99.99999):
        result = True
    else:
        result = False

    return result

#
#   
# End of helper functions
#
#

# So we can easily execute our script
if __name__ == "__main__":
    main()
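
The regression helper in the script above builds the normal equations by hand and inverts G. As a rough sketch of an alternative, the same model y = m1 + m2*t - 0.5*m3*t**2 can be handed to numpy.linalg.lstsq directly by giving it one design-matrix column per parameter. The function name regression_lstsq is hypothetical, and the noise-free test data is generated here only to check that known parameters are recovered.

```python
import numpy as np

def regression_lstsq(data):
    """Fit y = m1 + m2*t - 0.5*m3*t**2 by least squares; returns [m1, m2, m3]."""
    t = data[:, 0]
    y = data[:, 1]
    # One column per model parameter: 1, t, and -0.5 * t**2.
    design = np.column_stack((np.ones_like(t), t, -0.5 * t**2))
    model, _, _, _ = np.linalg.lstsq(design, y, rcond=None)
    return model

# Noise-free synthetic data built from known parameters.
t = np.arange(1.0, 9.0)
true_model = np.array([17.0, 10.0, 1.96])
y = true_model[0] + true_model[1] * t - 0.5 * true_model[2] * t**2

fitted = regression_lstsq(np.column_stack((t, y)))
print(fitted)
```

Letting lstsq solve the system avoids forming and inverting G explicitly, which tends to be the numerically safer choice when the columns of the design matrix are strongly correlated.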


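from sklearn import linear_model

# Ordinary least-squares fit: three samples with two features each, and their targets
clf = linear_model.LinearRegression()
clf.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])

# Fitted coefficients, one per feature
clf.coef_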
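
The scikit-learn call above expects explicit sample rows and targets, so the image array from the question would first have to be turned into coordinates. If the pixel values are meant to act as weights (an assumption; the question does not say what they represent), a weighted line fit could look like the following NumPy-only sketch. The image and its values are invented for illustration.

```python
import numpy as np

# Hypothetical image; the nonzero entries stand in for a..f in the question.
image = np.array([[0, 0, 0, 0, 9],
                  [0, 0, 4, 7, 0],
                  [3, 5, 2, 0, 0]])

# Coordinates of the nonzero pixels and their values, used here as weights.
rows, cols = np.nonzero(image)
weights = image[rows, cols].astype(float)

# np.polyfit applies w to the unsquared residual, so passing sqrt(weight)
# makes each squared residual count in proportion to the pixel value.
slope, intercept = np.polyfit(cols, rows, 1, w=np.sqrt(weights))

print("slope:", slope, "intercept:", intercept)
```

Brighter pixels pull the line toward themselves in this version; dropping the w argument gives the plain unweighted fit.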