Beginner Python OLS without typing the numbers explicitly, i.e. straight from a DataFrame


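I need help understanding how to run OLS (or any machine learning) in Python. I have installed all the relevant packages, such as pandas, numpy, statsmodels, scipy, etc.

Here is my basic example: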

df3= DataFrame({'revenue':[5,7,4,5,3,6,4,7,4,8,3,4],'cost':[2,4,4,3,6,7,5,4,7,23,4,7], 'overhead':[3,4,5,6,4,3,4,5,4,3,4,5]})
df3
df3.loc[0,'cost'] = 4
df3
df3.loc[12]=[1,5,8]
df3
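One caveat about the row assignment above: `df3.loc[12]=[1,5,8]` fills the columns by position, so the result depends on the column order. The printout in this post shows the columns sorted as cost, overhead, revenue, whereas recent pandas releases keep the dict's insertion order (revenue, cost, overhead). A minimal sketch that sidesteps this by naming the columns, assuming the intended new row is cost=1, overhead=5, revenue=8 as in the printout:

```python
import pandas as pd

df3 = pd.DataFrame({
    "revenue":  [5, 7, 4, 5, 3, 6, 4, 7, 4, 8, 3, 4],
    "cost":     [2, 4, 4, 3, 6, 7, 5, 4, 7, 23, 4, 7],
    "overhead": [3, 4, 5, 6, 4, 3, 4, 5, 4, 3, 4, 5],
})

# Overwrite a single cell by row label and column name.
df3.loc[0, "cost"] = 4

# Add row 12 by column name, so the values cannot land in the wrong
# column regardless of how the columns happen to be ordered.
df3.loc[12] = pd.Series({"cost": 1, "overhead": 5, "revenue": 8})

print(df3[["cost", "overhead", "revenue"]])
```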
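OK, great. Since my DataFrame now has additional rows, I don't want to have to copy the independent and dependent variables into my regression formula by hand.

The OLS regression formula: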

df3= pd.DataFrame({"cost":[#Numbers would go here], "overhead":[#Numbers would go here], "revenue": [#Numbers would go here]})
reg = ols (y=df3["cost"], x=df3[["overhead","revenue"]])
reg
print(df3.to_csv(columns=['cost'], sep='\t', index=False))
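So I use this to_csv call to pull individual columns out of the DataFrame, so that I can copy them into Excel and then copy them back into my regression formula to solve it. But what if I want to use only Python, without copying and pasting back and forth between it and other software?


Is there a way, without any other software, to reference my "cost", "overhead" and "revenue" data in the OLS regression formula without explicitly typing in every number?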

To answer my own question, here is what I did. I used the as_matrix function and converted the DataFrame to an array.

numpyMatrix = df3.as_matrix()

numpyMatrix

kk = np.array(numpyMatrix)

kk
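A note for readers on newer pandas: `as_matrix()` was deprecated and is not available in pandas 1.0 and later; `to_numpy()` is the usual replacement. Its result is already a NumPy array, so the extra `np.array(...)` wrapper is not needed. A small sketch, assuming the 13-row frame from the printout and selecting the columns explicitly so the array's column order is known:

```python
import numpy as np
import pandas as pd

df3 = pd.DataFrame({
    "cost":     [4, 4, 4, 3, 6, 7, 5, 4, 7, 23, 4, 7, 1],
    "overhead": [3, 4, 5, 6, 4, 3, 4, 5, 4, 3, 4, 5, 5],
    "revenue":  [5, 7, 4, 5, 3, 6, 4, 7, 4, 8, 3, 4, 8],
})

# Column order of the array follows the list given here.
kk = df3[["cost", "overhead", "revenue"]].to_numpy()

print(type(kk))   # a numpy.ndarray, no np.array() conversion required
print(kk[:, 0])   # the 'cost' column
```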

Then I printed out each column of data and copied and pasted it into my own OLS formula.

kk[:,0]

kk[:,1]

kk[:,2]

df3 = pd.DataFrame({"cost":[4, 4, 4, 3, 6, 7, 5, 4, 7, 23, 4, 7, 1], "overhead":[3, 4, 5, 6, 4, 3, 4, 5, 4, 3, 4, 5, 5], "revenue":[5, 7, 4, 5, 3, 6, 4, 7, 4, 8, 3, 4, 8]})

reg = ols(y=df3["cost"], x=df3[["overhead","revenue"]])

reg

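You can just copy and paste the array output into the OLS formula to solve it. I don't think this is the best way to solve the problem, but at least I don't need to use any other software.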
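The copy-and-paste step can be skipped altogether, because the array slices are ordinary Python objects that can be passed along as they are. A rough sketch using only NumPy's least-squares solver (not the `ols` helper from the post), assuming the same 13 rows; the variable names are illustrative:

```python
import numpy as np
import pandas as pd

df3 = pd.DataFrame({
    "cost":     [4, 4, 4, 3, 6, 7, 5, 4, 7, 23, 4, 7, 1],
    "overhead": [3, 4, 5, 6, 4, 3, 4, 5, 4, 3, 4, 5, 5],
    "revenue":  [5, 7, 4, 5, 3, 6, 4, 7, 4, 8, 3, 4, 8],
})

# Dependent variable and regressors, taken straight from the DataFrame.
y = df3["cost"].to_numpy(dtype=float)
features = df3[["overhead", "revenue"]].to_numpy(dtype=float)

# Append a column of ones so the last coefficient is the intercept.
X = np.column_stack([features, np.ones(len(df3))])

# Ordinary least squares: minimise ||X @ coef - y||^2.
coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
overhead_coef, revenue_coef, intercept = coef
print(overhead_coef, revenue_coef, intercept)
```

The three values should agree with the coefficient table in the regression summary further down.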

Thanks again to all users; I would like to hear your comments.

Here is the full printout. Sorry for any confusion, and thank you for your time.

import pandas as pd
from pandas import DataFrame
print ('Pandas Version:' + pd.__version__)
from pandas.stats.api import ols
import numpy as np
print ('numpy Version:' + np.__version__)
from numpy import array
from numpy import array
from numpy import mean
from numpy import median
from numpy import std
from numpy import var
from numpy import amin
from numpy import amax
from numpy import nanmin
from numpy import nanmax
from numpy import ptp
from numpy import percentile
from numpy import average
from numpy import nanmean
from numpy import nanstd
from numpy import nanvar
from numpy import corrcoef
from numpy import correlate
from numpy import cov
from numpy import histogram
from numpy import histogram2d
from numpy import histogramdd
from numpy import bincount
from numpy import digitize
import collections
import math
import scipy.stats
import statsmodels.api as sm
from scipy import stats
import statsmodels as sm
import pylab as pl
from numpy.random import rand
from numpy.random import randn
from numpy.random import randint
from numpy.random import random_integers
from numpy.random import random_sample
from numpy.random import random
from numpy.random import ranf
from numpy.random import sample
from numpy.random import choice
from numpy.random import bytes
from numpy.random import shuffle
from numpy.random import permutation
from numpy.random import beta
from numpy.random import binomial
from numpy.random import chisquare
from numpy.random import dirichlet
from numpy.random import exponential
from numpy.random import f
from numpy.random import gamma
from numpy.random import geometric
from numpy.random import gumbel
from numpy.random import hypergeometric
from numpy.random import laplace
from numpy.random import logistic
from numpy.random import lognormal
from numpy.random import logseries
from numpy.random import multinomial
from numpy.random import multivariate_normal
from numpy.random import negative_binomial
from numpy.random import noncentral_chisquare
from numpy.random import noncentral_f
from numpy.random import normal
from numpy.random import pareto
from numpy.random import poisson
from numpy.random import power
from numpy.random import rayleigh
from numpy.random import standard_cauchy
from numpy.random import standard_exponential
from numpy.random import standard_gamma
from numpy.random import standard_normal
from numpy.random import standard_t
from numpy.random import triangular
from numpy.random import uniform
from numpy.random import vonmises
from numpy.random import wald
from numpy.random import weibull
from numpy.random import zipf
from numpy.random import RandomState
from numpy.random import seed
from numpy.random import get_state
from numpy.random import set_state
from __future__ import print_function
import numpy as np
import statsmodels.api as sm
from scipy import stats
from matplotlib import pyplot as plt
import statsmodels.api as sm
from numpy import array
from numpy import mean
from numpy import median
import collections
import math
from pandas.stats.api import ols




df3= DataFrame({'revenue':[5,7,4,5,3,6,4,7,4,8,3,4],'cost':[2,4,4,3,6,7,5,4,7,23,4,7], 'overhead':[3,4,5,6,4,3,4,5,4,3,4,5]})
df3
Out[62]:
cost    overhead    revenue
0   2   3   5
1   4   4   7
2   4   5   4
3   3   6   5
4   6   4   3
5   7   3   6
6   5   4   4
7   4   5   7
8   7   4   4
9   23  3   8
10  4   4   3
11  7   5   4
In [63]:

df3.loc[0,'cost'] = 4
df3
Out[63]:
cost    overhead    revenue
0   4   3   5
1   4   4   7
2   4   5   4
3   3   6   5
4   6   4   3
5   7   3   6
6   5   4   4
7   4   5   7
8   7   4   4
9   23  3   8
10  4   4   3
11  7   5   4
In [64]:

df3.loc[12]=[1,5,8]
df3
Out[64]:
cost    overhead    revenue
0   4   3   5
1   4   4   7
2   4   5   4
3   3   6   5
4   6   4   3
5   7   3   6
6   5   4   4
7   4   5   7
8   7   4   4
9   23  3   8
10  4   4   3
11  7   5   4
12  1   5   8
In [30]:

df3.iloc[:,[0]]
df3.iloc[:,[1]]
df3.iloc[:,[2]]
Out[30]:
revenue
0   5
1   7
2   4
3   5
4   3
5   6
6   4
7   7
8   4
9   8
10  3
11  4
12  8
In [70]:

jk=df3.iloc[:,[0]]
df3.ix[:,1]
Out[70]:
0     3
1     4
2     5
3     6
4     4
5     3
6     4
7     5
8     4
9     3
10    4
11    5
12    5
Name: overhead, dtype: int64
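Side note on the cell above: the `.ix` indexer was deprecated and is gone as of pandas 1.0. Selecting by position with `.iloc` or by name with plain brackets gives the same column. A minimal sketch, assuming the column order cost, overhead, revenue shown in the printout:

```python
import pandas as pd

df3 = pd.DataFrame({
    "cost":     [4, 4, 4, 3, 6, 7, 5, 4, 7, 23, 4, 7, 1],
    "overhead": [3, 4, 5, 6, 4, 3, 4, 5, 4, 3, 4, 5, 5],
    "revenue":  [5, 7, 4, 5, 3, 6, 4, 7, 4, 8, 3, 4, 8],
})

by_position = df3.iloc[:, 1]     # second column as a Series
by_name = df3["overhead"]        # same column, selected by label
as_frame = df3.iloc[:, [1]]      # list of positions -> one-column DataFrame

print(by_position.equals(by_name))
```

Selecting by name is usually the safer choice, since it does not depend on how the columns are ordered.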
In [80]:

numpyMatrix = df3.as_matrix()
numpyMatrix
Out[80]:
array([[ 4,  3,  5],
       [ 4,  4,  7],
       [ 4,  5,  4],
       [ 3,  6,  5],
       [ 6,  4,  3],
       [ 7,  3,  6],
       [ 5,  4,  4],
       [ 4,  5,  7],
       [ 7,  4,  4],
       [23,  3,  8],
       [ 4,  4,  3],
       [ 7,  5,  4],
       [ 1,  5,  8]], dtype=int64)
In [75]:

print (df3.to_csv(columns=['cost'], sep='\t', index=False))
cost
4
4
4
3
6
7
5
4
7
23
4
7
1

In [94]:

kk=np.array(numpyMatrix)
kk
Out[94]:
array([[ 4,  3,  5],
       [ 4,  4,  7],
       [ 4,  5,  4],
       [ 3,  6,  5],
       [ 6,  4,  3],
       [ 7,  3,  6],
       [ 5,  4,  4],
       [ 4,  5,  7],
       [ 7,  4,  4],
       [23,  3,  8],
       [ 4,  4,  3],
       [ 7,  5,  4],
       [ 1,  5,  8]], dtype=int64)
In [100]:

kk[:,0]
Out[100]:
array([ 4,  4,  4,  3,  6,  7,  5,  4,  7, 23,  4,  7,  1], dtype=int64)
In [101]:

kk[:,1]
Out[101]:
array([3, 4, 5, 6, 4, 3, 4, 5, 4, 3, 4, 5, 5], dtype=int64)
In [102]:

kk[:,2]
Out[102]:
array([5, 7, 4, 5, 3, 6, 4, 7, 4, 8, 3, 4, 8], dtype=int64)
In [103]:

df3= pd.DataFrame({"cost":[4,  4,  4,  3,  6,  7,  5,  4,  7, 23,  4,  7,  1], "overhead":[3, 4, 5, 6, 4, 3, 4, 5, 4, 3, 4, 5, 5], "revenue": [5, 7, 4, 5, 3, 6, 4, 7, 4, 8, 3, 4, 8]})
reg = ols (y=df3["cost"], x=df3[["overhead","revenue"]])
reg
Out[103]:

-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <overhead> + <revenue> + <intercept>

Number of Observations:         13
Number of Degrees of Freedom:   3

R-squared:         0.3185
Adj R-squared:     0.1822

Rmse:              4.8625

F-stat (2, 10):     2.3363, p-value:     0.1470

Degrees of Freedom: model 2, resid 10

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
      overhead    -2.8085     1.5201      -1.85     0.0944    -5.7878     0.1709
       revenue     0.7575     0.7885       0.96     0.3594    -0.7880     2.3029
     intercept    13.9969     8.0440       1.74     0.1125    -1.7694    29.7631
---------------------------------End of Summary---------------------------------
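As a cross-check on the summary above, the headline statistics can be recomputed from the data with plain NumPy. This is only a sketch under the usual OLS definitions (R-squared as 1 - SS_res/SS_tot, RMSE as the square root of SS_res over the residual degrees of freedom); it is not how the `ols` helper computes them internally.

```python
import numpy as np

cost = np.array([4, 4, 4, 3, 6, 7, 5, 4, 7, 23, 4, 7, 1], dtype=float)
overhead = np.array([3, 4, 5, 6, 4, 3, 4, 5, 4, 3, 4, 5, 5], dtype=float)
revenue = np.array([5, 7, 4, 5, 3, 6, 4, 7, 4, 8, 3, 4, 8], dtype=float)

# Design matrix: two regressors plus an intercept column.
X = np.column_stack([overhead, revenue, np.ones_like(cost)])
coef, _, _, _ = np.linalg.lstsq(X, cost, rcond=None)

n, k = X.shape                      # 13 observations, 3 estimated parameters
resid = cost - X @ coef
ss_res = float(resid @ resid)                       # residual sum of squares
ss_tot = float(((cost - cost.mean()) ** 2).sum())   # total sum of squares

r2 = 1.0 - ss_res / ss_tot
adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k)
rmse = np.sqrt(ss_res / (n - k))
f_stat = ((ss_tot - ss_res) / (k - 1)) / (ss_res / (n - k))

print(round(r2, 4), round(adj_r2, 4), round(rmse, 4), round(f_stat, 4))
```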