Python: finding which features in a dataset are collinear
Tags: python, python-3.x, machine-learning, statistics, statsmodels


I have built a model that predicts house prices from several features:

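import statsmodels.api as statsmdl
from sklearn import datasets

# `data` is the house-price DataFrame; how it is loaded is not shown in the question.
X = data[['NumberofRooms', 'YearBuilt', 'Type', 'NewConstruction']]
y = data["Price"]

# Fit ordinary least squares and inspect the fit.
model = statsmdl.OLS(y, X).fit()
predictions = model.predict(X)
model.summary()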

How can I determine which of these features are collinear?

You can use the DataFrame.corr() method.

Demo:

In [27]: df = pd.DataFrame(np.random.randint(10, size=(5,3)), columns=list('abc'))

In [28]: df['d'] = df['a'] * 10 - df['b'] / np.pi

In [29]: df['e'] = np.log(df['c'] **2)

In [30]: c = df.corr()

In [31]: c
Out[31]:
          a         b         c         d         e
a  1.000000  0.734858  0.113787  0.999837  0.067358
b  0.734858  1.000000 -0.523635  0.722485 -0.598739
c  0.113787 -0.523635  1.000000  0.129945  0.984257
d  0.999837  0.722485  0.129945  1.000000  0.084615
e  0.067358 -0.598739  0.984257  0.084615  1.000000

In [32]: c[c >= 0.7]
Out[32]:
          a         b         c         d         e
a  1.000000  0.734858       NaN  0.999837       NaN
b  0.734858  1.000000       NaN  0.722485       NaN
c       NaN       NaN  1.000000       NaN  0.984257
d  0.999837  0.722485       NaN  1.000000       NaN
e       NaN       NaN  0.984257       NaN  1.000000

In [33]: c[c >= 0.7].stack().reset_index(name='cor').query("abs(cor) < 1.0")
Out[33]:
   level_0 level_1       cor
1        a       b  0.734858
2        a       d  0.999837
3        b       a  0.734858
5        b       d  0.722485
7        c       e  0.984257
8        d       a  0.999837
9        d       b  0.722485
11       e       c  0.984257
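
The threshold filter in the demo keeps only positive correlations (c >= 0.7) and lists every pair twice, once as (a, b) and once as (b, a). A minimal sketch of one way to report each pair once and also catch strong negative correlations is given here. The helper name correlated_pairs, the column labels, the 0.7 default, and the synthetic data (integers drawn from 1 to 9 so the logarithm stays finite) are assumptions made for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df, threshold=0.7):
    """Return each feature pair with |correlation| >= threshold, listed once."""
    corr = df.corr()
    # Keep only the strict upper triangle: this drops the diagonal of 1.0
    # values and avoids repeating (a, b) as (b, a).
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().reset_index()
    pairs.columns = ["feature_1", "feature_2", "cor"]
    # Compare absolute values so strong negative correlations are kept too;
    # any NaN left over from masking fails the comparison and is dropped.
    return pairs[pairs["cor"].abs() >= threshold].reset_index(drop=True)


# Synthetic data in the spirit of the demo (illustrative only).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(1, 10, size=(50, 3)), columns=list("abc"))
df["d"] = df["a"] * 10 - df["b"] / np.pi
df["e"] = np.log(df["c"] ** 2)

pairs = correlated_pairs(df)
print(pairs)
```

As with the demo, this only flags pairwise relationships; a feature that is a combination of several others can slip through.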

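Comments:

This only gives pairwise correlation, not collinearity involving more than two variables.

@Josef, good point, thanks! I hadn't thought of multicollinearity... What would you use in that case?

VIF, or QR for sequential identification. There is a Stack Overflow answer on using QR in the singular-matrix case, which I can't find right now. If more than two variables are involved, it is harder to identify which sets of variables are related, because all we get is a subspace, e.g. two eigenvectors whose eigenvalues are close to zero. (There are several open, currently dormant pull requests in statsmodels for multicollinearity measures.)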