Extracting the indices of outliers in a NumPy linear regression

numpy matplotlib statistics


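The script below computes the R-squared value between two numpy arrays (x and y).

Because the data contain outliers, the R-squared value is very low. How can I extract the indices of these outliers?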

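import numpy as np, matplotlib.pyplot as plt, scipy.stats as stats

# randint excludes the upper bound, so 51 yields integers from 1 to 50
x = np.random.randint(1, 51, 50)
y = np.random.randint(1, 51, 50)

# linregress returns (slope, intercept, rvalue, pvalue, stderr); R2 is rvalue squared
r2 = stats.linregress(x, y).rvalue ** 2
print(r2)

plt.scatter(x, y)
plt.show()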

An outlier is defined as a value for which |value - mean| > 2 * standard deviation. You can find them with this one-liner:

[i for i in range(len(x)) if (abs(x[i] - np.mean(x)) > 2*np.std(x))]

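What it does: it builds a list of the indices of x whose elements satisfy the condition above.

A quick test:

x = np.random.randint(1, 51, 50)  # upper bound is exclusive, so values run from 1 to 50

This gave me the array: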

array([16,  6, 13, 18, 21, 37, 31,  8,  1, 48,  4, 40,  9, 14,  6, 45, 20,
       15, 14, 32, 30,  8, 19,  8, 34, 22, 49,  5, 22, 23, 39, 29, 37, 24,
       45, 47, 21,  5,  4, 27, 48,  2, 22,  8, 12,  8, 49, 12, 15, 18])

Now I add a couple of outliers by hand, since there were none to begin with:

x[4] = 200
x[15] = 178

Let's test it:

[i for i in range(len(x)) if (abs(x[i] - np.mean(x)) > 2*np.std(x))]

Result:

[4, 15]

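Is this what you were looking for?

Edit:

I added the abs() function to the line above, because the result can go badly wrong when negative numbers are involved (a value far below the mean gives a negative difference that never exceeds the threshold). abs() takes the absolute value.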
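The same two-standard-deviation rule can also be written without recomputing the mean and standard deviation on every loop iteration. The following is a minimal sketch, not part of the original answer; the function name `outlier_indices`, the parameter `n_std`, and the demo data are assumptions made for illustration.

```python
import numpy as np


def outlier_indices(values, n_std=2.0):
    """Indices of entries more than n_std standard deviations from the mean.

    Hypothetical helper: a vectorized form of the list comprehension above.
    """
    values = np.asarray(values, dtype=float)
    deviation = np.abs(values - values.mean())  # distance from the mean
    return np.flatnonzero(deviation > n_std * values.std())


# Hypothetical demo data: 48 identical values plus two injected extremes.
data = np.full(50, 10.0)
data[4] = 200.0
data[15] = 178.0

print(outlier_indices(data).tolist())  # → [4, 15]
```

Note that the mean and standard deviation are themselves pulled toward extreme values, so with many or very large outliers a fixed threshold of 2 may flag fewer points than expected.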


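I think Sander's approach is the right one, but if you need to see R2 without those outliers before making a decision, here is one way.

Set up the data and introduce an outlier:

In [1]:
import numpy as np, scipy.stats as stats
np.random.seed(123)

# randint excludes the upper bound, so 51 yields integers from 1 to 50
x = np.random.randint(1, 51, 50)
y = np.random.randint(1, 51, 50)
y[5] = 100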

Compute R2 with one y value (together with its matching x value) left out at a time, storing the results in the array r2.
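One way such a leave-one-out array could be built is sketched below. This is an assumption about the approach, not the answerer's actual code: the function name `leave_one_out_r2` and the demo data are hypothetical, and because it squares `rvalue` and uses different data, its numbers need not match the outputs quoted in this answer.

```python
import numpy as np
from scipy import stats


def leave_one_out_r2(x, y):
    """Return an array whose i-th entry is R^2 of the fit with point i removed."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Row i of `keep` is True everywhere except at column i.
    keep = ~np.eye(len(x), dtype=bool)
    return np.array([stats.linregress(x[row], y[row]).rvalue ** 2 for row in keep])


# Hypothetical demo data: an exact linear trend with one corrupted point.
x_demo = np.arange(20)
y_demo = 2.0 * x_demo + 1.0
y_demo[5] = 100.0

r2_demo = leave_one_out_r2(x_demo, y_demo)
print(r2_demo.argmax())  # index whose removal improves the fit the most
```

The point whose removal raises R2 the most is the strongest outlier candidate, which is why `argmax()` is applied to the result.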

Get the index of the biggest outlier:

r2.argmax()

Out[1]:
5

Get R2 with this outlier removed:

In [2]:
r2[r2.argmax()]

Out[2]:
0.85892084723588935

Get the value of the outlier:

In [3]:
y[r2.argmax()]

Out[3]:
100

To get the top n outliers:

In [4]:
n = 5
sorted_index = r2.argsort()[::-1]
sorted_index[:n]

Out[4]:
array([ 5, 27, 34,  0, 17], dtype=int64)


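Is it possible to get the indices of more than one (say 5) of the biggest outliers?

r2 is an ordinary ndarray, so you can do all the usual sorting, slicing, and so on, and even plot it to see how r2 changes for each value. I added an example for getting the n biggest outliers.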
