Python: How to replace outliers with the nearest value, like Matlab's filloutliers function?

Tags: python, matlab, numpy, dynamic-programming

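I need to replicate Matlab's filloutliers(someList, 'nearest', 'mean') function.

I have the code below, which is mostly correct. However, when I give it my data set, it replaces the outlier with the wrong value: 453.675231 becomes 0 instead of 211.71818100000002. I have tried changing the compareNeighbors function in a number of ways, but at this point I really don't know what to do.

I am including the data so you can copy and paste it and run the code directly. If I flip the comparison in compareNeighbors, the function works for this example but not for others.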

import numpy as np
from math import sqrt
from statistics import stdev as std

def compareNeighbors(before, current, after):
    valBefore = (before - current)
    valAfter = (after - current)

    print(valBefore)
    print(valAfter)

    return(valBefore < valAfter) 

def findNearestValue(data, before, current, after):
    before = before if before > -1 else 0
    after = after if after < len(data) else len(data) - 1

    valBefore = data[before] if before != current else 10000000000
    valAfter = data[after] if after != current else 10000000000

    return valBefore if compareNeighbors(valBefore, valAfter, data[current]) else valAfter

def getOutlierLists(data, distance):
    outlierList = []
    outlierList.extend(data[data > distance].tolist())
    outlierList.extend(data[data < -distance].tolist())

    outlierListIndecies = [i for i, j in enumerate(data) if j in outlierList]

    return(outlierList, outlierListIndecies)

def filloutliers(data):
    stad = std(data)
    mean = np.mean(data)
    distance = 3*stad + mean

    (outlierList, outlierListIndecies) = getOutlierLists(data, distance)

    print(outlierList, " | ", outlierListIndecies, " | ", distance, " | ", mean)

    for i in range(len(outlierList)):
        data[outlierListIndecies[i]] = findNearestValue(data, outlierListIndecies[i] - 1, outlierListIndecies[i], outlierListIndecies[i] + 1)

    (outlierList, outlierListIndecies) = getOutlierLists(data, distance)

    if(len(outlierList) != 0):
        for i in reversed(range(len(outlierList))):
            data[outlierListIndecies[i]] = findNearestValue(data, outlierListIndecies[i] - 1, outlierListIndecies[i], outlierListIndecies[i] + 1)

    return data
Outliers: [453.675231]
Position in array: [46]
Max value beyond which a value counts as an outlier: +/-415.6792282141016
Mean: 99.862390280000001

Input data:
[0.0, 195.471464000003, 0.0, 143.1795457, 19.7727047, 0.0, 37.9259413, 67.4346233, 175.714837, 140.72522700000002, 42.116339999999994, 0.0, 11.829232000000005, 0.0, 225.20435399999997, 25.939856999999996, 9.875561000000005, 0.0, 30.22819100000001, 141.658386, 191.42069600000002, 182.451406, 188.27667599999998, 0.0, 192.48585400000002, 0.0, 79.817566, 94.469158, 97.0669257, 153.0584423, 87.5491337, 0.0, 87.5491337, 0.0, 377.6008777, 176.6662877, 397.683778, 82.18773, 136.917358, 79.201378, 57.71598, 1.795560000000009, 1.795560000000009, 19.405960000000007, 135.51628, 0.0, 453.675231, 211.71818100000002, 109.460083, 13.761809999999997, 0.0, 114.462883, 7.609375, 159.630814, 9.943822999999998, 0.0, 93.460329, 55.87061700000001, 46.083324000000005, 58.686195999999995, 18.636627, 0.0, 22.810349000000002, 144.659505, 0.0, 267.669085, 290.303405, 110.52316300000001, 52.656178, 110.52316300000001, 52.656178, 123.26508600000001, 61.89890700000001, 158.23855600000002, 194.428161, 181.365445, 264.36523, 0.0, 274.60668, 48.543030000000016, 308.51727600000004, 357.209626, 24.18412, 46.621155, 70.805275, 181.781889, 364.741453, 0.0, 143.6235490000003, 0.0, 4.20169100000004, 0.0, 0.0, 135.2808976, 87.3988186, 216.920091, 84.215256, 161.518512, 0.0]

Output data:
[0.0, 195.471464000003, 0.0, 143.1795457, 19.7727047, 0.0, 37.9259413, 67.4346233, 175.714837, 140.72522700000002, 42.116339999999994, 0.0, 11.829232000000005, 0.0, 225.20435399999997, 25.939856999999996, 9.875561000000005, 0.0, 30.22819100000001, 141.658386, 191.42069600000002, 182.451406, 188.27667599999998, 0.0, 192.48585400000002, 0.0, 79.817566, 94.469158, 97.0669257, 153.0584423, 87.5491337, 0.0, 87.5491337, 0.0, 377.6008777, 176.6662877, 397.683778, 82.18773, 136.917358, 79.201378, 57.71598, 1.795560000000009, 1.795560000000009, 19.405960000000007, 135.51628, 0.0, 0.0, 211.71818100000002, 109.460083, 13.761809999999997, 0.0, 114.462883, 7.609375, 159.630814, 9.943822999999998, 0.0, 93.460329, 55.87061700000001, 46.083324000000005, 58.686195999999995, 18.636627, 0.0, 22.810349000000002, 144.659505, 0.0, 267.669085, 290.303405, 110.52316300000001, 52.656178, 110.52316300000001, 52.656178, 123.26508600000001, 61.89890700000001, 158.23855600000002, 194.428161, 181.365445, 264.36523, 0.0, 274.60668, 48.543030000000016, 308.51727600000004, 357.209626, 24.18412, 46.621155, 70.805275, 181.781889, 364.741453, 0.0, 143.6235490000003, 0.0, 4.201691000000004, 0.0, 0.0, 135.2808976, 87.3988186, 216.920091, 84.215256, 161.518512, 0.0]


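This only works for this specific use case, where you need to fill each outlier with the nearest value that lies within 3 standard deviations of the mean.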

import numpy as np
from math import sqrt
from statistics import stdev as std

def isNotOutlier(point, upper, lower):
    return (point < upper and point > lower)

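def findNearestValue(data, before, current, after, threshAbove, threshBelow):
    # Walk outward from the outlier one step at a time, checking the right-hand
    # neighbour first, and return the first value that lies inside the thresholds.
    before = before if before > -1 else 0
    after = after if after < len(data) else len(data) - 1

    while(before >= 0 or after < len(data)):
        if(after < len(data) and isNotOutlier(data[after],threshAbove,threshBelow)):
            return data[after]
        after += 1
        if(before >= 0 and isNotOutlier(data[before],threshAbove,threshBelow)):
            return data[before]
        before -= 1

    # No value inside the thresholds was found, so leave the current value unchanged.
    return data[current]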


def getOutlierLists(data, distancePos, distanceNeg):
    outlierList = []
    outlierList.extend(data[data > distancePos].tolist())
    outlierList.extend(data[data < distanceNeg].tolist())

    outlierListIndecies = [i for i, j in enumerate(data) if j in outlierList]

    return(outlierList, outlierListIndecies)

def filloutliers(data):
    stad = std(data)
    mean = np.mean(data)
    distancePos = 3*stad + mean
    distanceNeg = (-3*stad) + mean

    (outlierList, outlierListIndecies) = getOutlierLists(data, distancePos, distanceNeg)
    
    toReplace =[]

    for i in range(len(outlierList)):
        toReplace.append(findNearestValue(data, outlierListIndecies[i] - 1, outlierListIndecies[i], outlierListIndecies[i] + 1, distancePos, distanceNeg))

    for i in range(len(outlierListIndecies)):
        data[outlierListIndecies[i]] = toReplace[i]
        
    return data
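
The same idea can also be written without explicit loops. The following is a minimal sketch, not the answer's method and not guaranteed to match Matlab's filloutliers in every case. The name fill_outliers_nearest and the n_std parameter are hypothetical. The bounds are assumed to be the mean plus or minus three sample standard deviations, and ties between equally distant neighbours are assumed to go to the right-hand one, mirroring the order in which the loop above checks them.

```python
import numpy as np


def fill_outliers_nearest(values, n_std=3.0):
    """Return a copy of values with outliers replaced by the nearest non-outlier.

    Hypothetical helper: "nearest" means nearest by index position, and a value
    is treated as an outlier when it lies outside mean +/- n_std * std.
    """
    data = np.asarray(values, dtype=float)
    mean = data.mean()
    std = data.std(ddof=1)  # sample standard deviation, like statistics.stdev
    lower, upper = mean - n_std * std, mean + n_std * std

    is_outlier = (data < lower) | (data > upper)
    good = np.flatnonzero(~is_outlier)  # indices of values to keep
    bad = np.flatnonzero(is_outlier)    # indices of values to replace
    filled = data.copy()
    if good.size == 0 or bad.size == 0:
        return filled  # nothing to replace, or nothing to replace it with

    # For each outlier index, locate the closest kept index on either side.
    pos = np.searchsorted(good, bad)
    left = good[np.clip(pos - 1, 0, good.size - 1)]
    right = good[np.clip(pos, 0, good.size - 1)]

    # Assumption: on a tie, prefer the right-hand neighbour.
    use_right = np.abs(right - bad) <= np.abs(bad - left)
    nearest = np.where(use_right, right, left)

    filled[bad] = data[nearest]
    return filled


# Small illustration: the single spike at index 10 is the only outlier.
sample = [1.0] * 10 + [1000.0] + [2.0] * 10
print(fill_outliers_nearest(sample)[10])  # prints 2.0 under the tie rule above
```

Because the function works on a copy, the input list or array is left untouched, which avoids the situation where an already-replaced value influences a later replacement.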