Python: Finding the worst elements that degrade the correlation in a DataFrame


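I want to find the worst records in a pandas.DataFrame, the ones that degrade the correlation, so that I can remove them as anomalies.

Given the following DataFrame: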

df = pd.DataFrame({'a':[1,2,3], 'b':[1,2,30]})

The correlation improves once the third row is removed:

print(df.corr())            # -> correlation is 0.88
print(df.iloc[0:2].corr())  # -> correlation is 1.00

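In this case, my question is how to identify the third row as the anomaly candidate that degrades the correlation.

My idea is to run a linear regression and compute the error of each element (row). However, I don't know a simple way to try this idea, and I also suspect there is a simpler, more direct approach.

Update

Of course, you can remove all the elements and achieve a correlation of 1. But I only want to find one (or a few) anomalous rows. Intuitively, I hope to get a non-trivial set of records that achieves a better correlation.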
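The regression idea mentioned in the question can be sketched roughly as follows. This is only an illustration under stated assumptions: the ten-row toy data (a straight line with one corrupted row) is made up, because with only three points the least-squares residuals are not informative, and `np.polyfit` is just one of several ways to fit a line.

```python
import numpy as np
import pandas as pd

# Toy data (an assumption for illustration): b == a except for one corrupted row.
df = pd.DataFrame({'a': np.arange(10.0), 'b': np.arange(10.0)})
df.loc[4, 'b'] = 20.0

# Fit b ~ slope * a + intercept by ordinary least squares.
slope, intercept = np.polyfit(df['a'], df['b'], 1)

# Absolute residual of every row with respect to the fitted line.
residuals = (df['b'] - (slope * df['a'] + intercept)).abs()

# The row with the largest residual is the first anomaly candidate.
worst = residuals.idxmax()

corr_before = df['a'].corr(df['b'])
remaining = df.drop(worst)
corr_after = remaining['a'].corr(remaining['b'])
print(worst, corr_before, corr_after)
```

Note that a single extreme point also pulls the fitted line toward itself, so the largest residual is a heuristic rather than a guarantee.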

First, you can brute-force it to get an exact solution:

import pandas as pd
import numpy as np
from itertools import combinations, chain

# two columns of random data, labelled 0 and 1
df = pd.DataFrame(np.random.randn(10, 2))

# set the maximal number of rows you are willing to remove
remove_up_to_n = 3

# all combinations of indices to keep, after removing 1..remove_up_to_n rows
to_keep = map(list, chain.from_iterable(
    combinations(df.index, df.shape[0] - i) for i in range(1, remove_up_to_n + 1)))

# find the index set with the highest remaining correlation
highest_correlation_index = max(to_keep, key=lambda ks: df.loc[ks].corr().iloc[0, 1])

df_remaining = df.loc[highest_correlation_index]


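This can be expensive. You can get a greedy approximation by adding a column that approximates each row's contribution to the correlation:

# product of each row's deviations from the two column means
df['CorComp'] = (df.iloc[:, 0].mean() - df.iloc[:, 0]) * (df.iloc[:, 1].mean() - df.iloc[:, 1])
df = df.sort_values('CorComp')

Now you can remove rows starting from the top, which may improve the correlation.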
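As a rough illustration of the greedy step, the sketch below ranks rows by that contribution column and drops the top one. The toy data and the variable names (`ranked`, `trimmed`) are assumptions made for this example. Sorting in ascending order puts the rows that pull a positive correlation down first.

```python
import numpy as np
import pandas as pd

# Toy data (an assumption for illustration): b == a except for one corrupted row.
df = pd.DataFrame({'a': np.arange(10.0), 'b': np.arange(10.0)})
df.loc[4, 'b'] = 20.0

x, y = df.iloc[:, 0], df.iloc[:, 1]

# Product of the deviations from the column means: negative values
# work against a positive correlation.
cor_comp = (x.mean() - x) * (y.mean() - y)
ranked = df.assign(CorComp=cor_comp).sort_values('CorComp')

corr_before = x.corr(y)

# Drop the single most harmful row and recompute the correlation.
trimmed = ranked.iloc[1:]
corr_after = trimmed['a'].corr(trimmed['b'])
print(ranked.index[0], corr_before, corr_after)
```

Because the means change after each removal, recomputing the column before dropping the next row is probably safer than removing several rows in one pass.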

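Your question is about outlier detection. There are many ways to perform this detection, but one simple way is to exclude values whose deviation from the mean exceeds a chosen multiple (x%) of the series' standard deviation.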


# Keep only values that deviate from the mean by at most 1.1 standard deviations of the series.
df[np.abs(df.b - df.b.mean()) <= (1.1 * df.b.std())]

# result
   a  b
0  1  1
1  2  2
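The same filter can be wrapped in a small helper so that the threshold is easy to adjust. This is a minimal sketch: `drop_outliers` and its parameter `k` are hypothetical names introduced here, not part of pandas.

```python
import pandas as pd

def drop_outliers(frame, column, k=1.1):
    """Keep rows whose `column` value lies within k sample standard deviations of the mean."""
    values = frame[column]
    mask = (values - values.mean()).abs() <= k * values.std()
    return frame[mask]

df = pd.DataFrame({'a': [1, 2, 3], 'b': [1, 2, 30]})
filtered = drop_outliers(df, 'b')
print(filtered)
```

On a sample this small the result is sensitive to `k`, since the outlier itself inflates both the mean and the standard deviation.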

For this DataFrame, removing any row will make the correlation 1.

Yes. Thanks @Rob, I updated the question.