Python 在数据帧上应用条件以过滤阵列时发出未来警告
我已经对大约1000个观察值的数组应用了PCA,但仅当原始数组中的一个特征=某个值时,才希望将观察值保留在新数组中 我有一个numpy数组Python 在数据帧上应用条件以过滤阵列时发出未来警告,python,arrays,pandas,numpy,future-warning,Python,Arrays,Pandas,Numpy,Future Warning,我已经对大约1000个观察值的数组应用了PCA,但仅当原始数组中的一个特征=某个值时,才希望将观察值保留在新数组中 我有一个numpy数组df2和一个数据帧df。我想查找df2中的所有行,其中df.Position是CDM 我的实际数据: df2 [[ -6.00987823e+00 4.46585005e+00] [ -7.09055159e+00 1.89437600e+00] [ -5.91044431e+00 -1.97888707e+00] [ -4.85698965
df2
和一个数据帧df
。我想查找df2
中的所有行,其中df.Position
是CDM
我的实际数据:
df2
[[ -6.00987823e+00 4.46585005e+00]
[ -7.09055159e+00 1.89437600e+00]
[ -5.91044431e+00 -1.97888707e+00]
[ -4.85698965e+00 -1.09936724e+00]
[ -4.01780368e-01 -2.57178392e+00]
[ -2.97351215e+00 -3.15940358e+00]
[ -4.27973589e+00 2.82707326e+00]
[ 3.95086576e+00 1.08281922e+00]
[ -2.94075361e+00 -1.95544661e+00]
[ -4.83788056e+00 2.32369496e+00]
[ -5.00473716e+00 -3.37680552e-01]
[ -4.88905829e+00 -1.55527476e+00]
[ -3.38202709e+00 -1.04402867e+00]
[ -2.14261510e+00 -5.30757477e-01]
[ 3.00813803e-01 -2.11010985e+00]
[ -2.67824986e+00 -1.83303905e+00]
[ -1.64547049e+00 -2.48056250e+00]
[ -2.92550543e+00 -3.02363170e+00]
[ -4.01116933e+00 2.90363840e+00]
[ -1.04571206e+00 7.58064433e-01]
[ 2.34068739e-01 -2.33981296e+00]
[ 3.15597517e+00 1.09429188e+00]
[ -3.83828970e+00 1.14195305e-01]
[ -7.33794066e-01 -3.70152816e+00]
[ 8.21789967e-01 -4.77818413e-01]
[ -3.29257688e+00 -1.61887349e+00]
[ -4.24297171e+00 2.27187714e+00]
[ 1.45714199e+00 -3.56024788e+00]
[ 1.79855738e+00 -3.71818328e-01]
[ 3.68171085e-01 -3.52961707e+00]
[ 3.77585412e+00 -3.01627595e-01]
[ -4.21740128e+00 -1.30913719e+00]
[ -3.85041585e+00 -1.05515969e+00]
[ -5.01752378e+00 4.67348167e-01]
[ 3.65943448e+00 9.21016483e-01]
[ 3.12159896e+00 -1.25707872e-01]
[ -4.50219722e+00 -4.06752784e+00]
[ -3.92172250e+00 -2.88567430e+00]
[ -2.68908475e-01 -2.17506629e+00]
[ -1.13728112e+00 -2.66843007e+00]
[ -8.73467957e-01 -1.24389494e+00]
[ 3.21966300e+00 -1.35271239e-01]
[ -4.31060796e+00 -1.90505910e+00]
[ 3.73904981e+00 7.70228802e-01]
[ 1.02646986e+00 -5.91828676e-01]
[ 8.43840480e-01 -1.49636218e+00]
[ 1.54065978e+00 -1.65086030e+00]
[ 2.96602068e+00 -7.41024474e-01]
[ 6.53636345e-01 3.04647288e-01]
[ 2.59236989e+00 -6.70435261e-02]
[ 2.00184665e-01 -1.55230314e+00]
[ -7.29533092e-01 -2.73390749e+00]
[ -2.93578745e+00 -2.18118257e+00]
[ -4.37481195e+00 1.02701222e+00]
[ 1.00713302e+00 -1.39943282e+00]
...]
df
(足球/足球中的简单位置-FB、CB、CDM、CM、AM、FW)
过滤时,我会得到以下输出(以及
FutureWarning
):
我哪里出了问题?如何适当过滤数据?未来警告可能是由于您的
numpy
和pandas
版本过时造成的。您可以使用以下方法升级它们:
pip install --upgrade numpy pandas
至于过滤,有很多选择。在这里,我提到每一个都有一些虚拟数据
设置
df
name colour a b c d e f
0 john red 1 2 3 4 5 6
1 james red 2 3 4 5 6 7
2 jane blue 1 2 3 5 7 8
df2
0 1
0 0.122 0.222
1 0.343 0.345
2 0.345 0.563
选项1布尔索引
选项2df.eval
请注意,即使df2
是表单的numpy
数组,这两个选项也可以工作:
array([[ 0.122, 0.222],
[ 0.343, 0.345],
[ 0.345, 0.563]])
对于实际数据,您需要按照相同的思路进行操作:
df2
array([[-6.01 , 4.466],
[-7.091, 1.894],
[-5.91 , -1.979],
[-4.857, -1.099],
[-0.402, -2.572],
[-2.974, -3.159],
[-4.28 , 2.827],
[ 3.951, 1.083],
[-2.941, -1.955],
[-4.838, 2.324],
[-5.005, -0.338],
[-4.889, -1.555],
[-3.382, -1.044],
[-2.143, -0.531],
[ 0.301, -2.11 ],
[-2.678, -1.833],
[-1.645, -2.481],
[-2.926, -3.024],
[-4.011, 2.904],
[-1.046, 0.758],
[ 0.234, -2.34 ],
[ 3.156, 1.094],
[-3.838, 0.114],
[-0.734, -3.702],
[ 0.822, -0.478],
[-3.293, -1.619],
[-4.243, 2.272],
[ 1.457, -3.56 ],
[ 1.799, -0.372],
[ 0.368, -3.53 ],
[ 3.776, -0.302],
[-4.217, -1.309]])
df
Position
0 FW
1 FW
2 FW
3 FW
4 FB
5 AM
6 FW
7 CB
8 AM
9 FW
10 AM
11 FW
12 AM
13 CM
14 FB
15 AM
16 CM
17 CM
18 FW
19 CM
20 CDM
21 CB
22 AM
23 FB
24 CDM
25 FW
26 FW
27 CDM
28 FB
29 CDM
30 CB
31 AM
df2[df.Position == 'CDM']
array([[ 0.234, -2.34 ],
[ 0.822, -0.478],
[ 1.457, -3.56 ],
[ 0.368, -3.53 ]])
我认为你需要:
嗨,谢谢你的回答。当过滤工作时,新阵列的输出意味着满足if条件的每个观测值具有相同的值,而不具有不同的值。如果你能帮忙的话,我已经更新了我的帖子。嗯,不知道你是否理解。但是有一个想法-在PCA之前不可能进行过滤,比如
arr=df.drop(['name','color'],axis=1)[mask].values
,然后使用PCA=PCA(n_components=2)
?我想在整个PCA之后进行过滤,但我想分离PCAOK之后各种事情的组,为什么我的答案中的解决方案失败了?不需要按条件筛选行?请参阅我的编辑。你和COLDSPEED的解决方案都完成了一半的工作。虽然它可以区分满足条件的行,但它不会从输出中忽略它们。相反,出于某种原因,它将符合条件的观测值的PCA“坐标”设置为第二行坐标,将不符合条件的观测值设置为第一行坐标-ords@COLDSPEED嗨,谢谢你的回答。当过滤工作时,新阵列的输出意味着满足if条件的每个观测值具有相同的值,而不具有不同的值。如果你能帮忙的话,我已经更新了我的帖子。
df2[df.eval('colour == "red"')]
Out[732]:
0 1
0 0.122 0.222
1 0.343 0.345
array([[ 0.122, 0.222],
[ 0.343, 0.345],
[ 0.345, 0.563]])
df2
array([[-6.01 , 4.466],
[-7.091, 1.894],
[-5.91 , -1.979],
[-4.857, -1.099],
[-0.402, -2.572],
[-2.974, -3.159],
[-4.28 , 2.827],
[ 3.951, 1.083],
[-2.941, -1.955],
[-4.838, 2.324],
[-5.005, -0.338],
[-4.889, -1.555],
[-3.382, -1.044],
[-2.143, -0.531],
[ 0.301, -2.11 ],
[-2.678, -1.833],
[-1.645, -2.481],
[-2.926, -3.024],
[-4.011, 2.904],
[-1.046, 0.758],
[ 0.234, -2.34 ],
[ 3.156, 1.094],
[-3.838, 0.114],
[-0.734, -3.702],
[ 0.822, -0.478],
[-3.293, -1.619],
[-4.243, 2.272],
[ 1.457, -3.56 ],
[ 1.799, -0.372],
[ 0.368, -3.53 ],
[ 3.776, -0.302],
[-4.217, -1.309]])
df
Position
0 FW
1 FW
2 FW
3 FW
4 FB
5 AM
6 FW
7 CB
8 AM
9 FW
10 AM
11 FW
12 AM
13 CM
14 FB
15 AM
16 CM
17 CM
18 FW
19 CM
20 CDM
21 CB
22 AM
23 FB
24 CDM
25 FW
26 FW
27 CDM
28 FB
29 CDM
30 CB
31 AM
df2[df.Position == 'CDM']
array([[ 0.234, -2.34 ],
[ 0.822, -0.478],
[ 1.457, -3.56 ],
[ 0.368, -3.53 ]])
from sklearn.decomposition import PCA
import pandas as pd
d = {'d': [4, 5, 5],
'a': [1, 2, 1],
'name': ['john', 'james', 'jane'],
'e': [5, 6, 7],
'f': [6, 7, 8], 'c': [3, 4, 3],
'b': [2, 3, 2],
'colour': ['red', 'red', 'blue']}
cols = ['name', 'colour', 'a', 'b', 'c', 'd', 'e', 'f']
df = pd.DataFrame(d, columns = cols)
print (df)
name colour a b c d e f
0 john red 1 2 3 4 5 6
1 james red 2 3 4 5 6 7
2 jane blue 1 2 3 5 7 8
#create mask by condition
mask = df['colour'] == 'red'
#for multiple values
#mask = df['colour'].isin(['red', 'green', 'blue'])
print (mask)
0 True
1 True
2 False
Name: colour, dtype: bool
#filter only numeric values and convert to numpy array
arr = df.drop(['name','colour'], axis=1).values
print (arr)
[[1 2 3 4 5 6]
[2 3 4 5 6 7]
[1 2 3 5 7 8]]
pca = PCA(n_components=5)
pca.fit(arr)
print (pca.components_ )
[[-0.0463861 -0.0463861 -0.0463861 -0.35279184 -0.65919758 -0.65919758]
[ 0.55515147 0.55515147 0.55515147 0.21897879 -0.11719389 -0.11719389]
[ 0.62531284 -0.13184966 -0.136648 -0.71363037 0.17840759 0.17840759]]
#filter by condition
arr1 = pca.components_ [mask]
print (arr1)
[[-0.0463861 -0.0463861 -0.0463861 -0.35279184 -0.65919758 -0.65919758]
[ 0.55515147 0.55515147 0.55515147 0.21897879 -0.11719389 -0.11719389]]