How do I find the closest values in a pandas Series to an input number?

python pandas dataframe

I have seen other questions on this, but they relate to vanilla Python and not pandas.

If I have this Series:

    ix  num
    0   1
    1   6
    2   4
    3   5
    4   2

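If I input 3, how can I (efficiently) find:

the index of 3, if it is found in the Series

the indexes of the values just below and just above 3, if it is not found in the Series

That is, with the above Series {1,6,4,5,2} and input 3, I should get the values (4,2) with indexes (2,4).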
You can use argsort() like this. Say, input = 3:

    In [198]: input = 3

    In [199]: df.iloc[(df['num']-input).abs().argsort()[:2]]
    Out[199]:
       num
    2    4
    4    2

df_sort is the dataframe with the two closest values:

    In [200]: df_sort = df.iloc[(df['num']-input).abs().argsort()[:2]]

For the index:

    In [201]: df_sort.index.tolist()
    Out[201]: [2, 4]

For the values:

    In [202]: df_sort['num'].tolist()
    Out[202]: [4, 2]

For reference, this is the df used in the solution above:

    In [197]: df
    Out[197]:
       num
    0    1
    1    6
    2    4
    3    5
    4    2

In addition to John Galt's answer, I recommend using iloc, because it works even with an unsorted integer index, whereas .ix looks at the index labels first:

    df.iloc[(df['num']-input).abs().argsort()[:2]]

If your series is already sorted, you could use something like this:

    import pandas as pd

    def closest(df, col, val, direction):
        # Count the rows with a value <= val. In a sorted column this is
        # the position of the first value greater than val.
        n = len(df[df[col] <= val])
        if direction < 0:
            n -= 1  # step back to the last value <= val
        if n < 0 or n >= len(df):
            print('err - value outside range')
            return None
        return df[col].iloc[n]  # positional lookup

    df = pd.DataFrame({'num': range(0, 10, 2)})
    for find in range(-1, 2):
        lc = closest(df, 'num', find, -1)
        hc = closest(df, 'num', find, 1)
        print('Closest to {} is {}, lower and {}, higher.'.format(find, lc, hc))

    df:
       num
    0    0
    1    2
    2    4
    3    6
    4    8

    err - value outside range
    Closest to -1 is None, lower and 0, higher.
    Closest to 0 is 0, lower and 2, higher.
    Closest to 1 is 0, lower and 2, higher.
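The point about positional selection can be checked on a frame whose integer index is deliberately out of order. The following is only a minimal sketch, assuming pandas is installed; the helper name closest_rows and the index labels are hypothetical and chosen for illustration.

```python
import pandas as pd

def closest_rows(df, col, target, k=2):
    """Return the k rows of df whose values in col are nearest to target."""
    # Absolute distance of every value from the target.
    distance = (df[col] - target).abs()
    # to_numpy() gives a plain array, so argsort yields positions, not labels.
    order = distance.to_numpy().argsort(kind="stable")
    # iloc selects by position, so the index labels play no role here.
    return df.iloc[order[:k]]

# Same values as in the question, but with an unsorted integer index
# (hypothetical labels).
df = pd.DataFrame({"num": [1, 6, 4, 5, 2]}, index=[10, 30, 20, 50, 40])
nearest = closest_rows(df, "num", 3)
print(nearest.index.tolist())   # [20, 40]
print(nearest["num"].tolist())  # [4, 2]
```

The stable sort keeps ties in their original row order, which is why 4 is listed before 2 here.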
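Besides not completely answering the question, the other algorithms discussed here have a further disadvantage: they have to sort the entire list, which results in a complexity of ~N log(N).

However, the same result can be achieved in ~N. This approach separates the dataframe into two subsets, one with values smaller and one with values larger than the desired value. The lower neighbour is then the largest value in the smaller subset, and the upper neighbour is the smallest value in the larger subset.

This gives the following code snippet:

    def find_neighbours(value, df, colname):
        exactmatch = df[df[colname] == value]
        if not exactmatch.empty:
            return exactmatch.index
        else:
            # Largest value below the target, smallest value above it
            lowerneighbour_ind = df[df[colname] < value][colname].idxmax()
            upperneighbour_ind = df[df[colname] > value][colname].idxmin()
            return [lowerneighbour_ind, upperneighbour_ind]

This approach is similar to partitioning, and it can be really useful when dealing with large datasets, where complexity becomes an issue.

Comparing the two strategies shows that for large N the partitioning strategy is indeed faster. For small N the sorting strategy is more efficient, as it is implemented at a much lower level. It is also a one-liner, which might increase code readability. The code to replicate this plot is shown below:

    from matplotlib import pyplot as plt
    import pandas
    import numpy
    import timeit

    value = 3
    sizes = numpy.logspace(2, 5, num=50, dtype=int)

    sort_results, partition_results = [], []

    for size in sizes:
        df = pandas.DataFrame({"num": 100 * numpy.random.random(size)})

        # autorange() returns (number of loops, total time taken)
        sort_results.append(timeit.Timer(
            "df.iloc[(df['num']-value).abs().argsort()[:2]].index",
            globals={'find_neighbours': find_neighbours, 'df': df, 'value': value}).autorange())
        partition_results.append(timeit.Timer(
            "find_neighbours(value, df, 'num')",
            globals={'find_neighbours': find_neighbours, 'df': df, 'value': value}).autorange())

    # Time per call
    sort_time = [time / amount for amount, time in sort_results]
    partition_time = [time / amount for amount, time in partition_results]

    plt.plot(sizes, sort_time)
    plt.plot(sizes, partition_time)
    plt.legend(['Sorting', 'Partitioning'])
    plt.title('Comparison of strategies')
    plt.xlabel('Size of Dataframe')
    plt.ylabel('Time in s')
    plt.savefig('speed_comparison.png')

If the series is already sorted, an efficient way of finding the indexes is to use the bisect functions. For example:

    idx = bisect_left(df['num'].values, 3)

Let's assume that the column col of the dataframe df is sorted.

If the value val is in the column, bisect_left returns the exact index of the value in the list, and bisect_right returns the index of the next position.

If the value is not in the list, both bisect_left and bisect_right return the same index: the position at which the value would have to be inserted to keep the list sorted.

Hence, to answer the question, the following code gives the index of val in col if it is found, and the indexes of the closest values otherwise. This solution works even when the values in the list are not unique.

    from bisect import bisect_left, bisect_right

    def get_closests(df, col, val):
        lower_idx = bisect_left(df[col].values, val)
        higher_idx = bisect_right(df[col].values, val)
        if higher_idx == lower_idx:  # val is not in the list
            return lower_idx - 1, lower_idx
        else:                        # val is in the list
            return lower_idx

Bisect algorithms are very efficient at finding the index of a specific value "val" in the dataframe column "col", or of its closest neighbours, but they require the list to be sorted.

You can use numpy.searchsorted. If your search column is not already sorted, you can create a sorted DataFrame and use argsort to remember the mapping between the two. (If you plan to look up the closest value more than once, this is better than the methods above.)

Once the data is sorted, find the closest values for your input like this:

    # searchsorted returns positions, so the values are read with iloc
    indLeft = np.searchsorted(df['column'], input, side='left')
    indRight = np.searchsorted(df['column'], input, side='right')

    valLeft = df['column'].iloc[indLeft]
    valRight = df['column'].iloc[indRight]

There are a lot of answers here and many of them are quite good. None has been accepted, and @Zero's answer is currently the highest rated. Another answer points out that it does not work when the index is not already sorted, but the solution he/she recommends appears to be deprecated.

I found that I could apply the numpy version of argsort() to the values themselves in the following manner, which works even when the index is not sorted:

    df.iloc[(df['num']-input).abs().values.argsort()[:2]]

See Zero's answer for context.

The most intuitive way I have found to solve this type of problem is the partition approach suggested by @ivo merchiers, but using nsmallest and nlargest. Besides working on unsorted series, a benefit of this approach is that you can easily get several close values by setting k_matches to a number greater than 1.

    import pandas as pd

    source = pd.Series([1, 6, 4, 5, 2])
    target = 3

    def find_closest_values(target, source, k_matches=1):
        # Smallest values at or above the target, largest values below it
        k_above = source[source >= target].nsmallest(k_matches)
        k_below = source[source < target].nlargest(k_matches)
        k_all = pd.concat([k_below, k_above]).sort_values()
        return k_all

    find_closest_values(target, source, k_matches=1)

Output:

    4    2
    2    4
    dtype: int64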
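The numpy.searchsorted suggestion describes sorting once and remembering the mapping back to the original rows, but does not show that step. Below is a minimal sketch of one way it could be done, assuming NumPy and pandas are installed; build_lookup and neighbours are hypothetical names, not part of either library.

```python
import numpy as np
import pandas as pd

def build_lookup(series):
    """Sort once; return the sorted values and the original labels in the same order."""
    values = series.to_numpy()
    order = np.argsort(values, kind="stable")  # positions that sort the values
    return values[order], series.index.to_numpy()[order]

def neighbours(sorted_values, labels, target):
    """Return the labels of exact matches, else of the values just below and above."""
    left = np.searchsorted(sorted_values, target, side="left")
    right = np.searchsorted(sorted_values, target, side="right")
    if left != right:
        # target is present at sorted positions left .. right-1
        return labels[left:right].tolist()
    # target is absent: left is the insertion point; drop out-of-range positions
    candidates = [i for i in (left - 1, left) if 0 <= i < len(sorted_values)]
    return labels[candidates].tolist()

source = pd.Series([1, 6, 4, 5, 2])
sorted_values, labels = build_lookup(source)
print(neighbours(sorted_values, labels, 3))  # [4, 2]
print(neighbours(sorted_values, labels, 4))  # [2]
```

After the one-off sort, each lookup is a binary search, so this layout pays off only when many targets are queried against the same column. The first result lists the label of the lower neighbour (value 2) before the upper one (value 4).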