Python: Find (only) the first row in a DataFrame that satisfies a given condition

python pandas

I have a DataFrame df with a very long column of random positive integers:

import numpy as np
import pandas as pd

df = pd.DataFrame({'n': np.random.randint(1, 10, size = 10000)})

I want to determine the index of the first even number in the column. One way to do this is:

df[df.n % 2 == 0].iloc[0]

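But this involves a lot of operations (building the boolean mask df.n % 2 == 0, evaluating df on that mask, and finally taking the first item) and is very slow. A loop like this is much faster: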

for j in range(len(df)):
    if df.n.iloc[j] % 2 == 0:
        break

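partly because the first match is likely to be in the first few rows. Is there a pandas method that does this with similar performance? Thank you.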

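NOTE: This condition (being an even number) is just an example. I am looking for a solution that works for any kind of condition on the values, i.e., a fast one-line alternative to: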

df[ conditions on df.n ].iloc[0]

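One option that lets you iterate over the rows and stop as soon as you are satisfied is DataFrame.iterrows, which is pandas' row iterator.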

In this case you could implement something like:

def get_first_row_with(condition, df):
    for index, row in df.iterrows():
        if condition(row):
            return index, row
    return None # Condition not met on any row in entire DataFrame

Then, given a DataFrame such as:

df = pd.DataFrame({
                    'cats': [1,2,3,4], 
                    'dogs': [2,4,6,8]
                  }, 
                  index=['Alice', 'Bob', 'Charlie', 'Eve'])

you could use it as:

def some_condition(row):
    return row.cats + row.dogs >= 7

index, row = get_first_row_with(some_condition, df)

# Use results however you like, e.g.:
print('{} is the first person to have at least 7 pets.'.format(index))
print('They have {} cats and {} dogs!'.format(row.cats, row.dogs))

which yields:

Charlie is the first person to have at least 7 pets.
They have 3 cats and 6 dogs!

I did some timings, and yes: using a generator will usually give you faster results.

df = pd.DataFrame({'n': np.random.randint(1, 10, size = 10000)})

%timeit df[df.n % 2 == 0].iloc[0]
%timeit df.iloc[next(k for k,v in df.iterrows() if v.n % 2 == 0)]
%timeit df.iloc[next(t[0] for t in df.itertuples() if t.n % 2 == 0)]

I get:

1000 loops, best of 3: 1.09 ms per loop
1000 loops, best of 3: 619 µs per loop # <-- iterrows generator
1000 loops, best of 3: 1.1 ms per loop
10000 loops, best of 3: 25 µs per loop # <--- your solution

However, as the DataFrame gets larger, the difference disappears:

10 loops, best of 3: 40.5 ms per loop 
10 loops, best of 3: 40.7 ms per loop # <--- iterrows
10 loops, best of 3: 56.9 ms per loop
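
One detail the generator one-liners above leave open is what happens when no row matches: a bare next(...) raises StopIteration. A minimal sketch of one way to handle that, assuming a hypothetical helper name first_index_where (not from the original answer) and using the two-argument form of next():

```python
import pandas as pd


def first_index_where(df, predicate, default=None):
    """Return the index label of the first row that satisfies predicate."""
    # itertuples() yields namedtuples lazily; t.Index holds the row label.
    matches = (t.Index for t in df.itertuples() if predicate(t))
    # The second argument to next() is returned when nothing matches,
    # instead of raising StopIteration.
    return next(matches, default)


df = pd.DataFrame({'n': [3, 5, 8, 1, 4]})
print(first_index_where(df, lambda t: t.n % 2 == 0))  # 2
print(first_index_where(df, lambda t: t.n > 100))     # None
```

The helper returns an index label rather than a position, so the result should be used with df.loc, not df.iloc, when the index is not the default 0..len-1 range.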

Just for fun, I decided to try a few possibilities. I use the DataFrame:

MAX = 10**7
df = pd.DataFrame({'n': range(MAX)})

(Not random this time.) I want to find the first row in which n >= N for some value of N. I timed the following four versions:

def getfirst_pandas(condition, df):
    return df[condition(df)].iloc[0]

def getfirst_iterrows_loop(condition, df):
    for index, row in df.iterrows():
        if condition(row):
            return index, row
    return None

def getfirst_for_loop(condition, df):
    for j in range(len(df)):
        if condition(df.iloc[j]):
            break
    return j

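def getfirst_numpy_argmax(condition, df):
    # as_matrix() was removed in pandas 1.0; to_numpy() is its replacement
    array = df.to_numpy()
    # argmax returns the position of the first True (and 0 if none is True)
    imax  = np.argmax(condition(array))
    return df.index[imax]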
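
One caveat with the argmax version: np.argmax returns 0 for an all-False mask, so it reports the first row even when nothing matches. A minimal sketch of a guarded variant, assuming a hypothetical helper name first_label_where and a Series as input (as in the timing calls of this answer):

```python
import numpy as np
import pandas as pd


def first_label_where(mask_fn, series):
    """Return the label of the first element where mask_fn is True, else None."""
    values = series.to_numpy()
    if values.size == 0:            # np.argmax raises on an empty array
        return None
    mask = mask_fn(values)          # vectorized boolean array
    pos = int(np.argmax(mask))      # position of the first True, or 0 if none
    if not mask[pos]:               # all-False mask: argmax fell back to 0
        return None
    return series.index[pos]


s = pd.Series(range(10), index=list("abcdefghij"))
print(first_label_where(lambda x: x >= 4, s))    # e
print(first_label_where(lambda x: x >= 100, s))  # None
```

This still scans the whole array, so it does not change the timing behaviour; it only makes the "no match" case distinguishable from "match in the first row".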

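I ran these four versions with N set to powers of ten. Of course, the numpy (optimized C) code is expected to be faster than for loops in Python, but I wanted to see for which values of N Python loops are still acceptable.

I timed the lines: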
getfirst_pandas(lambda x: x.n >= N, df)
getfirst_iterrows_loop(lambda x: x.n >= N, df)
getfirst_for_loop(lambda x: x.n >= N, df)
getfirst_numpy_argmax(lambda x: x >= N, df.n)

for N = 1, 10, 100, 1000, .... This is a log-log plot of the performance:

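The simple for loop is fine as long as the "first True position" is expected to be near the beginning, but beyond that it gets bad. np.argmax is the safest solution.

As the plot shows, the times for pandas and argmax remain (almost) constant, because they always scan the whole array. It would be ideal to have a np or pandas method that doesn't.

Zip the index and the column together, then loop over the result for faster looping. zip gives the fastest looping performance, faster than iterrows() or itertuples().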
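
The zip suggestion comes without code here, so the following is only a sketch of what it might look like, assuming a small example DataFrame and the "first even number" condition from the question:

```python
import pandas as pd

df = pd.DataFrame({'n': [3, 5, 8, 1, 4]}, index=list('abcde'))

# Pair each index label with its value and stop at the first match.
first_label = next(
    (label for label, value in zip(df.index, df['n']) if value % 2 == 0),
    None,  # returned when no value satisfies the condition
)
print(first_label)  # c
```

Because zip is lazy, the loop stops at the first match without building a row object for each element, which is the overhead iterrows() pays.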
TL;DR: you can use next(j for j in range(len(df)) if df.at[j, "n"] % 2 == 0)


I think it is perfectly possible to write this as a one-liner. Let's define a DataFrame to demonstrate:
df = pd.DataFrame({'n': np.random.randint(1, 10, size = 100000)})

First, your code gives:
for j in range(len(df)):
    if df.n.iloc[j] % 2 == 0:
        break
# 22.1 µs ± 1.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Converting it to a one-liner gives:
next(j for j in range(len(df)) if df["n"].iloc[j] % 2 == 0)
# 20.6 µs ± 1.26 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

To speed this up further, we can use at instead of iloc, since it is faster when accessing a single value:
next(j for j in range(len(df)) if df.at[j, "n"] % 2 == 0)
# 8.88 µs ± 617 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
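
Note that df.at[j, "n"] looks up the label j, so pairing it with range(len(df)) relies on the default 0..len-1 index. A sketch of a position-based variant, assuming a DataFrame with a non-default index made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'n': [3, 5, 8, 1, 4]}, index=[10, 20, 30, 40, 50])

# Iterating over the underlying NumPy array is purely positional, so it
# works for any index; next() falls back to None if nothing matches.
values = df['n'].to_numpy()
pos = next((j for j, v in enumerate(values) if v % 2 == 0), None)

print(pos)            # 2
print(df.index[pos])  # 30
```

With this index, df.at[0, "n"] would raise a KeyError, because 0 is not one of the labels.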

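Why not just use this loop?

Is the column sorted? If so, you can try np.searchsorted. If not, I don't think there is any vectorized solution other than pre-sorting.

@RNar: I am learning pandas and I would like to know how to do this in pandas. @ayhan: yes, the column is sorted. But how can I use np.searchsorted to specify a complex condition? For example, how would I find the first even number?

If, as you say, the condition is usually met in the first few rows, you can do df.iloc[:x, df.A > 3.5].iloc[0] to search only the first x rows. If nothing is found then, depending on your data and your choice of x, search the next x rows, and so on. Otherwise I would probably try a numba function from one of the answers that ayhan linked.

At the end of the day, a condition on a df is a very broad question, with different approaches depending on the exact condition. Either way, it is hard to get away from an element-wise comparison against the Series/column. The .iloc[0], or whatever you put at the end, is not the expensive part.

I agree. I expect that breaking out of the loop when the target row is hit, and so skipping the rows below it, will save more time than finding the fastest way to iterate over all rows (especially on large DataFrames).

Thanks Anton, I think I will end up accepting that writing a loop in my code is the fastest option.

I think your comparison is unfair, because with the one-liners you access the row of the DataFrame where n % 2 == 0, whereas with the for loop you don't. For a fair comparison, you could add df.iloc[j] to the three-line code, or remove the df.iloc around the next statement.

If I can't find an alternative to the for loop, I will accept your answer. I have tested this for loop against the original pandas version: if the condition is met near the start of the array it seems to have similar performance, and after that it becomes less efficient (see the graph in my answer).

At least someone mentions that the complexity of the for loop depends on the position of the expected result... Yes, the OP's solution is not the fastest, as most people claim...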
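
The comment about searching the first x rows and then the next x rows can be sketched as a chunked scan. This is an illustration only, assuming a hypothetical helper name first_position_chunked and an arbitrary default chunk size; it is not code from the thread:

```python
import numpy as np
import pandas as pd


def first_position_chunked(series, condition, chunk_size=1024):
    """Scan series in blocks; return the position of the first match, or None."""
    values = series.to_numpy()
    for start in range(0, len(values), chunk_size):
        # The condition is evaluated in vectorized form, one block at a time.
        mask = condition(values[start:start + chunk_size])
        if mask.any():
            # argmax gives the offset of the first True inside this block.
            return start + int(np.argmax(mask))
    return None


s = pd.Series([1, 3, 5, 7, 9, 2, 4])
print(first_position_chunked(s, lambda x: x % 2 == 0, chunk_size=2))  # 5
print(first_position_chunked(s, lambda x: x > 100, chunk_size=2))     # None
```

The idea is a compromise between the two extremes in the answers: each block is checked at NumPy speed, but the scan stops at the first block that contains a match instead of always covering the whole column.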