Python: separating contiguous regions in a DataFrame with values above a certain threshold
Tags: python, numpy, pandas
I have a DataFrame of indices and values between 0 and 1, like this:
6 0.047033
7 0.047650
8 0.054067
9 0.064767
10 0.073183
11 0.077950
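I want to retrieve tuples of the start and end points of every region that contains more than 5 consecutive values, all of which exceed a certain threshold (for example 0.5), so that I end up with something like this: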
[(150, 185), (632, 680), (1500,1870)]
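Here the first tuple is a region that starts at index 150, has 35 values in a row that are all above 0.5, and ends at index 185 (exclusive).
I started by keeping only the values at or above 0.5, like this: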
df = df[df['values'] >= 0.5]
Now I have values like this:
632 0.545700
633 0.574983
634 0.572083
635 0.595500
636 0.632033
637 0.657617
638 0.643300
639 0.646283
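One way to continue from that filtered frame, sketched here as an illustration and not taken from the answers below, is to look for gaps in the surviving index labels: a new run starts wherever the difference to the previous label is not 1. The toy data and the names `MIN_RUN`, `run_id` and `regions` are assumptions made for this example, and the sketch assumes an integer index with step 1.

```python
import pandas as pd

# Hypothetical toy data with a default integer index 0..11.
df = pd.DataFrame(
    {"values": [0.1, 0.6, 0.7, 0.2, 0.6, 0.6, 0.7, 0.8, 0.9, 0.6, 0.3, 0.9]}
)

MIN_RUN = 5  # keep runs with more than this many rows (assumed from the question)

# Same filtering step as in the question.
filtered = df[df["values"] >= 0.5]

# Surviving index labels as a Series, so that diff() can be applied to them.
labels = filtered.index.to_series()

# A new run starts wherever the gap to the previous surviving label is not 1.
run_id = (labels.diff() != 1).cumsum()

# First label, last label and length of every run.
runs = labels.groupby(run_id).agg(["first", "last", "count"])
long_enough = runs[runs["count"] > MIN_RUN]

# Half-open (start, end) tuples, end exclusive, as in the question.
regions = [
    (int(a), int(b) + 1)
    for a, b in zip(long_enough["first"], long_enough["last"])
]
print(regions)
```

The tuples are built as half-open intervals only to match the `(150, 185)` convention described in the question.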
I cannot show my actual dataset, but the following should be a good representation of it:
import numpy as np
import pandas as pd
np.random.seed(seed=901212)
df = pd.DataFrame(range(1, 501), columns=['indices'])
df['values'] = np.random.rand(500)*.5 + .35
which yields:
1 0.491233
2 0.538596
3 0.516740
4 0.381134
5 0.670157
6 0.846366
7 0.495554
8 0.436044
9 0.695597
10 0.826591
...
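Here the region (2, 4) has two values above 0.5, but that would be too short. On the other hand, the region (25, 44), with 19 values above 0.5 in a row, would be added to the list.

I think this prints what you want. It is based heavily on Joe Kington's answer (https://stackoverflow.com/a/4495197/3751373); I guess it is appropriate to up-vote that.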
import numpy as np
import pandas as pd

# from Joe Kington's answer here https://stackoverflow.com/a/4495197/3751373
# with minor edits
def contiguous_regions(condition):
    """Finds contiguous True regions of the boolean array "condition". Returns
    a 2D array where the first column is the start index of the region and the
    second column is the end index (exclusive)."""
    # Find the indices of changes in "condition"
    d = np.diff(condition, n=1, axis=0)
    idx, _ = d.nonzero()
    # We need to start things after the change in "condition". Therefore,
    # we'll shift the index by 1 to the right. -JK
    # LB: incrementing in place with += raised
    # "ValueError: output array is read-only", so add 1 out of place instead
    idx = idx + 1
    if condition[0]:
        # If the start of condition is True prepend a 0
        idx = np.r_[0, idx]
    if condition[-1]:
        # If the end of condition is True, append the length of the array
        idx = np.r_[idx, condition.size]  # Edit
    # Reshape the result into two columns
    return idx.reshape(-1, 2)

def main():
    RUN_LENGTH_THRESHOLD = 5
    VALUE_THRESHOLD = 0.5
    np.random.seed(seed=901212)
    data = np.random.rand(500)*.5 + .35
    df = pd.DataFrame(data=data, columns=['values'])
    # df.values is a 2D array of shape (500, 1), hence axis=0 in np.diff above
    match_bools = df.values > VALUE_THRESHOLD
    print('with boolean array')
    for start, stop in contiguous_regions(match_bools):
        if stop - start > RUN_LENGTH_THRESHOLD:
            print(start, stop)

if __name__ == '__main__':
    main()
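The same edge-detection idea can be written more compactly for a one-dimensional input by padding the boolean mask with `False` on both sides, so that runs touching either end need no special casing. This is a minimal sketch under that assumption, not the answer's code; `runs_above` and its parameters are hypothetical names chosen for the illustration.

```python
import numpy as np

def runs_above(values, threshold, min_length):
    """Return half-open (start, stop) pairs of runs longer than min_length
    in which every value exceeds threshold. Expects a 1D sequence."""
    mask = np.asarray(values) > threshold
    # Pad with False so every run has both a rising and a falling edge.
    padded = np.concatenate(([False], mask, [False]))
    # Positions where the padded mask changes value; they alternate
    # between run starts and (exclusive) run stops.
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    starts, stops = edges[::2], edges[1::2]
    keep = (stops - starts) > min_length
    return list(zip(starts[keep].tolist(), stops[keep].tolist()))

# Hand-made example: a run of 2 (too short) and a run of 6 (kept).
example = [0.1, 0.6, 0.7, 0.1, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.2]
result = runs_above(example, threshold=0.5, min_length=5)
print(result)
```

For a DataFrame column this could be called as `runs_above(df['values'].to_numpy(), 0.5, 5)`; the returned positions are array offsets, not index labels.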
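I would be surprised if there is not a more elegant way.

You can find the first and last element of each contiguous region by comparing the series with its values shifted by one row, and then keep only the pairs that are far enough apart: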
# tag rows based on the threshold
df['tag'] = df['values'] > .5
# first row is a True preceded by a False
# (fill_value=False keeps the shifted series boolean, so ~ stays a logical NOT)
fst = df.index[df['tag'] & ~df['tag'].shift(1, fill_value=False)]
# last row is a True followed by a False
lst = df.index[df['tag'] & ~df['tag'].shift(-1, fill_value=False)]
# filter those which are adequately apart (more than 5 rows, both ends included)
pr = [(i, j) for i, j in zip(fst, lst) if j > i + 4]
For example, the first region is:
>>> i, j = pr[0]
>>> df.loc[i:j]
indices values tag
15 16 0.639992 True
16 17 0.593427 True
17 18 0.810888 True
18 19 0.596243 True
19 20 0.812684 True
20 21 0.617945 True
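To check the shift-based approach on input whose answer is known in advance, it can be wrapped in a small function and run on a hand-made series. This is only a sketch: `regions_by_shift`, its parameters and the toy values are assumptions for the illustration, it assumes a default integer index, and it returns half-open `(start, end)` tuples to match the format requested in the question, whereas the pairs above include both ends.

```python
import pandas as pd

def regions_by_shift(values, threshold=0.5, min_run=5):
    """Half-open (start, end) index pairs of runs with more than min_run
    consecutive values above threshold. Assumes an integer index with step 1."""
    tag = values > threshold
    # First row of a run: True preceded by False (or by nothing).
    first = tag & ~tag.shift(1, fill_value=False)
    # Last row of a run: True followed by False (or by nothing).
    last = tag & ~tag.shift(-1, fill_value=False)
    starts = values.index[first]
    ends = values.index[last]
    return [
        (int(i), int(j) + 1)
        for i, j in zip(starts, ends)
        if j - i + 1 > min_run
    ]

# Hand-made example: runs of length 1, 6 and 1; only the middle one is kept.
s = pd.Series([0.9, 0.1, 0.6, 0.7, 0.8, 0.9, 0.6, 0.7, 0.2, 0.6])
found = regions_by_shift(s)
print(found)
```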
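Comments:
Nice solution. Do you think this approach stays reasonably performant even on larger datasets?
@Higany I think the best answer is to try it and see how it scales.
@Higany These are all vectorized operations (except for getting the actual indexer), so performance should be quite good.
Thanks, but if possible I would prefer a pandas-based version.
For performance reasons, the NumPy version may be preferable.
pandas is written with performance in mind. I would not be surprised if the solution in @behzad.nouri's answer is just as fast.
[...] can be used for quick profiling.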