Python: repeatedly compare with the next row until a criterion is met
I want to filter the DataFrame df:
Id Timestamp Data diff1
10856167 18675685 2010-03-01 05:58:15.520 25.0 0.0
10856168 18675686 2010-03-01 05:58:16.863 26.0 1.0
10856169 18675687 2010-03-01 05:58:18.203 30.5 4.5
10856170 18675688 2010-03-01 05:58:19.543 40.5 10.0
10856171 18675689 2010-03-01 05:58:20.877 42.0 1.5
10856172 18675690 2010-03-01 05:58:22.223 43.0 1.0
10856175 18675693 2010-03-01 05:58:41.127 42.5 -0.5
10856176 18675694 2010-03-01 05:58:42.503 42.0 -0.5
10856177 18675695 2010-03-01 05:58:49.313 42.5 0.5
10856178 18675696 2010-03-01 05:58:50.663 43.0 0.5
10856181 18675699 2010-03-01 05:59:01.443 43.5 0.5
10856182 18675700 2010-03-01 05:59:02.797 42.0 -1.5
10856183 18675701 2010-03-01 05:59:04.153 41.5 -0.5
10856184 18675702 2010-03-01 05:59:05.497 41.0 -0.5
10856185 18675703 2010-03-01 05:59:29.880 41.5 0.5
10856186 18675704 2010-03-01 05:59:31.220 42.0 0.5
10856191 18675709 2010-03-01 05:59:42.053 42.5 0.5
10856192 18675710 2010-03-01 05:59:43.407 43.0 0.5
10856193 18675711 2010-03-01 05:59:44.753 42.0 -1.0
10856218 18675736 2010-03-01 06:05:21.360 41.5 -0.5
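by comparing df['Data'] in the current row with the next row. If the absolute difference between the two values is greater than 1, the next row is kept and becomes the new current row; otherwise the next row is dropped and the current row is compared with the row after that, until a row that satisfies the condition is found. I tried diff() and shift(), but they only compare adjacent rows.
So the expected output is: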
Id Timestamp Data diff1
10856167 18675685 2010-03-01 05:58:15.520 25.0 0.0
10856169 18675687 2010-03-01 05:58:18.203 30.5 4.5
10856170 18675688 2010-03-01 05:58:19.543 40.5 10.0
10856171 18675689 2010-03-01 05:58:20.877 42.0 1.5
10856181 18675699 2010-03-01 05:59:01.443 43.5 0.5
10856182 18675700 2010-03-01 05:59:02.797 42.0 -1.5
What is the best way to do this?
Update. My attempt:
import numpy as np
from numba import njit

@njit
def f(x, lim):
    total = x[0]
    result = np.empty(len(x), dtype=bool)
    result[0] = True
    for j, i in enumerate(x[1:], 1):
        if abs(total - i) <= lim:
            result[j] = False
        else:
            total = i
            result[j] = True
    return result

N = 1
df1 = sample[f(sample.Data.values, N)]
print(df1)
---> 18 df1 = sample[f(sample.Data.values, N)]
     19 print(df1)

~/opt/anaconda3/lib/python3.7/site-packages/numba/core/dispatcher.py in _compile_for_args(self, *args, **kws)
    399                 e.patch_message(msg)
    400
--> 401             error_rewrite(e, 'typing')
    402         except errors.UnsupportedError as e:
    403             # Something unsupported is present in the user code, add help info

~/opt/anaconda3/lib/python3.7/site-packages/numba/core/dispatcher.py in error_rewrite(e, issue_type)
    342                 raise e
    343             else:
--> 344                 reraise(type(e), e, None)
    345
    346         argtypes = []

~/opt/anaconda3/lib/python3.7/site-packages/numba/core/utils.py in reraise(tp, value, tb)
     77         value = tp()
     78     if value.__traceback__ is not tb:
---> 79         raise value.with_traceback(tb)
     80     raise value
     81

TypingError: Failed in nopython mode pipeline (step: nopython frontend)
non-precise type array(pyobject, 1d, C)
[1] During: typing of argument at (5)

File "", line 5:
def f(x, lim):
    total = x[0]
    ^
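A note on reading this error: numba reports array(pyobject, 1d, C) when the array handed to the jitted function has object dtype, which it cannot type in nopython mode. The following is a minimal sketch of checking for that and converting, assuming (the question does not show this) that the Data column was stored as Python objects; the values are made up for illustration.

```python
import numpy as np

# Hypothetical input: numbers stored as generic Python objects, as can happen
# when a column is read with mixed or string-like content.
values = np.array([25.0, 26.0, 30.5], dtype=object)
print(values.dtype)   # object

# Converting to a concrete numeric dtype gives numba something it can type.
numeric = values.astype(np.float64)
print(numeric.dtype)  # float64
```

If sample.Data.dtype prints object, passing something like sample.Data.astype(float).values to f may be worth trying.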
If performance matters, I think numba is the way to handle the loop:
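import numpy as np
from numba import njit

@njit
def f(x, lim):
    # total holds the Data value of the last row that was kept
    total = x[0]
    # The original answer used np.bool8, an alias of np.bool_ that was
    # removed in NumPy 2.0.
    result = np.empty(len(x), dtype=np.bool_)
    # The first row is always kept
    result[0] = True
    for j, i in enumerate(x[1:], 1):
        if abs(total - i) <= lim:
            # Too close to the last kept value: drop this row
            result[j] = False
        else:
            # Far enough: keep this row and make it the new reference
            total = i
            result[j] = True
    return result

N = 1
# sample is the DataFrame being filtered (df in the question)
df1 = sample[f(sample.Data.values, N)]
print(df1)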
Id Timestamp Data diff1
10856167 18675685 2010-03-01 05:58:15.520 25.0 0.0
10856169 18675687 2010-03-01 05:58:18.203 30.5 4.5
10856170 18675688 2010-03-01 05:58:19.543 40.5 10.0
10856171 18675689 2010-03-01 05:58:20.877 42.0 1.5
10856181 18675699 2010-03-01 05:59:01.443 43.5 0.5
10856182 18675700 2010-03-01 05:59:02.797 42.0 -1.5
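For reference, the same loop can be written without numba. This is a minimal sketch in plain Python, not part of the original answer: keep_mask is a hypothetical helper name, and the values are copied from the Data column in the question.

```python
def keep_mask(values, lim):
    """Return booleans: True where a value differs from the last kept value by more than lim."""
    if not values:
        return []
    anchor = values[0]          # last kept value
    mask = [True]               # the first row is always kept
    for v in values[1:]:
        if abs(anchor - v) > lim:
            anchor = v          # this row becomes the new reference
            mask.append(True)
        else:
            mask.append(False)
    return mask

# Data column from the question, in row order
data = [25.0, 26.0, 30.5, 40.5, 42.0, 43.0, 42.5, 42.0, 42.5, 43.0,
        43.5, 42.0, 41.5, 41.0, 41.5, 42.0, 42.5, 43.0, 42.0, 41.5]

mask = keep_mask(data, 1)
kept = [v for v, k in zip(data, mask) if k]
print(kept)  # [25.0, 30.5, 40.5, 42.0, 43.5, 42.0]
```

With pandas, the resulting list should be usable as a boolean mask, for example df[keep_mask(df['Data'].tolist(), 1)], at the cost of running the loop at Python speed.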
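@jezrael, please see the edited question for more data. Many thanks.
Thanks for your solution. It returned an error; please see the update to the question.
@nilsinelabore - Tested; the problem seems to be dtype=bool, it needs dtype=np.bool8.
Great, it works! Could you explain the code a little, especially the line for j, i in enumerate(x[1:], 1)?
@nilsinelabore - It means all rows are processed except the first, because the first is always True. So x[1:] omits the first value, and the enumeration starts at 1 because of the trailing , 1).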
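The explanation of enumerate(x[1:], 1) can be checked on a tiny list. This is a minimal sketch using the first three Data values: the slice skips the first element, and the start argument keeps the counter aligned with positions in the original sequence.

```python
x = [25.0, 26.0, 30.5]

# x[1:] drops the first value; start=1 makes j match the index in x
pairs = list(enumerate(x[1:], 1))
print(pairs)  # [(1, 26.0), (2, 30.5)]

# Each j indexes the same element in the original list
for j, i in pairs:
    assert x[j] == i
```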
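Hi jezrael, could you take a look at this question? Thanks.
Thanks for your answer. I think it differs from the expected output. I am looking for the next row whose Data differs from the current Data by more than 1, but the last two rows in the output differ by less than 1.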
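i = 0
for row in range(2, len(df)):
    i += 1
    if i <= len(df) - 1:
        # Blank diff1 when this row is within 1 of the row directly before it
        if -1 <= df.iloc[i,:]['Data'] - df.iloc[i-1,:]['Data'] <= 1:
            df.iloc[i,3] = ''
# Blank every column of the rows marked above
df.loc[df['diff1'] == '',:] = ''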
Id Timestamp Data diff1
10856167 18675685 2010-03-01 05:58:15.520 25 0
10856168
10856169 18675687 2010-03-01 05:58:18.203 30.5 4.5
10856170 18675688 2010-03-01 05:58:19.543 40.5 10
10856171 18675689 2010-03-01 05:58:20.877 42 1.5
10856172
10856175
10856176
10856177
10856178
10856181
10856182 18675700 2010-03-01 05:59:02.797 42 -1.5
10856183
10856184
10856185
10856186
10856191
10856192
10856193
10856218 18675736 2010-03-01 06:05:21.360 41.5 -0.5
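The loop above compares each row with the row directly before it, which is the information diff1 already holds, and it stops one row short, which is why the final row still appears in its output. The sketch below applies that adjacent comparison to every row, using plain Python and values copied from the question's Data column. It shows where this departs from the expected output: 43.5 (row 10856181) is only 0.5 away from the preceding 43.0, so it is dropped here, although it is 1.5 away from the last kept value 42.0.

```python
# Data column from the question, in row order
data = [25.0, 26.0, 30.5, 40.5, 42.0, 43.0, 42.5, 42.0, 42.5, 43.0,
        43.5, 42.0, 41.5, 41.0, 41.5, 42.0, 42.5, 43.0, 42.0, 41.5]

# Keep the first row, then keep a row only if it differs from its
# immediate predecessor by more than 1.
adjacent_keep = [True] + [abs(b - a) > 1 for a, b in zip(data, data[1:])]

kept = [v for v, k in zip(data, adjacent_keep) if k]
print(kept)  # [25.0, 30.5, 40.5, 42.0, 42.0]
```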