Python: identifying the index values of non-contiguous zeros
I have a dataframe of negative numbers and zeros with a datetime index. I would like to be able to determine: (1) the start and end dates of each run of non-contiguous, non-zero values; (2) the number of days between those two dates; and (3) the minimum value between those two dates. For example, if my dataframe looks like this:
DATE VAL
2007-06-26 0.000000
2007-06-27 0.000000
2007-06-28 0.000000
2007-06-29 -0.006408
2007-07-02 0.000000
2007-07-03 0.000000
2007-07-04 -0.000003
2007-07-05 0.000000
2007-07-06 0.000000
2007-07-09 0.000000
2007-07-10 -0.018858
2007-07-11 -0.015624
2007-07-12 0.000000
2007-07-13 0.000000
2007-07-16 -0.008562
2007-07-17 -0.006587
I would like the output to look like this:
START END DAYS MIN
2007-06-29 2007-06-29 1 -0.006408
2007-07-04 2007-07-04 1 -0.000003
2007-07-10 2007-07-11 2 -0.018858
2007-07-16 2007-07-17 2 -0.008562
It would be even better if the day count excluded weekends (i.e. 7/13 to 7/16 would count as 1 day), but I realize that is generally complicated.
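As an aside, the weekend exclusion is less complicated than it may seem: NumPy ships np.busday_count, which counts weekdays in a half-open interval. A minimal sketch, using the 7/13-to-7/16 example from above:

```python
import numpy as np

# np.busday_count counts weekdays in the half-open interval [start, end),
# so the weekend between Friday 2007-07-13 and Monday 2007-07-16 is skipped
# and the distance counts as a single business day.
gap = np.busday_count('2007-07-13', '2007-07-16')
print(gap)  # -> 1
```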
The numpy.argmax/min methods seem to implement a version of what I want, but setting axis=1 per the documentation does not return the set of index values I expected.
EDIT: I should have specified that I am looking for a solution that does not require a loop.

First create a flag to identify the non-zero records and assign each run to the same group, then group by that flag and compute the desired attributes:
import numpy as np
import pandas as pd

(
    df.assign(Flag=np.where(df.VAL.ge(0), 1, np.nan))
      .assign(Flag=lambda x: x.Flag.fillna(x.Flag.cumsum().ffill()))
      .loc[lambda x: x.Flag.ne(1)]
      .groupby('Flag')
      .apply(lambda x: [x.DATE.iloc[0], x.DATE.iloc[-1], len(x), x.VAL.min()])
      .apply(pd.Series)
      .set_axis(['START', 'END', 'DAYS', 'MIN'], axis=1, inplace=False)
)
START END DAYS MIN
Flag
3.0 2007-06-29 2007-06-29 1 -0.006408
5.0 2007-07-04 2007-07-04 1 -0.000003
8.0 2007-07-10 2007-07-11 2 -0.018858
10.0 2007-07-16 2007-07-17 2 -0.008562
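To see why the Flag/cumsum trick labels each run of negatives, here is a small trace of the same idea on a toy series (the data is illustrative, not from the question):

```python
import numpy as np
import pandas as pd

val = pd.Series([0.0, 0.0, -1.0, 0.0, -2.0, -3.0, 0.0])
# zeros are flagged 1, negatives NaN
flag = pd.Series(np.where(val.ge(0), 1, np.nan))
# cumsum counts the zeros seen so far; forward-filling that count onto the
# NaN positions gives every run of negatives its own distinct label
group = flag.fillna(flag.cumsum().ffill())
print(group.tolist())  # [1.0, 1.0, 2.0, 1.0, 3.0, 3.0, 1.0]
# dropping rows labelled 1 leaves only the negative runs, ready to group
print(val[group.ne(1)].groupby(group).min().tolist())  # [-1.0, -3.0]
```

One caveat: a negative run starting at the very first or second row would inherit label NaN or 1 and be silently dropped, so this scheme relies on the series opening with at least two zeros.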
You can use the following approach. First read the dataframe from the file:

import pandas as pd
df = pd.read_csv("file.csv")

And the main code:
from datetime import datetime, timedelta

last_date = 0
min_val = 0
mat = []
st = 0
for index, row in df.iterrows():
    if row['VAL'] != 0:
        st = st + 1
        datetime_object = datetime.strptime(row['DATE'], '%Y-%m-%d')
        if st == 1:
            start = datetime_object
            last_date = start
            if row['VAL'] < min_val:
                min_val = row['VAL']
        else:
            if last_date + timedelta(days=1) == datetime_object:
                last_date = datetime_object
                if row['VAL'] < min_val:
                    min_val = row['VAL']
            else:
                arr = []
                arr.append(str(start.date()))
                arr.append(str(last_date.date()))
                arr.append((last_date - start).days + 1)
                arr.append(min_val)
                start = datetime_object
                last_date = datetime_object
                min_val = row['VAL']
                mat.append(arr)
arr = []
arr.append(str(start.date()))
arr.append(str(last_date.date()))
arr.append((last_date - start).days + 1)
arr.append(min_val)
mat.append(arr)
df = pd.DataFrame(mat, columns=['start', 'end', 'days', 'min'])
df
A solution that works in pandas 0.25+:
# convert DatetimeIndex to column
df = df.reset_index()
# filter values equal to 0
m = df['VAL'].eq(0)
# create groups only for non-zero rows, filtering with the inverted mask ~m
g = m.ne(m.shift()).cumsum()[~m]
# aggregate by group
df1 = df.groupby(g).agg(START=('DATE', 'first'),
                        END=('DATE', 'last'),
                        DAYS=('DATE', 'size'),
                        MIN=('VAL', 'min')).reset_index(drop=True)
print(df1)
START END DAYS MIN
0 2007-06-29 2007-06-29 1 -0.006408
1 2007-07-04 2007-07-04 1 -0.000003
2 2007-07-10 2007-07-11 2 -0.018858
3 2007-07-16 2007-07-17 2 -0.008562
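The key line in this answer is the run-labelling idiom m.ne(m.shift()).cumsum(): the cumulative sum increments exactly where the zero mask changes value, so every consecutive run of equal mask values gets its own label. A small demonstration (toy data, not the question's frame):

```python
import pandas as pd

val = pd.Series([0.0, -1.0, -2.0, 0.0, 0.0, -3.0])
m = val.eq(0)                 # True on zeros
# m.ne(m.shift()) is True at every position where the mask flips,
# so the cumulative sum assigns one integer label per run
g = m.ne(m.shift()).cumsum()
print(g.tolist())             # [1, 2, 2, 3, 3, 4]
# keeping only the non-zero rows leaves the labels of the negative runs
print(g[~m].tolist())         # [2, 2, 4]
```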
A pandas solution with logic somewhat similar to the earlier one (Allen's), but with fewer apply calls. Not sure how the performance compares.
# a new group begins when previous value is 0, but current is negative
df['NEW_GROUP'] = df['VAL'].shift(1) == 0
df['NEW_GROUP'] &= df['VAL'] < 0
# Group by the running count of new-group starts, which gives each run its group number.
# Return a Series directly from `apply` to avoid further transformations
print(df.loc[df['VAL'] < 0]
.groupby(df['NEW_GROUP'].cumsum())
.apply(lambda x: pd.Series([x.DATE.iloc[0], x.DATE.iloc[-1], x.VAL.min(), len(x)],
index=['START','END','MIN','DAYS'])))
A numpy solution, where df is your sample dataframe:
# get data to numpy
date = df.index.to_numpy(dtype='M8[D]')
val = df['VAL'].to_numpy()
# find switches between zero/nonzero
on, off = np.diff(val != 0.0, prepend=False, append=False).nonzero()[0].reshape(-1, 2).T
# use switch points to calculate all desired quantities
out = pd.DataFrame({'START': date[on],
                    'END': date[off - 1],
                    'DAYS': np.busday_count(date[on], date[off - 1]) + 1,
                    'MIN': np.minimum.reduceat(val, on)})
# admire
out
#        START        END  DAYS       MIN
# 0 2007-06-29 2007-06-29     1 -0.006408
# 1 2007-07-04 2007-07-04     1 -0.000003
# 2 2007-07-10 2007-07-11     2 -0.018858
# 3 2007-07-16 2007-07-17     2 -0.008562
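Two NumPy pieces in this answer are worth unpacking: np.diff on the boolean mask marks every zero/non-zero boundary (the prepend/append arguments need NumPy 1.16+), and np.minimum.reduceat takes a segment minimum starting at each switch-on index. A toy run:

```python
import numpy as np

val = np.array([0.0, -1.0, -2.0, 0.0, 0.0, -3.0])
mask = val != 0.0
# padding both ends with False guarantees an even number of switch points,
# which then pair up as (start, one-past-end) of each non-zero run
edges = np.diff(mask, prepend=False, append=False).nonzero()[0]
on, off = edges.reshape(-1, 2).T
print(on.tolist(), off.tolist())     # [1, 5] [3, 6]
# np.minimum.reduceat reduces each segment starting at an `on` index;
# trailing zeros cannot win because every value is <= 0
print(np.minimum.reduceat(val, on))  # [-2. -3.]
```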
Comments:

- Thank you. To clarify, the VALUE_DATE column is an index; for example, calling df.VALUE_DATE raises an error, while df.index returns successfully.
- The sample df has no column named VALUE_DATE. Do you mean the date column is actually called VALUE_DATE and is the index rather than a column?
- Yes, the label is VALUE_DATE in the actual df; I simplified it to DATE for the example. Thanks.
- To be clear, doing this with a loop is straightforward; I am looking for a solution that does not require one. I tried the solution above and got an error about a missing argument: TypeError: aggregate() missing 1 required positional argument: 'arg'.
- @Chris, which version are you on? This is meant to work in pandas 0.25+.
- Looks like I have 0.23.4. Is there a clean workaround? It is easy enough to update pandas, but I am generally cautious about updates in case something unexpected breaks.
- Not to overstate it, but mostly I just prefer a succinct solution; the loop is not especially dense. Thanks.
- If I try to apply this to the index, the first line raises AttributeError: 'DatetimeIndex' object has no attribute 'to_numpy'; if I reset the index and apply it to the DATE column instead, I get AttributeError: 'Series' object has no attribute 'to_numpy'.
- That is probably a version issue. In older versions you would use .values (no parentheses!) instead of .to_numpy(), and then set the dtype manually, i.e. date = df.index.values.astype('M8[D]').
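A version-agnostic sketch of that fallback (to_numpy appeared in pandas 0.24; on older versions .values plus an explicit cast produces the same array):

```python
import pandas as pd

idx = pd.date_range('2007-06-26', periods=3)
# pandas >= 0.24
new_way = idx.to_numpy(dtype='M8[D]')
# older pandas: .values (an attribute, no parentheses), then cast manually
old_way = idx.values.astype('M8[D]')
print((new_way == old_way).all())  # -> True
```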
Output of the NEW_GROUP solution above:

               START        END       MIN  DAYS
NEW_GROUP
1         2007-06-29 2007-06-29 -0.006408     1
2         2007-07-04 2007-07-04 -0.000003     1
3         2007-07-10 2007-07-11 -0.018858     2
4         2007-07-16 2007-07-17 -0.008562     2