Python: finding relative and absolute fluctuation counts in a DataFrame where each row contains a time series

python pandas

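I have a DataFrame holding a table of financial time series. Each row has the following columns:

- the `ID` of that time series
- a `Target` value (the value against which we want to measure deviation, both relative and absolute)
- the time series values for various dates: `1/01`, `1/02`, `1/03`, …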

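We want to compute the relative and absolute fluctuation counts of the time series for each row/ID. Then we want to find which row/ID has the most fluctuations/spikes, as follows:

First, we take the difference between two time series values and estimate a threshold. The threshold is how far apart two values may be before we declare a "fluctuation" or "spike". If the difference between the values of any two columns is higher than the threshold you set, it is a spike.

However, we need to make sure the threshold is generic and works for both % and absolute differences between any two values in any row.

So basically, we pick a threshold in percentage form (an educated guess), because one row has values expressed in "%" form. In addition, the "%" form also handles absolute values properly.

For each row/ID, the output should be a new fluctuation count column (`FCount`) that covers both the relative and the absolute count.

Code: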

Try a combination of `loc`, `iloc`, `sub`, `abs`, `sum` and `idxmin`:

    print(df.loc[df.iloc[:, 4:].sub(df['Target'].tolist(), axis='rows').abs().sum(1).idxmin(), 'ID'])

Output:

    D1

Explanation:

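I first take the columns from position 4 onward (`iloc[:, 4:]`, i.e. the date columns), then subtract the corresponding `Target` value from each row.

Then I take the absolute value, so `-1.1` becomes `1.1` and `1.1` stays `1.1`. After that, `sum` adds up each row and `idxmin` picks the row with the smallest total.

Finally, `loc` uses that index to look up the row in the actual DataFrame and returns its `ID` column, which gives `D1`.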
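
As a minimal sketch of the one-liner above, here is a self-contained version that runs against the sample data from the question. One assumption is mine and not confirmed by the source: the `Target` entry `'0.9%'` is treated as the plain number `0.9` by stripping the `%` sign, since subtraction does not work on a string. The names `target`, `deviation` and `closest_id` are only illustrative.

```python
import pandas as pd

# Sample data as given in the question
df = pd.DataFrame({
    'ID': ['A1', 'B1', 'C1', 'D1'],
    'Domain': ['Finance', 'IT', 'IT', 'Finance'],
    'Target': [1, 2, 3, '0.9%'],
    'Criteria': ['<=', '<=', '>=', '>='],
    '1/01': [0.9, 1.1, 2.1, 1.0],
    '1/02': [0.4, 0.3, 0.5, 0.9],
    '1/03': [1.0, 1.0, 4.0, 1.1],
    '1/04': [0.7, 0.7, 0.1, 0.7],
    '1/05': [0.7, 0.7, 0.1, 1.0],
    '1/06': [0.9, 1.1, 2.1, 0.6],
})

# ASSUMPTION: '0.9%' is read as 0.9 (the '%' sign is simply dropped)
target = pd.to_numeric(df['Target'].astype(str).str.rstrip('%'))

# Total absolute deviation of each row's date values from its Target
deviation = df.iloc[:, 4:].sub(target, axis='index').abs().sum(axis=1)

# ID of the row whose values stay closest to its Target overall
closest_id = df.loc[deviation.idxmin(), 'ID']
print(closest_id)
```

Note that this measures the total distance from `Target`, which is a different quantity from the fluctuation count between consecutive dates discussed in the other answers.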

Given

    # importing pandas as pd
    import pandas as pd
    import numpy as np

    # Create sample dataframe
    raw_data = {'ID': ['A1', 'B1', 'C1', 'D1'],
                'Domain': ['Finance', 'IT', 'IT', 'Finance'],
                'Target': [1, 2, 3, '0.9%'],
                'Criteria': ['<=', '<=', '>=', '>='],
                "1/01": [0.9, 1.1, 2.1, 1],
                "1/02": [0.4, 0.3, 0.5, 0.9],
                "1/03": [1, 1, 4, 1.1],
                "1/04": [0.7, 0.7, 0.1, 0.7],
                "1/05": [0.7, 0.7, 0.1, 1],
                "1/06": [0.9, 1.1, 2.1, 0.6],}
    df = pd.DataFrame(raw_data, columns = ['ID', 'Domain', 'Target', 'Criteria',
                                           '1/01', '1/02', '1/03', '1/04', '1/05', '1/06'])

Then we can use `np.diff` to easily get the differences between consecutive dates on the underlying array (`date_columns` is the list of date column labels). We also take the absolute value, since that is what we are interested in.

    np.abs(np.diff(df[date_columns].values))
    #Output:
    array([[0.5, 0.6, 0.3, 0. , 0.2],
           [0.8, 0.7, 0.3, 0. , 0.4],
           [1.6, 3.5, 3.9, 0. , 2. ],
           [0.1, 0.2, 0.4, 0.3, 0.4]])

Now, worrying only about the absolute threshold, it is as simple as checking whether each difference is greater than a limit.

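    abs_threshold = 0.5
    np.abs(np.diff(df[date_columns].values)) > abs_threshold
    #Output:
    array([[False,  True, False, False, False],
           [ True,  True, False, False, False],
           [ True,  True,  True, False,  True],
           [False, False, False, False, False]])

We can see that the sum over each row of this array gives the result we need (a sum over a boolean array uses the underlying True=1 and False=0, so you are effectively counting how many Trues are present). For the percentage threshold, we just need one extra step: dividing all the differences by the original values before comparing. Putting it all together gives the `get_FCount` function listed with the rest of the code further down.

To elaborate:

We can see how the sum of each row gives the count of values that cross the absolute threshold, as follows.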

    abs_fluctuations = np.abs(np.diff(df[date_columns].values)) > abs_threshold
    print(abs_fluctuations.sum(-1))
    #Output:
    [1 2 4 0]

Moving on to the relative threshold, we can create the same differences array as before.

    dates = df[date_columns].values #same as before, but just assigned
    differences = np.abs(np.diff(dates)) #same as before, just assigned
    pct_threshold=0.5 #aka 50%
    print(differences.shape) #(4, 5) aka 4 rows, 5 columns if you want to think traditional tabular 2D shapes only
    print(dates.shape) #(4, 6) 4 rows, 6 columns

Now, note that the differences array has one column fewer, which makes sense: for 6 dates there are 5 "differences", one for each gap.

Now, focusing on just one row, we can see that calculating the percentage change is simple.

    print(dates[0][:2]) #for first row[0], take the first two dates[:2]
    #Output:
    [0.9 0.4]
    print(differences[0][0]) #for first row[0], take the first difference[0]
    #Output:
    0.5
Going from 0.9 to 0.4 is a change of 0.5 in absolute terms. In percentage terms, it is a change of 0.5/0.9 (difference/original) * 100 (I have left out the multiplication by 100 to keep things simple), i.e. 55.555%, or 0.5555.

The main thing to realize at this step is that we need to divide every difference by its "original" value to get the percentage change. However, the dates array has one "column" too many. So we take a simple slice.

    dates[:,:-1] #For all rows(:,), take all columns except the last one(:-1).
    #Output:
    array([[0.9, 0.4, 1. , 0.7, 0.7],
           [1.1, 0.3, 1. , 0.7, 0.7],
           [2.1, 0.5, 4. , 0.1, 0.1],
           [1. , 0.9, 1.1, 0.7, 1. ]])

Now I can compute the relative (percentage) changes with an element-wise division.

    relative_differences = differences / dates[:,:-1]

Then, same as before: pick a threshold and see whether it is crossed.

    rel_fluctuations = relative_differences > pct_threshold
    #Output:
    array([[ True,  True, False, False, False],
           [ True,  True, False, False,  True],
           [ True,  True,  True, False,  True],
           [False, False, False, False, False]])

Now, if we want to count a value whenever either the absolute or the relative threshold is crossed, we just take a bitwise OR `|` (the "or" is even in the sentence!) and then sum along the rows.

Putting all of this together, we can create a ready-to-use function. Note that there is nothing special about functions: they are just a way of grouping lines of code together for ease of use. Using one is as simple as calling it, and you have been using functions/methods all along without realizing it.

    date_columns = ['1/01', '1/02', '1/03', '1/04', '1/05', '1/06'] #if hardcoded.
    date_columns = df.columns[4:] #if you wish to assign dynamically, and all dates start from the 5th column.

    def get_FCount(df, date_columns, abs_threshold=0.5, pct_threshold=0.5):
        '''Expects a list of date columns with at least two values.
        Returns a 1D array, with FCounts for every row.
        pct_threshold: percentage, where 1 means 100%
        '''
        dates = df[date_columns].values
        differences = np.abs(np.diff(dates))
        abs_fluctuations = differences > abs_threshold
        rel_fluctuations = differences / dates[:,:-1] > pct_threshold
        #we take a bitwise OR, since we are concerned with values that cross even one of the thresholds.
        return (abs_fluctuations | rel_fluctuations).sum(-1)

    df['FCount'] = get_FCount(df, date_columns) #call our function, and assign the result array to a new column
    print(df['FCount'])
    #Output:
    0    2
    1    3
    2    4
    3    0
    Name: FCount, dtype: int32

Assuming you want `pct_change()` across all columns in a row with a threshold, you can also try `pct_change()` with `axis=1`:

    thresh_=0.5
    s=pd.to_datetime(df.columns,format='%d/%m',errors='coerce').notna() #all date cols
    df=df.assign(Count=df.loc[:,s].pct_change(axis=1).abs().gt(thresh_).sum(axis=1))

Or:

    df.assign(Count=df.iloc[:,4:].pct_change(axis=1).abs().gt(0.5).sum(axis=1))

       ID   Domain  Target Criteria  1/01  1/02  1/03  1/04  1/05  1/06  Count
    0  A1  Finance     1.0       <=   0.9   0.4   1.0   0.7   0.7   0.9      2
    1  B1       IT     2.0       <=   1.1   0.3   1.0   0.7   0.7   1.1      3
    2  C1       IT     3.0       >=   2.1   0.5   4.0   0.1   0.1   2.1      4
    3  D1  Finance     0.9       >=   1.0   0.9   1.1   0.7   1.0   0.6      0

Here is a much cleaner idiom, which improves on @Paritossingh's version. It is cleaner to keep two separate DataFrames:

- a `ts` (metadata) DataFrame for the timeseries columns 'ID', 'Domain', 'Target' and 'Criteria'
- a `values` DataFrame for the timeseries values (or "dates", as the OP keeps calling them)

and to use ID as the common index of both DataFrames. Now you get seamless merge/join, and the same goes for any results, such as the one returned by `compute_FCount_df()`.

Now there is no need to pass ugly lists of column names or indices into `compute_FCount_df()`. As mentioned in the comments, this is a much better way to deduplicate. The code is at the bottom.

Doing this reduces `compute_FCount_df` to a few lines (I improved @Paritossingh's version to use the pandas built-in `df.diff(axis=1)`, followed by pandas `.abs()`):

    def compute_FCount_df(dat, abs_threshold=0.5, pct_threshold=0.5):
        """Compute FluctuationCount for all timeseries/rows"""
        differences = dat.diff(axis=1).iloc[:, 1:].abs()
        abs_fluctuations = differences > abs_threshold
        # Divide by the previous value. shift(axis=1) keeps the column labels aligned with
        # `differences`; dividing by dat.iloc[:, :-1] would align on labels and use the wrong column.
        rel_fluctuations = differences / dat.shift(axis=1).iloc[:, 1:] > pct_threshold
        return (abs_fluctuations | rel_fluctuations).sum(axis=1)

    #ts['FCount']
    fcounts = compute_FCount_df(values)

    >>> fcounts
    A1    2
    B1    3
    C1    4
    D1    0

    >>> fcounts.idxmax()
    'C1'

The same computation for a single time series (row), applied row by row:

    def compute_FCount_ts(dat, abs_threshold=0.5, pct_threshold=0.5):
        """Compute FluctuationCount for single timeseries (row)"""
        differences = dat.diff().iloc[1:].abs()
        abs_fluctuations = differences > abs_threshold
        rel_fluctuations = differences / dat.shift().iloc[1:] > pct_threshold
        return (abs_fluctuations | rel_fluctuations).sum()

    values.apply(compute_FCount_ts, axis=1)

Setup code for the two DataFrames:

    import pandas as pd
    import numpy as np

    ts = pd.DataFrame(index=['A1', 'B1', 'C1', 'D1'], data={
        'Domain': ['Finance', 'IT', 'IT', 'Finance'],
        'Target': [1, 2, 3, '0.9%'],
        'Criteria': ['<=', '<=', '>=', '>=']})

    values = pd.DataFrame(index=['A1', 'B1', 'C1', 'D1'], data={
        "1/01": [0.9, 1.1, 2.1, 1],
        "1/02": [0.4, 0.3, 0.5, 0.9],
        "1/03": [1, 1, 4, 1.1],
        "1/04": [0.7, 0.7, 0.1, 0.7],
        "1/05": [0.7, 0.7, 0.1, 1],
        "1/06": [0.9, 1.1, 2.1, 0.6]})
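
The question asks for both a relative and an absolute count per ID, and for the ID with the most fluctuations, while the functions above return only the combined count. The following is a minimal sketch of one way to report all three, using the same 0.5 thresholds and the same difference/original definition. The helper name `fluctuation_counts` and the column names `AbsCount` and `RelCount` are hypothetical, chosen only for this illustration, and the sketch assumes every "original" value is non-zero.

```python
import numpy as np
import pandas as pd

# Time series values indexed by ID, as in the setup code above
values = pd.DataFrame(
    index=['A1', 'B1', 'C1', 'D1'],
    data={
        '1/01': [0.9, 1.1, 2.1, 1.0],
        '1/02': [0.4, 0.3, 0.5, 0.9],
        '1/03': [1.0, 1.0, 4.0, 1.1],
        '1/04': [0.7, 0.7, 0.1, 0.7],
        '1/05': [0.7, 0.7, 0.1, 1.0],
        '1/06': [0.9, 1.1, 2.1, 0.6],
    },
)


def fluctuation_counts(values, abs_threshold=0.5, pct_threshold=0.5):
    """Hypothetical helper: absolute, relative and combined counts per row."""
    arr = values.to_numpy(dtype=float)
    # Absolute change between consecutive dates (one column fewer than arr)
    differences = np.abs(np.diff(arr, axis=1))
    abs_hits = differences > abs_threshold
    # Relative change: difference divided by the earlier ("original") value.
    # ASSUMPTION: no original value is zero.
    rel_hits = differences / arr[:, :-1] > pct_threshold
    return pd.DataFrame(
        {
            'AbsCount': abs_hits.sum(axis=1),
            'RelCount': rel_hits.sum(axis=1),
            'FCount': (abs_hits | rel_hits).sum(axis=1),
        },
        index=values.index,
    )


counts = fluctuation_counts(values)
most_volatile = counts['FCount'].idxmax()  # ID with the highest combined count
print(counts)
print(most_volatile)
```

Because `counts` shares the ID index with `ts` and `values`, it could be joined back onto either of them, for example with `ts.join(counts)`.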