Python: calculate the mean of data rows in a dataframe with date headers, dictated by a 'datetime' column

python pandas datetime

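I have a dataframe with customer IDs and their expenses for 2014-2018. What I want is the mean of the expenses per ID, but only the years before a certain date may be taken into account when calculating the mean (so the 'Date' column dictates which columns can be used for the mean).

Example: for index 0 (ID: 12), the date is '2016-03-08', so the mean should be taken from the columns 'y_2014' and 'y_2015', which gives 111.0 for this index. If the date is too early (in this case, sometime in 2014 or earlier), NaN should be returned (see index 6 and 9).

Initial dataframe: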

   y_2014  y_2015  y_2016  y_2017  y_2018        Date  ID
0   100.0   122.0     324     632     NaN  2016-03-08  12
1   120.0   159.0      54     452   541.0  2015-04-09  96
2     NaN   164.0     687     165   245.0  2016-02-15  20
3   180.0   421.0     512     184   953.0  2018-05-01  73
4   110.0   654.0     913     173   103.0  2017-08-04  84
5   130.0     NaN     754     124   207.0  2016-07-03  26
6   170.0   256.0     843      97   806.0  2013-02-04  87
7   140.0   754.0      95     101   541.0  2016-06-08  64
8    80.0   985.0     184      84    90.0  2019-03-05  11
9    96.0    65.0     127     130   421.0  2014-05-14  34

Desired output:

   y_2014  y_2015  y_2016  y_2017  y_2018        Date  ID    mean
0   100.0   122.0     324     632     NaN  2016-03-08  12   111.0
1   120.0   159.0      54     452   541.0  2015-04-09  96   120.0
2     NaN   164.0     687     165   245.0  2016-02-15  20   164.0
3   180.0   421.0     512     184   953.0  2018-05-01  73  324.25
4   110.0   654.0     913     173   103.0  2017-08-04  84   559.0
5   130.0     NaN     754     124   207.0  2016-07-03  26   130.0
6   170.0   256.0     843      97   806.0  2013-02-04  87     NaN
7   140.0   754.0      95     101   541.0  2016-06-08  64   447.0
8    80.0   985.0     184      84    90.0  2019-03-05  11   284.6
9    96.0    65.0     127     130   421.0  2014-05-14  34     NaN

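Attempted code: I'm still working on it, as I don't really know how to begin. So far I have only set up the dataframe. Probably something has to be done with the 'datetime' package to get the desired dataframe?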

import pandas as pd
import numpy as np
import datetime

df = pd.DataFrame({"ID":   [12,96,20,73,84,26,87,64,11,34],
               "y_2014": [100,120,np.nan,180,110,130,170,140,80,96],
               "y_2015": [122,159,164,421,654,np.nan,256,754,985,65],
               "y_2016": [324,54,687,512,913,754,843,95,184,127],
               "y_2017": [632,452,165,184,173,124,97,101,84,130],
               "y_2018": [np.nan,541,245,953,103,207,806,541,90,421],
                 "Date": ['2016-03-08', '2015-04-09', '2016-02-15', '2018-05-01', '2017-08-04',
                          '2016-07-03', '2013-02-04', '2016-06-08', '2019-03-05', '2014-05-14']})
print(df)

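Because of your naming convention, you need to extract the years from the column names in order to compare them. Then you can mask the data and take the mean: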

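# the years from the column names (y_2014 -> 2014), as a NumPy array
data = df.filter(like='y_')
data_years = data.columns.str.extract(r'(\d+)')[0].astype(int).to_numpy()

# the years from Date, one per row
years = pd.to_datetime(df['Date']).dt.year.to_numpy()

# keep only the years strictly before each row's Date year, then average row-wise
df['mean'] = data.where(data_years < years[:, None]).mean(axis=1)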
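
To check the masking idea end to end, here is a minimal self-contained sketch on the sample data from the question. The names `col_years`, `row_years` and `mask` are introduced here for illustration only. It assumes a reasonably recent pandas, where `.to_numpy()` is available and the comparison is done on plain NumPy arrays so that broadcasting produces a (rows x columns) boolean mask.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [12, 96, 20, 73, 84, 26, 87, 64, 11, 34],
    "y_2014": [100, 120, np.nan, 180, 110, 130, 170, 140, 80, 96],
    "y_2015": [122, 159, 164, 421, 654, np.nan, 256, 754, 985, 65],
    "y_2016": [324, 54, 687, 512, 913, 754, 843, 95, 184, 127],
    "y_2017": [632, 452, 165, 184, 173, 124, 97, 101, 84, 130],
    "y_2018": [np.nan, 541, 245, 953, 103, 207, 806, 541, 90, 421],
    "Date": ["2016-03-08", "2015-04-09", "2016-02-15", "2018-05-01", "2017-08-04",
             "2016-07-03", "2013-02-04", "2016-06-08", "2019-03-05", "2014-05-14"],
})

# Expense columns and the year encoded in each column name
data = df.filter(like="y_")
col_years = data.columns.str.extract(r"(\d+)")[0].astype(int).to_numpy()

# Year of each row's Date
row_years = pd.to_datetime(df["Date"]).dt.year.to_numpy()

# mask[i, j] is True when column j's year is strictly before row i's Date year
mask = col_years < row_years[:, None]

# Values outside the mask become NaN and are skipped by mean();
# a row with no eligible year is all NaN and therefore yields NaN
df["mean"] = data.where(mask).mean(axis=1)
print(df[["ID", "Date", "mean"]])
```

Rows 6 and 9 come out as NaN because no column year precedes 2013 or 2014, which matches the desired output in the question.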

Another answer:

import pandas as pd 
import numpy as np  

df = pd.DataFrame({"ID":   [12,96,20,73,84,26,87,64,11,34],                  
               "y_2014": [100,120,np.nan,180,110,130,170,140,80,96],    
               "y_2015": [122,159,164,421,654,np.nan,256,754,985,65],                  
               "y_2016": [324,54,687,512,913,754,843,95,184,127],   
               "y_2017": [632,452,165,184,173,124,97,101,84,130],                  
               "y_2018": [np.nan,541,245,953,103,207,806,541,90,421],   
                 "Date": ['2016-03-08', '2015-04-09', '2016-02-15', '2018-05-01', '2017-08-04',                
                          '2016-07-03', '2013-02-04', '2016-06-08', '2019-03-05', '2014-05-14']})

#Subset from original df to calculate mean
subset = df.loc[:,['y_2014', 'y_2015', 'y_2016', 'y_2017', 'y_2018']] 

#an expense value is only available for the calculation of the mean when that year has passed, therefore 2015-01-01 is chosen for the 'y_2014' column in the subset etc. to check with the 'Date'-column
subset.columns = ['2015-01-01', '2016-01-01', '2017-01-01', '2018-01-01', '2019-01-01']  
s = subset.columns[0:].values < df.Date.values[:,None] 
t = s.astype(float)
t[t == 0] = np.nan 

df['mean'] = (subset.iloc[:,0:]*t).mean(1)  
print(df)

#Additionally: gives the sum of expenses before a certain date in the 'Date'-column (rows with no eligible year sum to 0.0, not NaN)
df['sum'] = (subset.iloc[:,0:]*t).sum(1)  
print(df)
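
The comparison above works on ISO-formatted date strings, which happen to sort chronologically. A hedged variant, assuming the same sample data, converts both sides to datetime values first so the comparison no longer depends on the string format. The names `expense_cols`, `cutoffs`, `eligible` and `masked` are illustrative, and `min_count=1` is an optional choice that makes the sum NaN (instead of 0.0) for rows with no eligible year.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [12, 96, 20, 73, 84, 26, 87, 64, 11, 34],
    "y_2014": [100, 120, np.nan, 180, 110, 130, 170, 140, 80, 96],
    "y_2015": [122, 159, 164, 421, 654, np.nan, 256, 754, 985, 65],
    "y_2016": [324, 54, 687, 512, 913, 754, 843, 95, 184, 127],
    "y_2017": [632, 452, 165, 184, 173, 124, 97, 101, 84, 130],
    "y_2018": [np.nan, 541, 245, 953, 103, 207, 806, 541, 90, 421],
    "Date": ["2016-03-08", "2015-04-09", "2016-02-15", "2018-05-01", "2017-08-04",
             "2016-07-03", "2013-02-04", "2016-06-08", "2019-03-05", "2014-05-14"],
})

expense_cols = ["y_2014", "y_2015", "y_2016", "y_2017", "y_2018"]

# A year's expense counts only once that year is over:
# y_2014 becomes eligible for dates after 2015-01-01, and so on
cutoffs = pd.to_datetime(
    [f"{int(col[2:]) + 1}-01-01" for col in expense_cols]
).to_numpy()
dates = pd.to_datetime(df["Date"]).to_numpy()

# eligible[i, j] is True when column j's cutoff lies strictly before row i's Date
eligible = cutoffs < dates[:, None]

# Non-eligible values become NaN and are skipped by mean() and sum()
masked = df[expense_cols].where(eligible)
df["mean"] = masked.mean(axis=1)
df["sum"] = masked.sum(axis=1, min_count=1)  # NaN when nothing is eligible
print(df)
```

Using `where` on a boolean mask avoids the intermediate float array with NaN placeholders, though multiplying by such an array gives the same mean.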

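Where does this data come from? "Probably something has to be done with the 'datetime' package to get the desired dataframe?" Pandas includes functionality for handling dates and times.

This is just some sample data for testing the operation, I can't share the actual dataset ;-) OK, I'll look into it further.

"OK, I'll look into it further." Please do so, and report back when you have a specific question.

Wow, you sure give the most useful comments here.. This forum is open to all kinds of code-related questions. This is a question I put a lot of effort into, defining the problem and providing an example. If I knew the answer, I wouldn't ask, so if this is the best answer you can give, I'd be very ;) Thank you very much.

"Wow, you sure give the most useful comments here.." You are not entitled to an answer. "If I knew the answer, I wouldn't ask." With all due respect, I'm not sure I understand what this has to do with my comment, as my comment was hardly rude or inflammatory. "This forum is open to all kinds of code-related questions." This is not a forum in the typical sense, and there are still rules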
