Python aggregation over time series

python python-3.x python-2.7 pandas numpy

I have a DataFrame df like this:

project_ID country   prj_start  prj_end  revenue   profit
 2131      USA       201603     201703   100000     30000
 5124      UK        201502     201606   1500       1000 
 1245      UK        201010     201710   1800       1000

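I want to find the number of active projects for each month and country, along with the sum of their revenue and profit. The output should look like this: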

Month   country   active_projects   revenue profit
201603  USA         15            500000  100000
201603  UK          20            150000  100000
201604  Germany     30            1000000 500000

My first programming language was C++, so I tend to do things with loops. I almost managed to find a solution by creating month slots like this:

import datetime
import pandas as pd

# monthlist holds one row per month slot, with a count column for the number of active projects
monthlist = pd.DataFrame(columns=["months", "count"])

# newdf collects the results
newdf = pd.DataFrame(columns=["month", "country", "active_prj_count", "rev", "gp"])

# build the month slots as yyyymm integers; future months are not needed
monthlist['months'] = pd.date_range(start=min(df['prj_start']), end=datetime.date.today(), freq='M').map(lambda x: 100*x.year + x.month)
monthlist['count'] = 0

# walk through the original dataframe and monthlist, inserting a new row into newdf
# every time the month slot lies between the project start and the project end
i = 0
for y in range(len(df)):
    for x in range(len(monthlist)):
        if (df.loc[y, 'prj_start'] <= monthlist.loc[x, 'months']) and (df.loc[y, 'prj_end'] >= monthlist.loc[x, 'months']):
            monthlist.loc[x, 'count'] = monthlist.loc[x, 'count'] + 1
            newdf.loc[i] = [monthlist.loc[x, 'months'], df.loc[y, 'country'],
                            monthlist.loc[x, 'count'], df.loc[y, 'revenue'], df.loc[y, 'profit']]
            i = i + 1

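This solution works, but I have to admit it is neither clever nor computationally efficient, and it takes a while to run. Can anyone improve the code using pandas or numpy functions?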

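OK, how about something like this (it depends on how you compute the monthly profit; this is just an example):

You can apply a function to each row that expands the project into one row per month in which it is active, and then aggregate by month and country.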

>>> df 

   project_ID country  prj_start  prj_end  revenue  profit
0        2131     USA     201603   201703   100000   30000
1        5124      UK     201502   201606     1500    1000
2        1245      UK     201010   201710     1800    1000

Let's add a few more samples so that several countries appear in each month:

>>>  df_new = pd.DataFrame([
                [1111, 'Germany',201603, 201703,1000, 4000],
                [4111, 'Germany',201603, 201703,4000, 6000],
                [3112, 'Germany',201010, 201703,4000, 6000],
                [2112, 'Germany',201603, 201703,4000, 6000],
                [2116, 'Germany',201502, 201710,4000, 6000]],
                columns=df.columns)

>>> df_new

   project_ID  country  prj_start  prj_end  revenue  profit
0        1111  Germany     201603   201703     1000    4000
1        4111  Germany     201603   201703     4000    6000
2        3112  Germany     201010   201703     4000    6000
3        2112  Germany     201603   201703     4000    6000
4        2116  Germany     201502   201710     4000    6000

>>> df_ = pd.concat([df,df_new],axis=0,ignore_index=True)

   project_ID  country  prj_start  prj_end  revenue  profit
0        2131      USA     201603   201703   100000   30000
1        5124       UK     201502   201606     1500    1000
2        1245       UK     201010   201710     1800    1000
3        1111  Germany     201603   201703     1000    4000
4        4111  Germany     201603   201703     4000    6000
5        3112  Germany     201010   201703     4000    6000
6        2112  Germany     201603   201703     4000    6000
7        2116  Germany     201502   201710     4000    6000

Convert prj_start and prj_end to datetime, specifying the format to parse, format="%Y%m":

>>> df_[['prj_start','prj_end']] =  df_[['prj_start','prj_end']].apply(pd.to_datetime, format="%Y%m")

>>> df_ 

   project_ID  country  prj_start    prj_end  revenue  profit
0        2131      USA 2016-03-01 2017-03-01   100000   30000
1        5124       UK 2015-02-01 2016-06-01     1500    1000
2        1245       UK 2010-10-01 2017-10-01     1800    1000
3        1111  Germany 2016-03-01 2017-03-01     1000    4000
4        4111  Germany 2016-03-01 2017-03-01     4000    6000
5        3112  Germany 2010-10-01 2017-03-01     4000    6000
6        2112  Germany 2016-03-01 2017-03-01     4000    6000
7        2116  Germany 2015-02-01 2017-10-01     4000    6000

Now let's define a function that transforms a row, and apply it:

def transform_row(row):
    date_index = pd.date_range(row['prj_start'].min(),
                               row['prj_end'].max(), freq='MS') 

    row_out = pd.DataFrame(np.repeat(row.values, 
                                     len(date_index.values),axis=0), 
                           index=date_index, columns=row.columns)
    row_out.index.name = 'date'
    return row_out.reset_index()

df_transformed = pd.concat([transform_row(row.to_frame().T) 
                            for i,row in df_.iterrows()],axis=0)

Then, finally, apply pivot_table to aggregate the values by country and date:

df1 = pd.pivot_table(df_transformed, 
                     index=['date','country'],
                     values=['revenue','profit'],
                     aggfunc=np.sum,fill_value=0)

df2 = pd.pivot_table(df_transformed,
                     index=['date','country'],
                     values=['project_ID'],
                     aggfunc=len,fill_value=0)

Finally, concatenate the DataFrames to get the data by month:

pd.concat([df1,df2],axis=1)

                    profit  revenue  project_ID
date       country                             
2010-10-01 Germany    6000     4000           1
           UK         1000     1800           1
2010-11-01 Germany    6000     4000           1
           UK         1000     1800           1
2010-12-01 Germany    6000     4000           1
           UK         1000     1800           1
2011-01-01 Germany    6000     4000           1
           UK         1000     1800           1
2011-02-01 Germany    6000     4000           1
           UK         1000     1800           1
2011-03-01 Germany    6000     4000           1
           UK         1000     1800           1
2011-04-01 Germany    6000     4000           1
           UK         1000     1800           1
2011-05-01 Germany    6000     4000           1
           UK         1000     1800           1
2011-06-01 Germany    6000     4000           1
           UK         1000     1800           1
2011-07-01 Germany    6000     4000           1
           UK         1000     1800           1
2011-08-01 Germany    6000     4000           1
           UK         1000     1800           1
2011-09-01 Germany    6000     4000           1
           UK         1000     1800           1
2011-10-01 Germany    6000     4000           1
           UK         1000     1800           1
2011-11-01 Germany    6000     4000           1
           UK         1000     1800           1
2011-12-01 Germany    6000     4000           1
           UK         1000     1800           1
...                    ...      ...         ...
2016-10-01 USA       30000   100000           1
2016-11-01 Germany   28000    17000           5
           UK         1000     1800           1
           USA       30000   100000           1
2016-12-01 Germany   28000    17000           5
           UK         1000     1800           1
           USA       30000   100000           1
2017-01-01 Germany   28000    17000           5
           UK         1000     1800           1
           USA       30000   100000           1
2017-02-01 Germany   28000    17000           5
           UK         1000     1800           1
           USA       30000   100000           1
2017-03-01 Germany   28000    17000           5
           UK         1000     1800           1
           USA       30000   100000           1
2017-04-01 Germany    6000     4000           1
           UK         1000     1800           1
2017-05-01 Germany    6000     4000           1
           UK         1000     1800           1
2017-06-01 Germany    6000     4000           1
           UK         1000     1800           1
2017-07-01 Germany    6000     4000           1
           UK         1000     1800           1
2017-08-01 Germany    6000     4000           1
           UK         1000     1800           1
2017-09-01 Germany    6000     4000           1
           UK         1000     1800           1
2017-10-01 Germany    6000     4000           1
           UK         1000     1800           1
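The result above is indexed by date and country, while the question asks for a flat table with a yyyymm Month column and an active_projects count. A minimal sketch of that reshaping follows. It assumes the concatenated result is stored in a variable, here called monthly (a name that does not appear in the answer), and uses four rows copied from the printed output as stand-in data:

```python
import pandas as pd

# Stand-in for the concatenated pivot tables: four rows copied from the
# printed output above. The name `monthly` is an assumption for illustration.
index = pd.MultiIndex.from_tuples(
    [
        (pd.Timestamp("2017-03-01"), "Germany"),
        (pd.Timestamp("2017-03-01"), "UK"),
        (pd.Timestamp("2017-03-01"), "USA"),
        (pd.Timestamp("2017-04-01"), "Germany"),
    ],
    names=["date", "country"],
)
monthly = pd.DataFrame(
    {
        "profit": [28000, 1000, 30000, 6000],
        "revenue": [17000, 1800, 100000, 4000],
        "project_ID": [5, 1, 1, 1],
    },
    index=index,
)

# project_ID holds the per-group count, so rename it to what it represents,
# then turn the (date, country) index back into ordinary columns.
out = monthly.rename(columns={"project_ID": "active_projects"}).reset_index()

# Encode the month as a yyyymm integer, as in the question.
out["Month"] = out["date"].dt.year * 100 + out["date"].dt.month

# Column order requested in the question.
out = out[["Month", "country", "active_projects", "revenue", "profit"]]
print(out)
```

Keeping the date as a real datetime until this last step makes the month arithmetic in date_range straightforward; the integer encoding is only applied for presentation.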

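How big is your DataFrame? If it is not very large, one solution is to create a new column containing a list of all the months between start and end. Then expand the DataFrame so that every month of every project gets its own row. Then, finally, do a simple groupby.

Are your month names really that irregular? You seem to have a mix of abbreviated and written-out names. I don't understand why they would be ordered like that (you seem to compare them with

@Graipher I only converted the dates to integers containing the month and year, via the map function. That makes the comparison simpler. So the months actually look like 201005, 201507 (yyyymm). I will edit that into the original post.

This will get you most of the way, but you need to account for projects that span multiple months. df.groupby(['pr_start', 'country']).agg({'projectid': 'count', 'revenue': 'sum', 'profit': 'sum'}).rename(columns={'projectid': 'activeprojects'})

I just realized that what I said is basically the same as @jp_data_analysis. Once you have split each month into its own row, run the groupby I showed. You're almost there!

Thanks for looking into this, but I don't think you understood the question correctly. A project is active in all the months between its start date and end date. So I want to count all the projects active in each month for each country, and sum their financials accordingly. Hope this clarifies it. For example, in the DataFrame you created, 201604 should count 4 active projects for Germany. Don't worry about creating monthly values for the financials; I have already done that. I just need to figure out the rest.

Take a look now. If the DataFrame is not too large, it should work.

What is the df_grouped DataFrame? Or is it just a typo for df_transformed?

Works like a charm! You're my personal hero. I would upvote twice if I could :)

Always happy to help :)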
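The approach suggested in the comments (list the months of each project, expand to one row per project-month, then group) can be sketched as below. This is an illustration rather than code from the thread: the helper month_list is hypothetical, the sample data is copied from the question, and it assumes pandas 0.25 or later for DataFrame.explode and named aggregation. As in the answer, the full project revenue and profit are summed in every active month.

```python
import pandas as pd

# Sample data copied from the question (dates as yyyymm integers).
df = pd.DataFrame({
    "project_ID": [2131, 5124, 1245],
    "country": ["USA", "UK", "UK"],
    "prj_start": [201603, 201502, 201010],
    "prj_end": [201703, 201606, 201710],
    "revenue": [100000, 1500, 1800],
    "profit": [30000, 1000, 1000],
})


def month_list(start, end):
    """Hypothetical helper: every yyyymm integer from start to end, inclusive."""
    # Convert yyyymm to a running month number so that year boundaries are handled.
    first = (start // 100) * 12 + (start % 100) - 1
    last = (end // 100) * 12 + (end % 100) - 1
    return [(m // 12) * 100 + (m % 12) + 1 for m in range(first, last + 1)]


# One list of active months per project, then one row per (project, month).
expanded = df.assign(
    Month=[month_list(s, e) for s, e in zip(df["prj_start"], df["prj_end"])]
).explode("Month")
expanded["Month"] = expanded["Month"].astype(int)

# Count the active projects and sum the financials per month and country.
result = expanded.groupby(["Month", "country"], as_index=False).agg(
    active_projects=("project_ID", "count"),
    revenue=("revenue", "sum"),
    profit=("profit", "sum"),
)
print(result[result["Month"] == 201603])
```

The expanded frame has one row per project per active month, so memory grows with the total number of project-months; that is why the comments add the caveat about the DataFrame not being too large.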