Python: split rows based on different rows and columns
I would really appreciate your help with this. I have a table with product, date, and amount columns. This is what the initial table looks like:
Product ID goliveyear endyear Revenue
1 2020-10 2022-02 90
1 2020-10 2022-02 140
1 2020-10 2022-02 60
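The goal is to split each row into one row per month that the row's year covers.
If the year is the first (go-live) year, split from the go-live month through the end of that year.
If the year is the end year, split from January up to the end month. Because each Revenue value in the first table refers to the whole period for that year, it has to be divided by the number of monthly rows generated for that year.
Every year in between is split into 12 rows, one per month, each carrying one twelfth of that year's revenue.
The expected result is: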
Product ID goliveyear endyear Year Month Revenue
1 2020-10 2022-02 2020 10 90/3=30
1 2020-10 2022-02 2020 11 30
1 2020-10 2022-02 2020 12 30
1 2020-10 2022-02 2021 01 140/12 =11.67
1 2020-10 2022-02 2021 02 11.67
1 2020-10 2022-02 2021 03 11.67
1 2020-10 2022-02 2021 04 11.67
... ... ... ... ... ...
1 2020-10 2022-02 2022 01 60/2 = 30
1 2020-10 2022-02 2022 02 30
Thank you all very much.
Try this:
import pandas as pd
from io import StringIO
s = """
Product ID,goliveyear,endyear,Revenue
1,2020-10,2022-02,90
1,2020-10,2022-02,140
1,2020-10,2022-02,60"""
df = pd.read_csv(StringIO(s))
# generate the list of month starts between the two months (inclusive)
df['rng'] = df.apply(lambda x: pd.date_range(x['goliveyear'], x['endyear'],
                                             freq='MS'), axis=1)
# explode the dataframe by months list
df_exploded = df.explode('rng')
df_exploded['Year'] = df_exploded['rng'].dt.year
df_exploded['Month'] = df_exploded['rng'].dt.month
# the (index, year) pairs used to filter rows
# (this pairing assumes a single product whose rows are in year order)
filter_year = list(zip(df.index, df_exploded.Year.unique()))
# used columns
use_cols = ['Product ID', 'goliveyear', 'endyear', 'Revenue', 'Month']
# filter rows
df_filter = (
    df_exploded.set_index([df_exploded.index, df_exploded.Year])
    .loc[filter_year, use_cols]
    .reset_index()
    .drop(columns='level_0')
)
# calculate the average Revenue
result = df_filter.set_index(['Year', 'Month']).assign(
    Revenue=(df_filter.groupby(['Year', 'Month'])['Revenue'].sum() /
             df_filter.groupby('Year')['Month'].count())
).reset_index()
result
Output:
Year Month Product ID goliveyear endyear Revenue
0 2020 10 1 2020-10 2022-02 30.000000
1 2020 11 1 2020-10 2022-02 30.000000
2 2020 12 1 2020-10 2022-02 30.000000
3 2021 1 1 2020-10 2022-02 11.666667
4 2021 2 1 2020-10 2022-02 11.666667
5 2021 3 1 2020-10 2022-02 11.666667
6 2021 4 1 2020-10 2022-02 11.666667
7 2021 5 1 2020-10 2022-02 11.666667
8 2021 6 1 2020-10 2022-02 11.666667
9 2021 7 1 2020-10 2022-02 11.666667
10 2021 8 1 2020-10 2022-02 11.666667
11 2021 9 1 2020-10 2022-02 11.666667
12 2021 10 1 2020-10 2022-02 11.666667
13 2021 11 1 2020-10 2022-02 11.666667
14 2021 12 1 2020-10 2022-02 11.666667
15 2022 1 1 2020-10 2022-02 30.000000
16 2022 2 1 2020-10 2022-02 30.000000
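The `filter_year` pairing above lines up row positions with the unique years, which works for the single product in the example. A minimal sketch of one way to extend the same explode-then-filter idea to several products follows. It assumes that, within each product, the i-th row belongs to the i-th calendar year after go-live; the second product, the `row` helper column, and the variable names are hypothetical and only there for illustration.

```python
import pandas as pd

# Hypothetical data: product 2 is invented for illustration.
df = pd.DataFrame({
    "Product ID": [1, 1, 1, 2, 2],
    "goliveyear": ["2020-10", "2020-10", "2020-10", "2021-11", "2021-11"],
    "endyear": ["2022-02", "2022-02", "2022-02", "2022-03", "2022-03"],
    "Revenue": [90, 140, 60, 40, 30],
})

# Assumption: row i of a product covers calendar year (go-live year + i).
start_year = pd.to_datetime(df["goliveyear"]).dt.year
df["Year"] = start_year + df.groupby("Product ID").cumcount()
df["row"] = range(len(df))  # helper key identifying the original row

# One month-start timestamp per month between go-live and end (inclusive).
df["rng"] = df.apply(
    lambda r: pd.date_range(r["goliveyear"], r["endyear"], freq="MS"), axis=1
)
out = df.explode("rng", ignore_index=True)
out["rng"] = pd.to_datetime(out["rng"])  # make sure the column is datetime

# Keep only the months that fall in the row's own year.
out = out[out["rng"].dt.year == out["Year"]].copy()
out["Month"] = out["rng"].dt.month

# Spread each original row's revenue evenly over the months it covers.
out["Revenue"] = out["Revenue"] / out.groupby("row")["Month"].transform("count")
out = out.drop(columns=["rng", "row"]).reset_index(drop=True)
print(out)
```

Rows whose assigned year falls outside the go-live/end window produce no months and simply disappear from the result, so it may be worth checking the input for such rows first.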
This takes quite a few steps.
Start by setting up df:
from io import StringIO
import pandas as pd
from datetime import datetime,timedelta
df = pd.read_csv(StringIO(
"""
Product_ID goliveyear endyear Revenue
1 2020-10 2022-02 90
1 2020-10 2022-02 140
1 2020-10 2022-02 60
"""), sep=r'\s+')  # sep=r'\s+' replaces delim_whitespace=True, which is deprecated in recent pandas
df['goliveyear'] = pd.to_datetime(df['goliveyear'])
df['endyear'] = pd.to_datetime(df['endyear'])
df
Then add the year start (ys), year end (ye), period start (ps), and period end (pe) columns:
df['ys'] = df['goliveyear'].dt.year + df.groupby('Product_ID').cumcount()
df['ye'] = df['ys'] + 1
df['ys'] = pd.to_datetime(df['ys'], format = '%Y')
df['ye'] = pd.to_datetime(df['ye'], format = '%Y')+ timedelta(days=-1)
df['ps'] = df[['goliveyear','ys']].max(axis=1)
df['pe'] = df[['endyear','ye']].min(axis=1)
which produces
Product_ID goliveyear endyear Revenue ys ye ps pe
-- ------------ ------------------- ------------------- --------- ------------------- ------------------- ------------------- -------------------
0 1 2020-10-01 00:00:00 2022-02-01 00:00:00 90 2020-01-01 00:00:00 2020-12-31 00:00:00 2020-10-01 00:00:00 2020-12-31 00:00:00
1 1 2020-10-01 00:00:00 2022-02-01 00:00:00 140 2021-01-01 00:00:00 2021-12-31 00:00:00 2021-01-01 00:00:00 2021-12-31 00:00:00
2 1 2020-10-01 00:00:00 2022-02-01 00:00:00 60 2022-01-01 00:00:00 2022-12-31 00:00:00 2022-01-01 00:00:00 2022-02-01 00:00:00
Then first add the months as a list:
df['months'] = df.apply(lambda r: [d.month for d in pd.date_range(r['ps'], r['pe'], freq='MS').to_pydatetime()], axis=1)
Output:
Product_ID goliveyear endyear Revenue ys ye ps pe months
-- ------------ ------------------- ------------------- --------- ------------------- ------------------- ------------------- ------------------- ---------------------------------------
0 1 2020-10-01 00:00:00 2022-02-01 00:00:00 90 2020-01-01 00:00:00 2020-12-31 00:00:00 2020-10-01 00:00:00 2020-12-31 00:00:00 [10, 11, 12]
1 1 2020-10-01 00:00:00 2022-02-01 00:00:00 140 2021-01-01 00:00:00 2021-12-31 00:00:00 2021-01-01 00:00:00 2021-12-31 00:00:00 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
2 1 2020-10-01 00:00:00 2022-02-01 00:00:00 60 2022-01-01 00:00:00 2022-12-31 00:00:00 2022-01-01 00:00:00 2022-02-01 00:00:00 [1, 2]
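To see why clipping the period to each calendar year gives the right month lists, the same logic can be traced with scalar timestamps for the one product in the example. This is only an illustrative sketch; the `trace` dictionary is a hypothetical name used to collect the results.

```python
import pandas as pd

golive = pd.Timestamp("2020-10-01")
end = pd.Timestamp("2022-02-01")

trace = {}
for year in range(golive.year, end.year + 1):
    ys = pd.Timestamp(year=year, month=1, day=1)    # year start
    ye = pd.Timestamp(year=year, month=12, day=31)  # year end
    ps = max(golive, ys)  # period start: the later of go-live and year start
    pe = min(end, ye)     # period end: the earlier of end and year end
    # freq="MS" yields the first day of every month inside [ps, pe]
    months = pd.date_range(ps, pe, freq="MS").month.tolist()
    trace[year] = months
    print(year, ps.date(), pe.date(), months)
```

The lengths of the three lists (3, 12, and 2) are the divisors applied to the revenue of 90, 140, and 60.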
Then we explode the months,
do the necessary calculation on the revenue, and drop the columns we no longer need:
df = df.explode('months')
df['Revenue'] = df['Revenue'] / df.groupby(['Product_ID','ys'])['months'].transform('count')
df = df.drop(columns = ['goliveyear','endyear','ye','ps','pe'])
df['ys'] = df['ys'].dt.year
to get
Product_ID Revenue ys months
-- ------------ --------- ---- --------
0 1 30 2020 10
0 1 30 2020 11
0 1 30 2020 12
1 1 11.6667 2021 1
1 1 11.6667 2021 2
1 1 11.6667 2021 3
1 1 11.6667 2021 4
1 1 11.6667 2021 5
1 1 11.6667 2021 6
1 1 11.6667 2021 7
1 1 11.6667 2021 8
1 1 11.6667 2021 9
1 1 11.6667 2021 10
1 1 11.6667 2021 11
1 1 11.6667 2021 12
2 1 30 2022 1
2 1 30 2022 2
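If the per-row `apply` becomes slow on a large table, the same result can probably be reached without building date ranges at all. The sketch below swaps in a different technique: it computes the first and last month of each row's year arithmetically and repeats rows with `Index.repeat`. It keeps the assumption that the i-th row of a product covers the i-th calendar year; `first`, `last`, and `n_months` are hypothetical names.

```python
import pandas as pd

df = pd.DataFrame({
    "Product_ID": [1, 1, 1],
    "goliveyear": pd.to_datetime(["2020-10", "2020-10", "2020-10"]),
    "endyear": pd.to_datetime(["2022-02", "2022-02", "2022-02"]),
    "Revenue": [90, 140, 60],
})

# Assumption: the i-th row of a product covers the i-th calendar year.
df["Year"] = df["goliveyear"].dt.year + df.groupby("Product_ID").cumcount()

# First and last month of the row's year that lie inside the go-live/end window.
first = df["goliveyear"].dt.month.where(df["goliveyear"].dt.year == df["Year"], 1)
last = df["endyear"].dt.month.where(df["endyear"].dt.year == df["Year"], 12)
n_months = (last - first + 1).clip(lower=0)  # guard against empty periods

# Repeat every row once per month, then number the months within each row.
out = df.loc[df.index.repeat(n_months.to_numpy())].copy()
offset = out.groupby(level=0).cumcount().to_numpy()
out["Month"] = first.loc[out.index].to_numpy() + offset
out["Revenue"] = out["Revenue"] / n_months.loc[out.index].to_numpy()
out = out.reset_index(drop=True)
print(out[["Product_ID", "Year", "Month", "Revenue"]])
```

This version also keeps `goliveyear` and `endyear` in the result, which matches the column layout requested in the question.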
I think you need to show the code you have tried so far for your problem.