Python: fastest way to compute a DataFrame column
I've run into a pandas problem that I need help with. On the one hand, I have a DataFrame that looks like this:
contributor_id timestamp edits upper_month lower_month
0 8 2018-01-01 1 2018-04-01 2018-02-01
1 26424341 2018-01-01 11 2018-04-01 2018-02-01
10 26870381 2018-01-01 465 2018-04-01 2018-02-01
22 28109145 2018-03-01 17 2018-06-01 2018-04-01
23 32769624 2018-01-01 84 2018-04-01 2018-02-01
25 32794352 2018-01-01 4 2018-04-01 2018-02-01
On the other hand, I have a given date index (available in another DataFrame):
2018-01-01, 2018-02-01, 2018-03-01, 2018-04-01, 2018-05-01, 2018-06-01, 2018-07-01, 2018-08-01, 2018-09-01, 2018-10-01, 2018-11-01, 2018-12-01.
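I need to create a pd.Series that uses the dates shown above as its index. For each date in the index, the value of the pd.Series must be computed as follows:
If date >= lower_month and date <= upper_month, the row counts toward that date, so each value is the number of rows whose range contains the date.

Use a list comprehension over the zipped lower_month/upper_month columns (converted to tuples) to test whether each date in the range falls inside each interval, and sum the results in a generator: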
rng = pd.date_range('2018-01-01', freq='MS', periods=12)
vals = list(zip(df['lower_month'], df['upper_month']))
s = pd.Series({y: sum(y >= x1 and y <= x2 for x1, x2 in vals) for y in rng})
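To make the counting rule concrete, here is a minimal sketch that runs the same comprehension on the six sample rows from the question. The DataFrame is rebuilt by hand from the table above (only the columns the calculation needs), so treat the literal values as an assumption about how the real data is typed, namely as datetimes rather than strings.

```python
import pandas as pd

# Sample rows copied from the question; dates are parsed to datetime64.
df = pd.DataFrame({
    "contributor_id": [8, 26424341, 26870381, 28109145, 32769624, 32794352],
    "lower_month": pd.to_datetime(
        ["2018-02-01", "2018-02-01", "2018-02-01",
         "2018-04-01", "2018-02-01", "2018-02-01"]),
    "upper_month": pd.to_datetime(
        ["2018-04-01", "2018-04-01", "2018-04-01",
         "2018-06-01", "2018-04-01", "2018-04-01"]),
})

# Month-start index for all of 2018, as in the answer.
rng = pd.date_range("2018-01-01", freq="MS", periods=12)

# Pair up the interval bounds once, outside the loop.
vals = list(zip(df["lower_month"], df["upper_month"]))

# For every date, count the rows whose [lower_month, upper_month] contains it.
s = pd.Series({y: sum(x1 <= y <= x2 for x1, x2 in vals) for y in rng})
print(s)
# February and March count 5 rows, April counts 6, May and June count 1,
# and every other month counts 0.
```

Both bounds are inclusive, which is why April is counted by the five rows ending on 2018-04-01 as well as the one row starting on it.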
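Performance:
import numpy as np
import pandas as pd
np.random.seed(123)

def random_dates(start, end, n=10000):
    # draw n random timestamps between start and end, floored to whole days
    start_u = start.value//10**9
    end_u = end.value//10**9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s').floor('d')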
d1 = random_dates(pd.to_datetime('2015-01-01'), pd.to_datetime('2018-01-01')) + pd.offsets.MonthBegin(0)
d2 = random_dates(pd.to_datetime('2018-01-01'), pd.to_datetime('2020-01-01')) + pd.offsets.MonthBegin(0)
df = pd.DataFrame({'lower_month':d1, 'upper_month':d2})
rng = pd.date_range('2015-01-01', freq='MS', periods=6 * 12)
vals = list(zip(df['lower_month'], df['upper_month']))
In [238]: %timeit pd.Series({y: [y >= x1 and y <= x2 for x1, x2 in vals].count(True) for y in rng})
158 ms ± 2.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [239]: %timeit pd.Series({y: sum(y >= x1 and y <= x2 for x1, x2 in vals) for y in rng})
221 ms ± 17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#first solution is slow
In [240]: %timeit pd.DataFrame([(y, y >= x1 and y <= x2) for x1, x2 in vals for y in rng], columns=['d','test']).groupby('d')['test'].sum().astype(int)
4.52 s ± 396 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
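All three timed variants loop in Python. As a sketch that is not part of the original answer and was not timed here, the same count can also be written with NumPy broadcasting, which compares every date against every interval in one array operation. The sample data and the names `lower`, `upper`, `inside`, and `s_vec` are illustrative assumptions.

```python
import pandas as pd

# Small hand-built sample (same intervals as the question's table).
df = pd.DataFrame({
    "lower_month": pd.to_datetime(
        ["2018-02-01", "2018-02-01", "2018-02-01",
         "2018-04-01", "2018-02-01", "2018-02-01"]),
    "upper_month": pd.to_datetime(
        ["2018-04-01", "2018-04-01", "2018-04-01",
         "2018-06-01", "2018-04-01", "2018-04-01"]),
})
rng = pd.date_range("2018-01-01", freq="MS", periods=12)

# Plain datetime64 arrays: one entry per row of df.
lower = df["lower_month"].to_numpy()
upper = df["upper_month"].to_numpy()

# Column vector of dates, shape (len(rng), 1), so it broadcasts against rows.
dates = rng.to_numpy()[:, None]

# Boolean matrix of shape (len(rng), len(df)): True where the date is in range.
inside = (dates >= lower) & (dates <= upper)

# Summing across each row of the matrix gives the count per date.
s_vec = pd.Series(inside.sum(axis=1), index=rng)
print(s_vec)
```

The trade-off is memory: the intermediate matrix holds len(rng) × len(df) booleans, which is small for 72 dates and 10,000 rows but grows with both dimensions.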
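Here I use itertools to repeat upper_month and lower_month for each index date, then compare each index date against every lower/upper month pair and set a temporary column check = 1. Finally, I group by index date and sum check.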
import pandas as pd
from io import StringIO  # pandas.compat.StringIO is not available in recent pandas versions
import itertools
#sample data
data = ('contributor_id,timestamp,edits,upper_month,lower_month\n'
'8,2018-01-01,1,2018-04-01,2018-02-01\n'
'26424341,2018-01-01,11,2018-04-01,2018-02-01\n'
'26870381,2018-02-01,465,2018-04-01,2018-02-01\n'
'28109145,2018-03-01,17,2018-06-01,2018-04-01\n')
orig_df = pd.read_csv(StringIO(data))
# sample index_dates
index_df = list(pd.Series(["2018-01-01", "2018-02-01"]))
# repeat upper_month and lower_month using itertools.product
abc = list(orig_df[['upper_month','lower_month']].values)
combine_list = [index_df,abc]
res = list(itertools.product(*combine_list))
df = pd.DataFrame(res,columns=["timestamp","range"])
#separate lower_month and upper_month from range
df['lower_month'] = df['range'].apply(lambda x : x[1])
df['upper_month'] = df['range'].apply(lambda x : x[0])
df.drop(['range'],axis=1,inplace=True)
# normalize the date columns of orig_df to ISO strings (df was already built above, so this does not change df)
orig_df['timestamp'] = pd.to_datetime(orig_df['timestamp']).dt.date.astype(str)
orig_df['upper_month'] = pd.to_datetime(orig_df['upper_month']).dt.date.astype(str)
orig_df['lower_month'] = pd.to_datetime(orig_df['lower_month']).dt.date.astype(str)
#apply condition to set check 1
df.loc[(df["timestamp"]>=df['lower_month']) & (df["timestamp"]<=df['upper_month']),"check"] = 1
#simply groupby to count check
res = df.groupby(['timestamp'])['check'].sum()
print(res)
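The same cross-product idea can be sketched without itertools by using a cross join. This is an alternative formulation, not the answerer's code, and it assumes pandas 1.2 or later, where `merge` accepts `how="cross"`. It also keeps the dates as datetimes instead of strings; the names `index_dates` and `pairs` are illustrative.

```python
import pandas as pd

# The four sample rows from the answer above, with dates parsed to datetime64.
orig_df = pd.DataFrame({
    "lower_month": pd.to_datetime(
        ["2018-02-01", "2018-02-01", "2018-02-01", "2018-04-01"]),
    "upper_month": pd.to_datetime(
        ["2018-04-01", "2018-04-01", "2018-04-01", "2018-06-01"]),
})

# The two sample index dates from the answer above.
index_dates = pd.DataFrame(
    {"timestamp": pd.to_datetime(["2018-01-01", "2018-02-01"])})

# Cross join: every index date paired with every (lower, upper) row.
pairs = index_dates.merge(orig_df, how="cross")

# 1 where the date lies inside the inclusive range, 0 otherwise.
pairs["check"] = pairs["timestamp"].between(
    pairs["lower_month"], pairs["upper_month"]).astype(int)

# Count the matches per index date.
res = pairs.groupby("timestamp")["check"].sum()
print(res)
# 2018-01-01 matches 0 rows and 2018-02-01 matches 3 rows.
```

Because check is an integer column here, the sums come out as integers rather than the floats produced by assigning 1 into a column that otherwise holds NaN.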
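Comments:
What is the size of your DataFrame?
This does work. Could you explain what exactly the line s = pd.DataFrame([(y, y >= x1 and y … does? +1
@HRDSL - Added another, better solution.
Wouldn't pd.Series({y: [y >= x1 and y …
@Stef - I can test it.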
Output of print(res) from the itertools example:
timestamp
2018-01-01    0.0
2018-02-01    3.0