Pandas: how to efficiently bucket a DataFrame and then perform groupby operations on those buckets?
Tags: pandas, numpy, pandas-groupby
I want to bucket volumes and build a summary report on the aggregated data. I currently use apply for this, but apply can be very slow on large datasets. Is there a general form of the vectorized syntax used in create_lt_ten_bucket? I suspect this is a minor thing that I am just not very familiar with.
import pandas as pd
from IPython.display import display  # already available inside a Jupyter notebook

def create_buckets(df_internal, comparison_operator, column_to_bucket, min_value, max_value, ranges_pivots):
    # Build consecutive (low, high) ranges from the pivots, bounded by min_value and max_value.
    low = [min_value] + ranges_pivots
    high = ranges_pivots + [max_value]
    ranges = list(zip(low, high))
    # Zero-pad the bucket number so the labels sort correctly as strings.
    max_str_len = len(str(max(high + low)))

    def get_value(row):
        count = 0
        for l, h in ranges:
            if comparison_operator(l, row[column_to_bucket]) and comparison_operator(row[column_to_bucket], h):
                return "{}|{}_to_{}".format(str(count).zfill(max_str_len), l, h)
            count += 1
        return "OUTOFBAND"

    # Row-wise apply: this is the slow part on large frames.
    df_internal["{}_BUCKETED".format(column_to_bucket)] = df_internal.apply(get_value, axis=1)

def create_lt_ten_bucket(df_internal, column_to_bucket):
    # Vectorized comparison on the whole column, no apply needed.
    df_internal["{}_is_lt_ten".format(column_to_bucket)] = df_internal[column_to_bucket] < 10

dftest = pd.DataFrame([1, 2, 3, 4, 5, 44, 250, 22], columns=["value_alpha"])
create_buckets(dftest, lambda v1, v2: v1 <= v2, "value_alpha", 0, 999, [1, 2, 5, 10, 25, 50, 100, 200])
display(dftest)
create_lt_ten_bucket(dftest, "value_alpha")
display(dftest)
Ultimately, I am trying to get a summary of the data similar to this:
dftest.groupby('value_alpha_BUCKETED').sum().sort_values('value_alpha_BUCKETED')
It is not entirely clear to me what you are asking, but you probably want pd.cut and pd.DataFrame.groupby:
dftest['new_bucket'] = pd.cut(dftest['value_alpha'], [0, 1, 2, 5, 10, 25, 50, 100, 200, 999])
dftest['value_alpha_is_lt_ten'] = dftest['value_alpha'] < 10
print(dftest.groupby("new_bucket").sum())
            value_alpha  value_alpha_is_lt_ten
new_bucket
(0, 1]                1                    1.0
(1, 2]                2                    1.0
(2, 5]               12                    3.0
(5, 10]               0                    0.0
(10, 25]             22                    0.0
(25, 50]             44                    0.0
(50, 100]             0                    0.0
(100, 200]            0                    0.0
(200, 999]          250                    0.0
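If the zero-padded labels from the question matter for the report, pd.cut can probably produce them directly through its labels argument. The following is only a sketch under some assumptions: bucket_column is a hypothetical helper name (not from the question or from pandas), it returns the new column instead of mutating the frame, and it assumes the comparison is always <=, which pd.cut models with right-closed intervals plus include_lowest=True.

```python
import pandas as pd


def bucket_column(df, column, min_value, max_value, pivots):
    """Hypothetical vectorized stand-in for create_buckets, built on pd.cut."""
    edges = [min_value] + list(pivots) + [max_value]
    width = len(str(max(edges)))  # same zero-padding rule as the question's code
    labels = [
        "{}|{}_to_{}".format(str(i).zfill(width), lo, hi)
        for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:]))
    ]
    # right=True gives (lo, hi] intervals; include_lowest=True also admits min_value itself.
    bucketed = pd.cut(df[column], bins=edges, labels=labels, right=True, include_lowest=True)
    # Values outside [min_value, max_value] come back as NaN; label them like the original code does.
    return bucketed.cat.add_categories(["OUTOFBAND"]).fillna("OUTOFBAND")


dftest = pd.DataFrame([1, 2, 3, 4, 5, 44, 250, 22], columns=["value_alpha"])
dftest["value_alpha_BUCKETED"] = bucket_column(
    dftest, "value_alpha", 0, 999, [1, 2, 5, 10, 25, 50, 100, 200]
)
print(dftest)
```

The resulting column is categorical, so a later groupby keeps the buckets in bin order without having to sort on the label strings.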
If you do not want the empty buckets, you can use .query to filter out the rows whose value is 0.
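Here is a minimal sketch of two ways to drop the empty buckets, reusing the same test data. The variable names (summary_observed, summary_all, summary_nonzero) are only illustrative. Passing observed explicitly is my own choice here; as far as I know, recent pandas versions warn when it is left unset for categorical groupers.

```python
import pandas as pd

dftest = pd.DataFrame([1, 2, 3, 4, 5, 44, 250, 22], columns=["value_alpha"])
dftest["new_bucket"] = pd.cut(dftest["value_alpha"], [0, 1, 2, 5, 10, 25, 50, 100, 200, 999])

# Option 1: observed=True makes groupby emit only the buckets that actually occur.
summary_observed = dftest.groupby("new_bucket", observed=True)["value_alpha"].sum()

# Option 2: aggregate every bucket, then filter out the zero-sum rows with query.
summary_all = dftest.groupby("new_bucket", observed=False)[["value_alpha"]].sum()
summary_nonzero = summary_all.query("value_alpha > 0")

print(summary_observed)
print(summary_nonzero)
```

Note that the query filter also removes a non-empty bucket whose values happen to sum to 0, so observed=True (or filtering on a count) is probably the safer choice when zeros or negative values are possible.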
For reference, the test calls from the question and their output:
dftest = pd.DataFrame([1, 2, 3, 4, 5, 44, 250, 22], columns=["value_alpha"])
create_buckets(dftest, lambda v1, v2: v1 <= v2, "value_alpha", 0, 999, [1, 2, 5, 10, 25, 50, 100, 200])
display(dftest)
create_lt_ten_bucket(dftest, "value_alpha")
display(dftest)
Output:
   value_alpha  value_alpha_BUCKETED  value_alpha_is_lt_ten
0            1            000|0_to_1                   True
1            2            001|1_to_2                   True
2            3            002|2_to_5                   True
3            4            002|2_to_5                   True
4            5            002|2_to_5                   True
5           44          005|25_to_50                  False
6          250        008|200_to_999                  False
7           22          004|10_to_25                  False

                      value_alpha  value_alpha_is_lt_ten
value_alpha_BUCKETED
000|0_to_1                      1                    1.0
001|1_to_2                      2                    1.0
002|2_to_5                     12                    3.0
004|10_to_25                   22                    0.0
005|25_to_50                   44                    0.0
008|200_to_999                250                    0.0