Python: splitting a column into three bins (A, B, C) with the same number of observations in each
My dataframe looks like this:
timestamp topAsk topBid CPA midprice CPB spread gamma_perc
72554 2018-11-17 18:43:00 0.00307084 0.00306366 0.00307085 0.00306725 0.00306725 0.00000718 0.000000
35867 2018-10-23 02:06:00 0.00445542 0.00444528 0.00445542 0.00445035 0.00445035 0.00001014 0.000000
65021 2018-11-12 10:28:00 0.00327366 0.00326954 0.00327160 0.00327160 0.00327160 0.00000412 0.000000
65020 2018-11-12 10:27:00 0.00327246 0.00326834 0.00327100 0.00327040 0.00327040 0.00000412 0.000000
65017 2018-11-12 10:24:00 0.00327756 0.00327341 0.00327548 0.00327548 0.00327548 0.00000415 0.000000
35872 2018-10-23 02:11:00 0.00445192 0.00444249 0.00445192 0.00444721 0.00444721 0.00000943 0.000000
65016 2018-11-12 10:23:00 0.00327756 0.00327341 0.00327548 0.00327548 0.00327548 0.00000415 0.000000
65015 2018-11-12 10:22:00 0.00327756 0.00327341 0.00327548 0.00327548 0.00327548 0.00000415 0.000000
65014 2018-11-12 10:21:00 0.00327756 0.00327341 0.00327548 0.00327548 0.00327548 0.00000415 0.000000
65013 2018-11-12 10:20:00 0.00327756 0.00327341 0.00327548 0.00327548 0.00327548 0.00000415 0.000000
... ... ... ... ... ... ... ... ...
82213 2018-11-24 11:43:00 0.00324989 0.00324561 0.00325211 0.00324775 0.00324114 0.00000428 154.439252
88427 2018-11-28 19:17:00 0.00342308 0.00341001 0.00342256 0.00341654 0.00339635 0.00001307 154.475899
63023 2018-11-11 01:10:00 0.00336728 0.00336673 0.00336701 0.00336701 0.00336616 5.5E-7 154.545455
17294 2018-10-10 04:32:00 0.00334544 0.00333056 0.00333802 0.00333800 0.00331500 0.00001488 154.569892
34890 2018-10-22 09:49:00 0.00437069 0.00436719 0.00436894 0.00436894 0.00436353 0.00000350 154.571429
30957 2018-10-19 16:16:00 0.00438949 0.00438403 0.00439011 0.00438676 0.00437832 0.00000546 154.578755
23556 2018-10-14 12:55:00 0.00371373 0.00370981 0.00371279 0.00371177 0.00370571 0.00000392 154.591837
38583 2018-10-24 23:22:00 0.00417979 0.00417406 0.00417915 0.00417692 0.00416806 0.00000573 154.624782
62668 2018-11-10 19:15:00 0.00339415 0.00339102 0.00339259 0.00339259 0.00338775 0.00000313 154.632588
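What I need to do is add a new column, spread_bin, and sort the spread observations into bins of equal size (A, B and C). What I have tried so far is sorting the dataframe and cutting it into 3 arrays that would become my bins, like this: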
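import numpy as np

# Sort by spread, then split the sorted values into three arrays of (nearly) equal length
df_new_sample = df_new_sample.sort_values(by='spread')
sorted_array = np.sort(df_new_sample['spread'])
split_spreads = np.array_split(sorted_array, 3)
# Label each row by comparing its spread with the last value of the first two arrays
df_new_sample['spread_bin'] = df_new_sample['spread'].apply(lambda x: 'A' if x <= split_spreads[0][-1] else ('B' if x <= split_spreads[1][-1] else 'C'))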
spread bin
39478 1E-8 A
42804 1E-8 A
42411 1E-8 A
21897 1E-8 A
27103 1E-8 A
51190 1E-8 A
42452 1E-8 A
42288 1E-8 A
717 1E-8 A
23948 1E-8 A
...
68148 0.00004299 C
76725 0.00004568 C
19495 0.00004706 C
19530 0.00004737 C
77057 0.00004761 C
17368 0.00005202 C
24590 0.00005365 C
19528 0.00006249 C
19489 0.00007012 C
19484 0.00011030 C
pandas has a built-in function, cut, that does this:
df_new_sample['spread_bin'] = pd.cut(df_new_sample['spread'], 3)
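For context on the comments further down: pd.cut with an integer splits the value range into intervals of equal width, while pd.qcut cuts at quantiles, so the two can give very different bin sizes on skewed data. The sketch below uses a small made-up series in place of the real spread column, and the A/B/C labels are passed explicitly.

```python
import pandas as pd

# Synthetic, skewed values standing in for the spread column (not the real data)
spread = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100], name="spread")

# pd.cut: three intervals of equal width between the minimum and the maximum
equal_width = pd.cut(spread, 3, labels=["A", "B", "C"])
# pd.qcut: three intervals cut at the 1/3 and 2/3 quantiles
equal_count = pd.qcut(spread, 3, labels=["A", "B", "C"])

# Compare how many observations land in each bin
print(equal_width.value_counts().sort_index())
print(equal_count.value_counts().sort_index())
```

On this sample the equal-width bins are badly unbalanced because of the single large value, while the quantile bins each hold three observations.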
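Have you looked at pd.qcut?
Your dataset happens to have duplicates at the bin edges, so it is not possible to split it into bins of exactly equal size. However, if you don't mind rows with the same spread value spilling over into another bin, you can artificially create a rank column: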
import pandas as pd

# first sort your df by `spread`
df_new_sample = df_new_sample.sort_values('spread')
# reset index so that it runs from 0 to n-1 in sorted order
df_new_sample = df_new_sample.reset_index(drop=True)
# now qcut on the index
df_new_sample['spread_bin'] = pd.qcut(df_new_sample.index, 3, labels=['A', 'B', 'C'])
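A possible variant of the same idea, not taken from the answer itself: instead of sorting and resetting the index, rank the column with method="first", which gives tied values distinct ranks in order of appearance, and run qcut on the ranks. The small frame below is synthetic and only illustrates the behaviour with ties.

```python
import pandas as pd

# Synthetic sample with many tied values (not the real data)
df = pd.DataFrame({"spread": [1e-8] * 5 + [2e-8, 3e-8, 4e-8, 5e-8]})

# rank(method="first") breaks ties by order of appearance,
# so qcut never sees a duplicated value at a bin edge
ranks = df["spread"].rank(method="first")
df["spread_bin"] = pd.qcut(ranks, 3, labels=["A", "B", "C"])

print(df)
print(df["spread_bin"].value_counts().sort_index())
```

The bins come out equally sized, at the cost that some of the rows with spread 1e-8 end up in B rather than A. The original row order and index are left untouched.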
Note: if you want exactly the same number of observations in each bin, the number of observations in your df must be divisible by 3.
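To illustrate the divisibility point, here is a minimal sketch with a hypothetical row count of 10. It uses np.array_split (the function from the question) on row positions; the leftover row goes to the first chunk, so the bin sizes differ by one.

```python
import numpy as np

# Hypothetical row count that is not divisible by 3
n_rows = 10

# np.array_split hands the leftover rows to the first chunks
chunks = np.array_split(np.arange(n_rows), 3)
sizes = [len(chunk) for chunk in chunks]
print(sizes)  # [4, 3, 3]

# One label per row position, in sorted order
labels = np.repeat(["A", "B", "C"], sizes)
print(labels)
```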
I get the same result with pd.cut: it does not produce bins with the same number of observations in each, and I'm not sure about the duplicates issue.
Oh yes, I think this is what the OP wants.
No, it doesn't give me the same number of observations in each bin…
@Scratch'N'Purr, how can I use pd.cut with the parameter retbins=True?
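On the retbins question: with retbins=True, pd.cut returns a tuple of the binned values and the array of bin edges, and pd.qcut accepts the same parameter. A minimal sketch on made-up values (the series below is not the real data):

```python
import pandas as pd

# Synthetic values standing in for the spread column
spread = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100], name="spread")

# retbins=True makes pd.cut return (binned values, array of bin edges)
binned, edges = pd.cut(spread, 3, labels=["A", "B", "C"], retbins=True)
print(edges)  # four edges delimiting the three equal-width intervals

# pd.qcut takes the same parameter and returns the quantile-based edges
binned_q, edges_q = pd.qcut(spread, 3, labels=["A", "B", "C"], retbins=True)
print(edges_q)
```

The returned edges can be passed back as the bins argument of pd.cut to apply the same intervals to another series.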