Python: split a column into three bins (A, B, C) with the same number of observations in each bin

Tags: python, pandas

My dataframe looks like this:

            timestamp              topAsk       topBid     CPA          midprice    CPB      spread   gamma_perc
    72554   2018-11-17 18:43:00 0.00307084  0.00306366  0.00307085  0.00306725  0.00306725  0.00000718  0.000000
    35867   2018-10-23 02:06:00 0.00445542  0.00444528  0.00445542  0.00445035  0.00445035  0.00001014  0.000000
    65021   2018-11-12 10:28:00 0.00327366  0.00326954  0.00327160  0.00327160  0.00327160  0.00000412  0.000000
    65020   2018-11-12 10:27:00 0.00327246  0.00326834  0.00327100  0.00327040  0.00327040  0.00000412  0.000000
    65017   2018-11-12 10:24:00 0.00327756  0.00327341  0.00327548  0.00327548  0.00327548  0.00000415  0.000000
    35872   2018-10-23 02:11:00 0.00445192  0.00444249  0.00445192  0.00444721  0.00444721  0.00000943  0.000000
    65016   2018-11-12 10:23:00 0.00327756  0.00327341  0.00327548  0.00327548  0.00327548  0.00000415  0.000000
    65015   2018-11-12 10:22:00 0.00327756  0.00327341  0.00327548  0.00327548  0.00327548  0.00000415  0.000000
    65014   2018-11-12 10:21:00 0.00327756  0.00327341  0.00327548  0.00327548  0.00327548  0.00000415  0.000000
    65013   2018-11-12 10:20:00 0.00327756  0.00327341  0.00327548  0.00327548  0.00327548  0.00000415  0.000000
    ... ... ... ... ... ... ... ... ...
    82213   2018-11-24 11:43:00 0.00324989  0.00324561  0.00325211  0.00324775  0.00324114  0.00000428  154.439252
    88427   2018-11-28 19:17:00 0.00342308  0.00341001  0.00342256  0.00341654  0.00339635  0.00001307  154.475899
    63023   2018-11-11 01:10:00 0.00336728  0.00336673  0.00336701  0.00336701  0.00336616  5.5E-7  154.545455
    17294   2018-10-10 04:32:00 0.00334544  0.00333056  0.00333802  0.00333800  0.00331500  0.00001488  154.569892
    34890   2018-10-22 09:49:00 0.00437069  0.00436719  0.00436894  0.00436894  0.00436353  0.00000350  154.571429
    30957   2018-10-19 16:16:00 0.00438949  0.00438403  0.00439011  0.00438676  0.00437832  0.00000546  154.578755
    23556   2018-10-14 12:55:00 0.00371373  0.00370981  0.00371279  0.00371177  0.00370571  0.00000392  154.591837
    38583   2018-10-24 23:22:00 0.00417979  0.00417406  0.00417915  0.00417692  0.00416806  0.00000573  154.624782
    62668   2018-11-10 19:15:00 0.00339415  0.00339102  0.00339259  0.00339259  0.00338775  0.00000313  154.632588

What I need to do is add a new column, spread_bin, and sort the spread observations into equally sized bins (A, B and C). What I have tried so far is to sort the dataframe and cut it into 3 arrays that become my bins, as follows:

df_new_sample = df_new_sample.sort_values(by='spread')
sorted_array = np.sort(df_new_sample['spread'])
split_spreads = np.array_split(sorted_array, 3)

df_new_sample['spread_bin'] = df_new_sample['spread'].apply(lambda x: 'A' if x <= split_spreads[0][-1] else ( 'B' if split_spreads[0][-1] < x <= split_spreads[1][-1] else 'C'))

spread                bin
39478          1E-8   A
42804          1E-8   A
42411          1E-8   A
21897          1E-8   A
27103          1E-8   A
51190          1E-8   A
42452          1E-8   A
42288          1E-8   A
717            1E-8   A
23948          1E-8   A
            ...    
68148    0.00004299   C
76725    0.00004568   C
19495    0.00004706   C
19530    0.00004737   C
77057    0.00004761   C
17368    0.00005202   C
24590    0.00005365   C
19528    0.00006249   C
19489    0.00007012   C
19484    0.00011030   C

Pandas has a built-in function, cut, that does this:

df_new_sample['spread_bin'] = pd.cut(df_new_sample['spread'], 3)
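
For context, pd.cut with an integer argument splits the value range into equal-width intervals, whereas pd.qcut splits at quantiles, so the two can give very different bin counts on skewed data. Below is a minimal sketch of the difference; the `spread` values are made up for illustration and are not the data from the question.

```python
import pandas as pd

# Hypothetical, skewed sample: several small values and one large outlier.
spread = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100], name="spread", dtype=float)

# pd.cut: three equal-width intervals spanning the range of the data.
width_bins = pd.cut(spread, 3, labels=["A", "B", "C"])

# pd.qcut: three quantile-based intervals, i.e. roughly equal counts.
quantile_bins = pd.qcut(spread, 3, labels=["A", "B", "C"])

# Count observations per bin, in A-B-C order.
width_counts = width_bins.value_counts().sort_index()
quantile_counts = quantile_bins.value_counts().sort_index()

print(width_counts.to_dict())
print(quantile_counts.to_dict())
```

With these values, pd.cut places eight observations in A, none in B and one in C, while pd.qcut places three in each bin.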

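Have you looked at pd.qcut?

Your dataset happens to have duplicates at the bin edges, so it is not possible to bin it into equally sized bins by value. However, if you don't mind equal spread values spilling over into another bin, you can artificially create a rank column: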

# first sort your df by `spread`
df_new_sample = df_new_sample.sort_values('spread')

# reset index
df_new_sample = df_new_sample.reset_index(drop=True)

# now qcut on the index
df_new_sample['spread_bin'] = pd.qcut(df_new_sample.index, 3, labels=['A', 'B', 'C'])

Note: if you want exactly the same number of observations in each bin, the number of observations in your df must be divisible by 3.
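
The same idea can also be sketched without sorting or resetting the index, by ranking the column first. This is an alternative formulation, not the code from the answer: it assumes ties may be broken arbitrarily by order of appearance (`rank(method="first")`), and the sample values are made up.

```python
import pandas as pd

# Hypothetical data with many tied spreads, similar in spirit to the question.
df = pd.DataFrame({"spread": [1e-8] * 5 + [2e-6, 3e-6, 4e-6, 5e-6]})

# rank(method="first") gives every row a unique rank, breaking ties by
# order of appearance, so the quantile edges computed on the ranks are unique.
ranks = df["spread"].rank(method="first")
df["spread_bin"] = pd.qcut(ranks, 3, labels=["A", "B", "C"])

# Count observations per bin, in A-B-C order.
bin_counts = df["spread_bin"].value_counts().sort_index()
print(bin_counts.to_dict())

# The tied 1e-8 values end up spread across more than one bin.
tied_bins = set(df.loc[df["spread"] == 1e-8, "spread_bin"])
print(tied_bins)
```

Here the nine rows fall three per bin, and the five tied values are divided between A and B, which is the spill-over the answer describes.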

Comments:

I get the same result with pd.cut: it does not give the bins the same number of observations, and I am not sure about the duplicates issue.

Oh yes, I think this is what the OP wants.

No, it does not give me the same number of observations in each bin...

@Scratch'N'Purr, how do I use pd.cut with the parameter retbins=True?
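
On the retbins question: passing retbins=True to pd.cut (pd.qcut accepts the same parameter) makes it return a tuple of the binned values and the array of bin edges. A small sketch with made-up values:

```python
import pandas as pd

# Hypothetical values; any numeric Series works the same way.
spread = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])

# With retbins=True, pd.cut returns (binned values, bin edges).
binned, edges = pd.cut(spread, 3, labels=["A", "B", "C"], retbins=True)

print(binned.tolist())
print(edges)  # four edges delimit the three bins
```

The returned edges can then be reused, for example by passing them as the bins argument of a later pd.cut call so that new data is binned with the same boundaries.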