Python DataFrame: operations on each batch of rows

python pandas performance


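I have a pandas DataFrame df, and I want to compute some statistics for each batch of its rows.

For example, suppose I have batch_size = 200000. For each batch of batch_size rows, I would like the number of unique values in the ID column of the DataFrame.

How can I do something like that?

Here is an example of what I want: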

print(df)

>>
+-------+
|     ID|
+-------+
|      1|
|      1|
|      2|
|      2|
|      2|
|      3|
|      3|
|      3|
|      3|
+-------+

batch_size = 3

my_new_function(df,batch_size)

>>
For batch 1 (0 to 2) :
2 unique values 
1 appears 2 times
2 appears 1 time

For batch 2 (3 to 5) : 
2 unique values 
2 appears 2 times
3 appears 1 time

For batch 3 (6 to 8) :
1 unique value
3 appears 3 times

Note: the output can of course be a simple DataFrame.

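First split the DataFrame into batches (see the existing answers on splitting a DataFrame). After that, for each batch batch_df, I would do: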

from collections import Counter
Counter(batch_df['ID'].tolist())
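To make the Counter idea concrete, here is a minimal sketch that slices the frame into consecutive batches by position and counts the IDs in each one. The helper name batch_id_counts and the iloc-based slicing are assumptions for illustration, not necessarily the splitting method the answer refers to.

```python
from collections import Counter

import pandas as pd


def batch_id_counts(df, batch_size, column="ID"):
    """Return one Counter per consecutive batch of `batch_size` rows.

    Hypothetical helper: the name and signature are illustrative only.
    """
    counters = []
    for start in range(0, len(df), batch_size):
        # iloc slices by position, so the last batch may be shorter.
        batch_df = df.iloc[start:start + batch_size]
        counters.append(Counter(batch_df[column].tolist()))
    return counters


df = pd.DataFrame({"ID": [1, 1, 2, 2, 2, 3, 3, 3, 3]})
for number, counts in enumerate(batch_id_counts(df, batch_size=3), start=1):
    print(f"Batch {number}: {len(counts)} unique values, counts = {dict(counts)}")
```

On the sample data from the question this gives 2, 2 and 1 unique values for the three batches. Each Counter also holds the per-ID frequencies, which covers the "appears N times" part of the requested output.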

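For the splitting step, see the answers on splitting a DataFrame; you can then do the following to get the number of unique 'ID' values in each batch: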

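import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 2, 2, 3, 3, 3, 3]})
batch_size = 3
result = []
# np.arange(len(df)) // batch_size labels rows 0-2 as batch 0, rows 3-5 as batch 1, ...
for batch_number, batch_df in df.groupby(np.arange(len(df)) // batch_size):
    result.append(batch_df['ID'].nunique())
pd.DataFrame(result)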
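The same grouping key can also be used without an explicit Python loop, which may matter for large frames. The following is a sketch under the same sample data; the names batch_ids, unique_per_batch and counts_per_batch are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": [1, 1, 2, 2, 2, 3, 3, 3, 3]})
batch_size = 3

# Batch number of every row (0, 0, 0, 1, 1, 1, 2, 2, 2 for this example),
# aligned with df's index so it can be used as a grouping key.
batch_ids = pd.Series(np.arange(len(df)) // batch_size, index=df.index, name="batch")

# Number of distinct IDs in each batch, as a Series indexed by batch number.
unique_per_batch = df.groupby(batch_ids)["ID"].nunique()

# How often each ID occurs in each batch, as a batch-by-ID table (0 = absent).
counts_per_batch = pd.crosstab(batch_ids, df["ID"])

print(unique_per_batch)
print(counts_per_batch)
```

Both results are ordinary pandas objects, so they fit the note in the question that the output may simply be a DataFrame.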

Edit: go with user3426270's answer; I had not noticed it when I wrote mine.

Using a custom aggregation function may solve your problem:

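import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 2, 2, 3, 3, 3, 3], 'X': 1})

batch_size = 3
# Label each row with its batch number by integer division of its position.
# (The original pd.cut call passed a float bin count and used equal-width
# bins, which do not give batch_size rows each when len(df) is not a multiple.)
df.index = np.arange(len(df)) // batch_size

def myFunc(batch_data: pd.Series):
    # Called once per column per batch; returns the number of distinct values.
    # print(batch_data.unique(), '\n')
    return batch_data.nunique()

output1 = df.groupby(df.index).aggregate({'ID': myFunc})
output2 = df.groupby(df.index).aggregate(myFunc)
output3 = df.groupby(df.index).aggregate({'ID': myFunc, 'X': 'std'})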

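Output:

# print(output1)
   ID
0   2
1   2
2   1

# print(output2)
   ID  X
0   2  1
1   2  1
2   1  1

# print(output3)
   ID    X
0   2  0.0
1   2  0.0
2   1  0.0

Create a df_batch and then try df_batch.groupby('ID').drop_duplicates().size()

There is no need to group by ID here. In my opinion you can use df_batch.drop_duplicates(subset=['ID']).shape[0]. But that still does not answer the question: what do you mean by a batch? Is it a random set of 200000 rows?

For example, please post a sample input df and the expected output for a smaller batch_size (batch_size=3).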
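As a sketch of the drop_duplicates suggestion from the comments: for a single batch, the number of rows left after dropping repeated IDs equals the number of unique IDs. The name df_batch follows the comment; picking the second batch by position is an assumption made only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 1, 2, 2, 2, 3, 3, 3, 3]})
batch_size = 3

# Take the second batch (rows 3 to 5) as an example.
df_batch = df.iloc[batch_size:2 * batch_size]

# Rows that remain after removing repeated IDs = number of distinct IDs.
n_unique = len(df_batch.drop_duplicates(subset=["ID"]))

# nunique() gives the same number without building the intermediate frame.
assert n_unique == df_batch["ID"].nunique()
print(n_unique)
```

For a plain count, nunique() is usually the more direct call; drop_duplicates is more useful when the deduplicated rows themselves are needed afterwards.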