Fastest way to create a new list of dicts using an index into a Python DataFrame

I have about 200 million entries in a dictionary index_data:
index_data = [
{3396623046050748: [0, 1],
3749192045350356: [2],
4605074846433127: [3],
112884719857303: [4],
507466746864539: [5],
.....
}
]
The keys are values from the CustID column, and the values are the row indexes of that CustID in df_data.
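Incidentally, a mapping shaped like index_data can be built straight from the DataFrame with groupby(...).indices; a minimal sketch on a small stand-in frame (the three-row df_data here is an assumption for illustration):

```python
import pandas as pd

# Small stand-in for df_data, mirroring the layout in the question
df_data = pd.DataFrame({
    'CustID': [3396623046050748, 3396623046050748, 3749192045350356],
    'Score': [2, 6, 1],
})

# groupby(...).indices maps each CustID to the positional row indexes
# that hold it -- the same shape as index_data in the question
index_data = {k: v.tolist() for k, v in df_data.groupby('CustID').indices.items()}
```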
I have a dataframe df_data:
CustID Score Number1 Number2 Phone
3396623046050748 2 2 3 0000
3396623046050748 6 2 3 0000
3749192045350356 1 56 23 2222
4605074846433127 67 532 321 3333
112884719857303 3 11 66 4444
507466746864539 7 22 96 5555
Note: if a CustID is duplicated, only the Score column differs between its rows.
I want to create a new list of dicts (Total_Score is the average Score for each CustID, and Number is Number2 divided by Number1):
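As a worked example of the two derived fields, take the first CustID from the sample rows above (this illustration is not part of the original question):

```python
# CustID 3396623046050748 appears in two rows with Score 2 and 6;
# Number1 and Number2 are 2 and 3 on both rows
total_score = (2 + 6) / 2  # average Score for this CustID
number = 3 / 2             # Number2 divided by Number1
```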
My solution is to loop over my dictionary and use multiprocessing:

from multiprocessing import Process, Manager

def calculateTime(ns, value):
    # Get the shared data in each process
    df_data2 = ns.df_data
    result2 = ns.result
    # Create a new DF from the index list and the old DF
    df_sampleresult = df_data2.loc[value].reset_index(drop=True)
    # Create a sample dict holding the data to append to the final result
    dict_sample = dict()
    dict_sample['CustID'] = df_sampleresult['CustID'][0]
    dict_sample['Total_Score'] = df_sampleresult['Score'].mean()
    result2.append(dict_sample)
    ns.result = result2

if __name__ == '__main__':
    result = list()
    manager = Manager()
    ns = manager.Namespace()
    ns.df_data = df_data
    ns.result = result
    job = [Process(target=calculateTime, args=(ns, value)) for key, value in
           index_data.items()]
    _ = [p.start() for p in job]
    _ = [p.join() for p in job]
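For what it's worth, Manager.Namespace pickles the whole DataFrame on every ns.df_data access, and starting one Process per key would mean roughly 200 million processes, which would explain the slowness and memory use. If multiprocessing is kept at all, a Pool chunking the keys over a module-level frame is closer to workable; a hedged sketch using tiny stand-ins for df_data and index_data (both assumptions here):

```python
from multiprocessing import Pool

import pandas as pd

# Tiny stand-ins for the question's df_data and index_data
df_data = pd.DataFrame({
    'CustID': [3396623046050748, 3396623046050748, 3749192045350356],
    'Score': [2, 6, 1],
})
index_data = {3396623046050748: [0, 1], 3749192045350356: [2]}

def calculate(item):
    # Workers read the module-level df_data instead of a shared Namespace
    cust_id, rows = item
    sub = df_data.loc[rows]
    return {'CustID': cust_id, 'Total_Score': sub['Score'].mean()}

if __name__ == '__main__':
    # A fixed pool of workers chunks the keys instead of spawning
    # one Process per CustID
    with Pool(processes=4) as pool:
        result = pool.map(calculate, index_data.items())
```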
But it does not work: performance is slow and memory usage is high. Is my multiprocessing set up correctly? Is there another way to do this?
In [353]: df
Out[353]:
CustID Score Number1 Number2 Phone
0 3396623046050748 2 2 3 0000
1 3396623046050748 6 2 3 0000
2 3749192045350356 1 56 23 2222
3 4605074846433127 67 532 321 3333
4 112884719857303 3 11 66 4444
5 507466746864539 7 22 96 5555
In [351]: d = df.groupby(['CustID', 'Phone', round(df.Number2.div(df.Number1), 2)])['Score'].mean().reset_index(name='Total_Score').rename(columns={'level_2': 'Number'}).to_dict('records')
In [352]: d
Out[352]:
[{'CustID': 112884719857303, 'Phone': 4444, 'Number': 6.0, 'Total_Score': 3},
{'CustID': 507466746864539, 'Phone': 5555, 'Number': 4.36, 'Total_Score': 7},
{'CustID': 3396623046050748, 'Phone': 0000, 'Number': 1.5, 'Total_Score': 4},
{'CustID': 3749192045350356, 'Phone': 2222, 'Number': 0.41, 'Total_Score': 1},
{'CustID': 4605074846433127, 'Phone': 3333, 'Number': 0.6, 'Total_Score': 67}]
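The one-liner above can be unpacked for readability; a self-contained version reproducing the question's frame, with Phone held as strings so the leading zeros survive (an assumption), and the ratio Series named 'Number' up front so the level_2 rename is no longer needed. Grouping on the rounded Number2/Number1 ratio is safe here because, per the note, those columns are constant within a CustID:

```python
import pandas as pd

df = pd.DataFrame({
    'CustID': [3396623046050748, 3396623046050748, 3749192045350356,
               4605074846433127, 112884719857303, 507466746864539],
    'Score': [2, 6, 1, 67, 3, 7],
    'Number1': [2, 2, 56, 532, 11, 22],
    'Number2': [3, 3, 23, 321, 66, 96],
    'Phone': ['0000', '0000', '2222', '3333', '4444', '5555'],
})

# Number1/Number2/Phone are constant per CustID, so they can join the
# grouping key; only Score actually needs aggregating
number = round(df.Number2.div(df.Number1), 2).rename('Number')
d = (df.groupby(['CustID', 'Phone', number])['Score']
       .mean()
       .reset_index(name='Total_Score')
       .to_dict('records'))
```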