Python: optimizing validation of many records
I have a file with about 500K records. Each record needs to be validated. The records are de-duplicated and stored in a list:
with open(filename) as f:
    records = f.readlines()
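The validation file I use is stored in a DataFrame.
This DataFrame contains about 80K records and 9 columns (myfile.csv).
I am running on macOS with 8 GB of RAM and a 2.3 GHz Intel Core i7.
Profiling the check function on its own with cProfile.run shows:
4253 function calls (4199 primitive calls) in 0.017 seconds.
So I assume 500K records will take about 2 1/2 hours (500,000 × 0.017 s ≈ 8,500 s).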
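For reference, the measurement and the extrapolation above can be reproduced roughly as follows. This is only a sketch: `check_record` is a hypothetical stand-in for the real check function (which is not shown here), and the estimate assumes every record costs the same as the one profiled call.

```python
import cProfile
import io
import pstats


def check_record(record):
    # Hypothetical stand-in for the real per-record validation function.
    return len(record) == 10 and record.isdigit()


def profile_single_call(func, *args):
    """Profile one call of func; return its result and the stats report text."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(5)
    return result, buffer.getvalue()


def estimate_hours(seconds_per_record, n_records):
    """Linear extrapolation: assumes a constant cost per record."""
    return seconds_per_record * n_records / 3600.0


result, report = profile_single_call(check_record, "2125550123")
hours = estimate_hours(0.017, 500_000)  # figures taken from the question
```

A single profiled call is a noisy basis for an estimate, so timing a few thousand records and scaling from that would probably be more reliable.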
Without data available, consider this untested approach: run two left-join merges of the two DataFrames, then run the validation steps. This avoids any loops and runs the conditional logic across columns:

import pandas as pd
import numpy as np

with open('RecordsValidate.txt') as f:
    records = f.readlines()
print(records)

# Split each record into area code, office code, and subscriber number.
# int() ignores the trailing newline left by readlines(); converting the
# subscriber number assumes SUBSCRIBER_START/SUBSCRIBER_END are numeric.
rdf = pd.DataFrame({'rcd_id': list(range(1, len(records) + 1)),
                    'rcd_area_code': [int(rcd[:3]) for rcd in records],
                    'rcd_office_code': [int(rcd[3:6]) for rcd in records],
                    'rcd_subscriber_number': [int(rcd[6:]) for rcd in records]})

filename = 'myfile.csv'
df = pd.read_csv(filename)

# VALIDATE AREA CODE
mrgdf = pd.merge(df, rdf, how='left', left_on=['AREA_CODE'], right_on=['rcd_area_code'])
# Test the merged column, not the literal string 'rcd_id'
mrgdf['RETURN'] = np.where(mrgdf['rcd_id'].isnull(), 'INVALID_AREA_CODE', None)
mrgdf.drop(columns=list(rdf.columns), inplace=True)

# VALIDATE OFFICE CODE
mrgdf = pd.merge(mrgdf, rdf, how='left', left_on=['AREA_CODE', 'OFFICE_CODE'],
                 right_on=['rcd_area_code', 'rcd_office_code'])
mrgdf['RETURN'] = np.where(mrgdf['rcd_id'].isnull(), 'INVALID_OFFICE_CODE', mrgdf['RETURN'])

# VALIDATE SUBSCRIBER
mrgdf['RETURN'] = np.where((mrgdf['rcd_subscriber_number'] < mrgdf['SUBSCRIBER_START']) |
                           (mrgdf['rcd_subscriber_number'] > mrgdf['SUBSCRIBER_END']) |
                           (mrgdf['LABEL'].str.len() == 0),
                           'INVALID_SUBSCRIBER', mrgdf['RETURN'])
mrgdf.drop(columns=list(rdf.columns), inplace=True)
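As a small, self-contained illustration of the same merge-based idea, here is a sketch that joins in the other direction (records onto the validation table), so that each record ends up with exactly one status. The sample data, the `validate_records` helper, and the `VALID` label are all assumptions made for the example; only the column names are taken from the code above. It also de-duplicates the join keys first, since merging on area code alone can multiply rows.

```python
import pandas as pd

# Hypothetical validation table standing in for myfile.csv.
df = pd.DataFrame({
    "AREA_CODE": [212, 212, 415],
    "OFFICE_CODE": [555, 556, 555],
    "SUBSCRIBER_START": [0, 1000, 0],
    "SUBSCRIBER_END": [4999, 1999, 9999],
})

# Hypothetical records, as readlines() would return them.
records = ["2125550123\n", "2125569999\n", "3105550000\n", "4159990000\n"]
rdf = pd.DataFrame({
    "rcd_area_code": [int(r[:3]) for r in records],
    "rcd_office_code": [int(r[3:6]) for r in records],
    "rcd_subscriber_number": [int(r[6:]) for r in records],
})


def validate_records(rdf, df):
    """Return a copy of rdf with a RETURN status column, one row per record."""
    out = rdf.reset_index(drop=True).copy()
    out["rcd_id"] = out.index  # the index doubles as the record id

    # Area code: plain vectorized membership test.
    area_ok = out["rcd_area_code"].isin(df["AREA_CODE"])

    # Office code: left merge against unique key pairs, flagged by indicator.
    offices = df[["AREA_CODE", "OFFICE_CODE"]].drop_duplicates()
    merged = out.merge(offices, how="left",
                       left_on=["rcd_area_code", "rcd_office_code"],
                       right_on=["AREA_CODE", "OFFICE_CODE"],
                       indicator=True)
    office_ok = merged.set_index("rcd_id")["_merge"] == "both"

    # Subscriber: a record passes if it falls inside any matching range.
    ranges = out.merge(df, how="left",
                       left_on=["rcd_area_code", "rcd_office_code"],
                       right_on=["AREA_CODE", "OFFICE_CODE"])
    in_range = ranges["rcd_subscriber_number"].between(
        ranges["SUBSCRIBER_START"], ranges["SUBSCRIBER_END"])
    sub_ok = in_range.groupby(ranges["rcd_id"]).any()

    # Later assignments override earlier ones, so the most basic failure wins.
    out["RETURN"] = "VALID"
    out.loc[~sub_ok, "RETURN"] = "INVALID_SUBSCRIBER"
    out.loc[~office_ok, "RETURN"] = "INVALID_OFFICE_CODE"
    out.loc[~area_ok, "RETURN"] = "INVALID_AREA_CODE"
    return out


result = validate_records(rdf, df)
```

Whether the join should run from the records or from the validation table depends on what the final report needs, so treat the direction here as one option rather than a correction.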
You might get some easy speedup by using numpy.any instead of the built-in any.
Instead of checking whether any value in df.area matches the area code, flip it around: check whether area_code in set(df['area_code']). Membership in a set is an O(1) check, versus O(N) for comparing every item in a list to the value. You will need to build the set of area codes ahead of time and pass it to the function as a third argument, because building the set is itself O(N).
Where is the destination arg used? This function looks like it only validates within the DataFrame, without comparing against any external value.
@Parfait I just updated the function.
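The set-membership suggestion from the comments could look something like the sketch below. The function name `area_code_is_valid`, its three-argument signature, and the sample values are hypothetical, since the original check function is not shown; the point is only that the set is built once, outside the per-record loop.

```python
def area_code_is_valid(record, validation_rows, area_code_set):
    """Check a record's area code against a prebuilt set.

    validation_rows is unused here; it only mirrors the idea of passing
    the set as a third argument alongside the existing two.
    """
    return int(record[:3]) in area_code_set  # O(1) average-case lookup


# Hypothetical validation data: (area code, office code) pairs.
validation_rows = [(212, 555), (212, 556), (415, 555)]

# Build the set once, up front: this step is O(N).
area_code_set = {area for area, _office in validation_rows}

records = ["2125550123", "3105550000", "4155551234"]
flags = [area_code_is_valid(r, validation_rows, area_code_set) for r in records]
```

With a pandas column, the equivalent would be building `set(df['AREA_CODE'])` once before the loop, assuming that is the column name in the validation file.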