Python: optimizing validation of many records

I have a file with roughly 500K records. Each record needs to be validated. The records are de-duplicated and stored in a list:

with open(filename) as f:
    records = f.readlines()
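The validation data I use is stored in a DataFrame. This DataFrame holds about 80K records and 9 columns (myfile.csv).

I'm running macOS with 8 GB of RAM and a 2.3 GHz Intel Core i7.

Checking the function in isolation with cProfile.run shows: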

4253 function calls (4199 primitive calls) in 0.017 seconds.

So I assume that 500K records will take about 2 1/2 hours.
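The estimate above is the profiled per-record time multiplied by the record count. A minimal sketch of that arithmetic, using only the two figures quoted in the question (the variable names are illustrative):

```python
# Extrapolate total runtime from the profiled per-record cost quoted above.
seconds_per_record = 0.017   # from the cProfile output for one record
n_records = 500_000          # approximate size of the records file

total_seconds = seconds_per_record * n_records
total_hours = total_seconds / 3600

print(f"{total_seconds:.0f} s, about {total_hours:.2f} hours")
```

This assumes the cost per record stays constant, which ignores any profiler overhead in the single-record measurement.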

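Without data available, consider this untested approach: two left-join merges of the two data sets, followed by the validation steps. This avoids any loops and runs the conditional logic across columns: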

import pandas as pd
import numpy as np

# Read the records to validate, one per line
with open('RecordsValidate.txt') as f:
    records = f.readlines()

# Split each record into area code, office code and subscriber number.
# int() ignores the trailing newline; numeric SUBSCRIBER_START/END
# columns in myfile.csv are assumed.
rdf = pd.DataFrame({'rcd_id': list(range(1,len(records)+1)),
                    'rcd_area_code': [int(rcd[:3]) for rcd in records],
                    'rcd_office_code': [int(rcd[3:6]) for rcd in records],
                    'rcd_subscriber_number': [int(rcd[6:]) for rcd in records]})

filename = 'myfile.csv'
df = pd.read_csv(filename)

# VALIDATE AREA CODE
# Rows with no matching record have a null rcd_id after the left join
mrgdf = pd.merge(df, rdf, how='left', left_on=['AREA_CODE'], right_on=['rcd_area_code'])
mrgdf['RETURN'] = np.where(mrgdf['rcd_id'].isnull(), 'INVALID_AREA_CODE', np.nan)

# Drop the record columns so the next merge does not duplicate them
mrgdf.drop([c for c in rdf.columns], inplace=True,axis=1)

# VALIDATE OFFICE CODE
mrgdf = pd.merge(mrgdf, rdf, how='left', left_on=['AREA_CODE', 'OFFICE_CODE'],
                 right_on=['rcd_area_code', 'rcd_office_code'])
mrgdf['RETURN'] = np.where(mrgdf['rcd_id'].isnull(), 'INVALID_OFFICE_CODE', mrgdf['RETURN'])

# VALIDATE SUBSCRIBER
# The number must fall inside the start/end range and the label must be non-empty
mrgdf['RETURN'] = np.where((mrgdf['rcd_subscriber_number'] < mrgdf['SUBSCRIBER_START']) |
                           (mrgdf['rcd_subscriber_number'] > mrgdf['SUBSCRIBER_END']) |
                           (mrgdf['LABEL'].str.len() == 0),
                           'INVALID_SUBSCRIBER', mrgdf['RETURN'])
mrgdf.drop([c for c in rdf.columns], inplace=True,axis=1)
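
To see how a left join flags unmatched rows, here is a small sketch on made-up data. It is not the answer's code: it joins from the records side and uses the `indicator` parameter of `pd.merge` instead of a null check, and the sample values and the 'OK' label are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names mirror the answer above.
rdf = pd.DataFrame({'rcd_id': [1, 2, 3],
                    'rcd_area_code': [212, 999, 415]})
df = pd.DataFrame({'AREA_CODE': [212, 415],
                   'LABEL': ['New York', 'San Francisco']})

# Left-join the records onto the reference table. indicator=True adds a
# '_merge' column telling whether each left row found a match.
merged = pd.merge(rdf, df, how='left',
                  left_on='rcd_area_code', right_on='AREA_CODE',
                  indicator=True)

# Rows present only on the left side had no matching area code.
merged['RETURN'] = np.where(merged['_merge'] == 'left_only',
                            'INVALID_AREA_CODE', 'OK')
print(merged[['rcd_id', 'rcd_area_code', 'RETURN']])
```

Joining from the records side keeps one output row per record, provided the reference table has at most one row per area code; with several office rows per area code the join would fan out.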

Using numpy.any in place of the built-in any may give you some easy speedup. Rather than checking whether any value in df.area matches the area code, flip it around: check whether area_code is in set(df['area_code']). Set membership is O(1), versus O(N) for comparing every item in a list against the value. You would need to build the set of area codes ahead of time and pass it to the function as a third argument, because building the set is itself O(N).

Where is the destination arg used? This function looks like it only validates within the DataFrame and does not compare any outside value.

@Parfait Just updated the function.
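
The set-membership suggestion from the first comment could look roughly like this. The function name, its signature, the sample records and the `AREA_CODE` column are assumptions, since the original validation function is not shown here.

```python
import pandas as pd

# Hypothetical reference data; only the AREA_CODE column matters here.
df = pd.DataFrame({'AREA_CODE': [212, 415, 646]})

# Build the set once, outside the per-record function, because building
# it is O(N). tolist() converts the values to plain Python ints.
area_codes = set(df['AREA_CODE'].tolist())

def has_valid_area_code(record, area_codes):
    """Return True if the record's first three digits are a known area code."""
    return int(record[:3]) in area_codes   # average O(1) lookup

records = ['2125551234\n', '9995551234\n']
results = [has_valid_area_code(r, area_codes) for r in records]
print(results)
```

Passing the prebuilt set in as an argument means the O(N) construction cost is paid once rather than once per record.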