Speeding up a pandas apply on a large dataset

Tags: python, sql-server, pandas, lambda

I have a pandas table with two columns, QuarterHourDimID and StartDateDimID; together these columns assign an ID to each date/quarter-hour pairing. For example, for 12:15 PM on January 1, 2015, StartDateDimID would equal 1097 and QuarterHourDimID would equal 26. This is how the data I'm reading in is organized.

It's a large table, ~450M rows by ~60 columns, that I'm reading with pyodbc and pandas.read_sql(), so performance is a concern.

To parse the QuarterHourDimID and StartDateDimID columns into a usable datetime index, I'm running an apply function over every row to create an additional datetime column.

My code for reading the table without the extra parsing runs in about 800 ms; however, when I run the apply function it adds roughly 4 seconds to the total runtime, bringing the query to between 5.8 and 6 seconds. The returned df is ~45K rows by 5 columns (~450 days × ~100 quarter-hour segments).

I'm hoping to rewrite what I have more efficiently, and I'd appreciate any input along the way.
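As a sanity check on the ID scheme described above (assuming, per the code below, that day 1 corresponds to 2012-01-01 and that quarter-hour slot 1 starts at 06:00), the example pairing can be reproduced in a few lines:

```python
from datetime import datetime, timedelta

# Assumed scheme: StartDateDimID counts days from 2012-01-01 (day 1),
# and QuarterHourDimID counts 15-minute slots starting at 06:00 (slot 1).
init_date = datetime(2012, 1, 1)

date = init_date + timedelta(days=1097 - 1)                 # 2015-01-01
moment = date + timedelta(hours=6, minutes=(26 - 1) * 15)   # 12:15 PM
print(moment)  # 2015-01-01 12:15:00
```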
Here is the code I've written so far:
import pandas as pd
from datetime import datetime, timedelta
import pyodbc

def table(network, demo):
    connection_string = "DRIVER={SQL Server};SERVER=OURSERVER;DATABASE=DB"
    sql = """SELECT [ID],[StartDateDimID],[DemographicGroupDimID],[QuarterHourDimID],[Impression] FROM TABLE_NAME
             WHERE (MarketDimID = 1
             AND RecordTypeDimID = 2
             AND EstimateTypeDimID = 1
             AND DailyOrWeeklyDimID = 1
             AND RecordSequenceCodeDimID = 5
             AND ViewingTypeDimID = 4
             AND NetworkDimID = {}
             AND DemographicGroupDimID = {}
             AND QuarterHourDimID IS NOT NULL)""".format(network, demo)
    with pyodbc.connect(connection_string) as cnxn:
        df = pd.read_sql(sql=sql, con=cnxn, index_col=None)

    def time_map(quarter_hour, date):
        if quarter_hour > 72:
            return date + timedelta(minutes=(quarter_hour % 73)*15)
        return date + timedelta(hours=6, minutes=(quarter_hour-1)*15)

    map_date = {}
    init_date = datetime(year=2012, month=1, day=1)
    for x in df.StartDateDimID.unique():
        map_date[x] = init_date + timedelta(days=int(x)-1)

    # this is the part of my code that is likely bogging things down
    df['datetime'] = df.apply(lambda row: time_map(int(row['QuarterHourDimID']),
                                                   map_date[row['StartDateDimID']]),
                              axis=1)
    if network == 1278:
        df = df.loc[df.groupby('datetime')['Impression'].idxmin()]
    df = df.set_index(['datetime'])
    return df
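For comparison, here is a sketch (not from the original post) of a vectorized alternative that replaces the row-wise apply with pd.to_timedelta arithmetic; time_map's branch on quarter-hour IDs above 72 is reproduced with Series.where:

```python
import pandas as pd
from datetime import datetime

def add_datetime_vectorized(df, init_date=datetime(2012, 1, 1)):
    """Compute the 'datetime' column without a row-wise apply."""
    # Map each StartDateDimID to its calendar date (day 1 = init_date).
    dates = init_date + pd.to_timedelta(df['StartDateDimID'].astype(int) - 1,
                                        unit='D')
    qh = df['QuarterHourDimID'].astype(int)
    # Slots 1-72 start at 06:00; slots above 72 wrap past midnight.
    offset = pd.to_timedelta((qh % 73) * 15, unit='m').where(
        qh > 72,
        pd.to_timedelta(360 + (qh - 1) * 15, unit='m'))
    out = df.copy()
    out['datetime'] = dates + offset
    return out
```

This computes the same values as time_map for every row but in a handful of array operations instead of one Python call per row.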
Just to give an example of performing the datetime conversion in SQL rather than in pandas with time_map: the code above averaged 6.4 s per run, and by rewriting it entirely in SQL I got the average execution time down to about 640 ms per run. Updated code:
import pandas as pd
import pyodbc
SQL_QUERY ="""
SELECT [Impressions] = MIN(naf.Impression), [datetime] = DATEADD(minute,td.Minute,DATEADD(hour,td.Hour,CONVERT(smalldatetime, ddt.DateKey)))
FROM [dbo].[NielsenAnalyticsFact] AS naf
LEFT JOIN [dbo].[DateDim] AS ddt
ON naf.StartDateDimID = ddt.DateDimID
LEFT JOIN [dbo].[TimeDim] as td
ON naf.QuarterHourDimID = td.TimeDimID
WHERE (naf.NielsenMarketDimID = 1
AND naf.RecordTypeDimID = 2
AND naf.AudienceEstimateTypeDimID = 1
AND naf.DailyOrWeeklyDimID = 1
AND naf.RecordSequenceCodeDimID = 5
AND naf.ViewingTypeDimID = 4
AND naf.NetworkDimID = 1278
AND naf.DemographicGroupDimID = 3
AND naf.QuarterHourDimID IS NOT NULL)
GROUP BY DATEADD(minute,td.Minute,DATEADD(hour,td.Hour,CONVERT(smalldatetime, ddt.DateKey)))
ORDER BY DATEADD(minute,td.Minute,DATEADD(hour,td.Hour,CONVERT(smalldatetime, ddt.DateKey))) ASC
"""
%%timeit -n200
with pyodbc.connect(DB_CREDENTIALS) as cnxn:
df = pd.read_sql(sql=SQL_QUERY,
con=cnxn,
index_col=None)
200 loops, best of 3: 613 ms per loop
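If the dimension tables are small enough to pull into memory, the same lookup can also be done on the pandas side with merge instead of a row-wise apply. A sketch, assuming the DateDim/TimeDim column names used in the query above:

```python
import pandas as pd

def join_dims(fact, date_dim, time_dim):
    """Join small date/time dimension tables onto the fact table,
    then build the datetime column with vectorized arithmetic."""
    df = fact.merge(date_dim[['DateDimID', 'DateKey']],
                    left_on='StartDateDimID', right_on='DateDimID', how='left')
    df = df.merge(time_dim[['TimeDimID', 'Hour', 'Minute']],
                  left_on='QuarterHourDimID', right_on='TimeDimID', how='left')
    df['datetime'] = (pd.to_datetime(df['DateKey'])
                      + pd.to_timedelta(df['Hour'], unit='h')
                      + pd.to_timedelta(df['Minute'], unit='m'))
    return df.drop(columns=['DateDimID', 'TimeDimID'])
```

This mirrors what the SQL query does with its two LEFT JOINs, just executed in pandas rather than on the server.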
I would try adding a virtual column on the SQL Server side using SQL Server's DATEADD() function and a CASE ... WHEN ... THEN ... ELSE expression; it should be much faster.

Thanks Max, that did work. I saw a huge improvement in performance.