Python pandas apply speed on a large dataset

I have a pandas table with two columns, QuarterHourDimID and StartDateDimID; these columns each give an ID for a date/quarter-hour pairing. For example, for 12:15 PM on January 1st, 2015, StartDateDimID would equal 1097 and QuarterHourDimID would equal 26. That is how the data I'm reading in is organized.
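
(For concreteness, a quick check of how those two IDs resolve to that timestamp, assuming day IDs count from 2012-01-01 and quarter-hour IDs 1 through 72 start at 6:00 AM, as the code further down does:)

from datetime import datetime, timedelta

# day 1097 counted from 2012-01-01, quarter-hour slot 26 counted from 6:00 AM
day = datetime(2012, 1, 1) + timedelta(days=1097 - 1)      # -> 2015-01-01
stamp = day + timedelta(hours=6, minutes=(26 - 1) * 15)    # -> 2015-01-01 12:15:00
print(stamp)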

It is a large table that I'm reading with pyodbc and pandas.read_sql(), ~450M rows and ~60 columns, so performance is an issue.

To parse the QuarterHourDimID and StartDateDimID columns into a usable datetime index, I'm running an apply function on every row to create an additional datetime column.

My code for reading the table without the additional parsing takes about 800 ms; however, when I run this apply function it adds roughly 4 seconds to the total runtime (total query time expected between 5.8 and 6 seconds). The df returned is ~45K rows and 5 columns (~450 days * ~100 quarter-hour segments).

I'm hoping to rewrite what I've written more efficiently, and would appreciate any input along the way.

Here is the code I've written thus far:

import pandas as pd
from datetime import datetime, timedelta
import pyodbc

def table(network, demo):
    connection_string = "DRIVER={SQL Server};SERVER=OURSERVER;DATABASE=DB"
    sql = """SELECT [ID],[StartDateDimID],[DemographicGroupDimID],[QuarterHourDimID],[Impression] FROM TABLE_NAME
        WHERE (MarketDimID = 1
        AND RecordTypeDimID = 2
        AND EstimateTypeDimID = 1
        AND DailyOrWeeklyDimID = 1
        AND RecordSequenceCodeDimID = 5
        AND ViewingTypeDimID = 4
        AND NetworkDimID = {}
        AND DemographicGroupDimID = {}
        AND QuarterHourDimID IS NOT NULL)""".format(network, demo)

    with pyodbc.connect(connection_string) as cnxn:
        df = pd.read_sql(sql=sql, con=cnxn, index_col=None)


    def time_map(quarter_hour, date):
        if quarter_hour > 72:
            return date + timedelta(minutes=(quarter_hour % 73)*15)
        return date + timedelta(hours=6, minutes=(quarter_hour-1)*15)

    map_date  = {}

    init_date = datetime(year=2012, month=1, day=1)

    for x in df.StartDateDimID.unique():
        map_date[x] = init_date + timedelta(days=int(x)-1)

    #this is the part of my code that is likely bogging things down
    df['datetime'] = df.apply(lambda row: time_map(int(row['QuarterHourDimID']),
                                                   map_date[row['StartDateDimID']]),
                              axis=1)
    if network == 1278:
        df = df.loc[df.groupby('datetime')['Impression'].idxmin()]

    df = df.set_index(['datetime'])

    return df
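
(As an aside, before the SQL rewrite below: the same date arithmetic can be done column-wise instead of with a per-row apply. A minimal sketch, assuming df has been read in as in the function above:)

import numpy as np

# vectorized equivalent of map_date + time_map, with no per-row apply
init_date = pd.Timestamp(2012, 1, 1)
dates = init_date + pd.to_timedelta(df['StartDateDimID'].astype(int) - 1, unit='D')

qh = df['QuarterHourDimID'].astype(int)
# same branching as time_map(): slots above 72 wrap past midnight,
# slots 1-72 start at 6:00 AM
minutes = pd.Series(np.where(qh > 72, (qh % 73) * 15, 360 + (qh - 1) * 15),
                    index=df.index)
df['datetime'] = dates + pd.to_timedelta(minutes, unit='min')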

Just to give an example of doing the date-time conversion in SQL rather than in pandas with the time_map function: with the code above, average execution time was 6.4 s per loop, and by rewriting it entirely in SQL I got the average execution time down to 640 ms per loop.

Updated code:

import pandas as pd
import pyodbc

SQL_QUERY = """
SELECT [Impressions] = MIN(naf.Impression), [datetime] = DATEADD(minute,td.Minute,DATEADD(hour,td.Hour,CONVERT(smalldatetime, ddt.DateKey))) 
FROM [dbo].[NielsenAnalyticsFact] AS naf
LEFT JOIN [dbo].[DateDim] AS ddt
ON naf.StartDateDimID = ddt.DateDimID
LEFT JOIN [dbo].[TimeDim] as td
ON naf.QuarterHourDimID = td.TimeDimID
WHERE (naf.NielsenMarketDimID = 1
    AND naf.RecordTypeDimID = 2
    AND naf.AudienceEstimateTypeDimID = 1
    AND naf.DailyOrWeeklyDimID = 1
    AND naf.RecordSequenceCodeDimID = 5
    AND naf.ViewingTypeDimID = 4
    AND naf.NetworkDimID = 1278
    AND naf.DemographicGroupDimID = 3
    AND naf.QuarterHourDimID IS NOT NULL)
GROUP BY DATEADD(minute,td.Minute,DATEADD(hour,td.Hour,CONVERT(smalldatetime, ddt.DateKey)))
ORDER BY DATEADD(minute,td.Minute,DATEADD(hour,td.Hour,CONVERT(smalldatetime, ddt.DateKey))) ASC
"""

%%timeit -n200
with pyodbc.connect(DB_CREDENTIALS) as cnxn:
    df = pd.read_sql(sql=SQL_QUERY,
            con=cnxn,
            index_col=None)
200 loops, best of 3: 613 ms per loop
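
(The rewrite above hard-codes NetworkDimID = 1278 and DemographicGroupDimID = 3; to keep the original table(network, demo) interface, the same query can take pyodbc-style ? placeholders. A sketch only, reusing the imports and DB_CREDENTIALS from above:)

# Same query as above, with the two hard-coded IDs replaced by '?' placeholders
# so the original table(network, demo) interface still works.
PARAM_QUERY = """
SELECT [Impressions] = MIN(naf.Impression), [datetime] = DATEADD(minute,td.Minute,DATEADD(hour,td.Hour,CONVERT(smalldatetime, ddt.DateKey)))
FROM [dbo].[NielsenAnalyticsFact] AS naf
LEFT JOIN [dbo].[DateDim] AS ddt
ON naf.StartDateDimID = ddt.DateDimID
LEFT JOIN [dbo].[TimeDim] as td
ON naf.QuarterHourDimID = td.TimeDimID
WHERE (naf.NielsenMarketDimID = 1
    AND naf.RecordTypeDimID = 2
    AND naf.AudienceEstimateTypeDimID = 1
    AND naf.DailyOrWeeklyDimID = 1
    AND naf.RecordSequenceCodeDimID = 5
    AND naf.ViewingTypeDimID = 4
    AND naf.NetworkDimID = ?
    AND naf.DemographicGroupDimID = ?
    AND naf.QuarterHourDimID IS NOT NULL)
GROUP BY DATEADD(minute,td.Minute,DATEADD(hour,td.Hour,CONVERT(smalldatetime, ddt.DateKey)))
ORDER BY [datetime] ASC
"""

def table(network, demo):
    with pyodbc.connect(DB_CREDENTIALS) as cnxn:
        return pd.read_sql(sql=PARAM_QUERY, con=cnxn,
                           params=(network, demo), index_col=None)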

I would try to add a virtual column on the SQL Server side using SQL Server's DATEADD() function and a CASE ... WHEN ... THEN ... ELSE expression; it should be much faster.

Thanks Max, that did work. I saw a huge improvement in performance.
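
(For reference, a sketch of the kind of computed expression this suggests; the 72-slot cutoff and the 6:00 AM offset mirror time_map() above, and DateDim.DateKey is assumed from the earlier query, not confirmed schema:)

# Hedged sketch of the suggested DATEADD() / CASE ... WHEN ... THEN ... ELSE
# approach: compute the timestamp directly in the SELECT list on the server.
CASE_SQL = """
SELECT naf.[Impression],
       [datetime] = CASE
            WHEN naf.QuarterHourDimID > 72
                THEN DATEADD(minute, (naf.QuarterHourDimID % 73) * 15,
                             CONVERT(smalldatetime, ddt.DateKey))
            ELSE DATEADD(minute, 360 + (naf.QuarterHourDimID - 1) * 15,
                         CONVERT(smalldatetime, ddt.DateKey))
       END
FROM [dbo].[NielsenAnalyticsFact] AS naf
LEFT JOIN [dbo].[DateDim] AS ddt
ON naf.StartDateDimID = ddt.DateDimID
WHERE naf.QuarterHourDimID IS NOT NULL
"""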