Python pandas upsert into PostgreSQL


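I have a daily data pipeline that reads files and writes their data to a Postgres database. Some of these files can contain a mix of new and old data.

My current process is to read the file into a DataFrame, dump the DataFrame into a staging table, and then run an upsert. This works, but I may be updating rows with exactly the same values. I could compare every column in a row to see whether any value is new, but that seems like more trouble than it's worth (some tables have 30+ columns).

Is there a better or more optimized way to do this? I feel that the larger my tables or source files get, the longer this process will take. Here is some sample code for my process: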

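from sqlalchemy import text


def example_import(file_path, engine):
    '''
    ETL example: stage the downloaded file, then upsert it into the target table.
    '''

    # download_from_ftp is my own helper (not shown); it returns a DataFrame
    df_demo = download_from_ftp(file_path)

    if df_demo.empty:
        return

    # some cleaning might happen here

    # Replace the staging table with the contents of the incoming file
    df_demo.to_sql(con=engine, name='tbl', schema='staging',
                   if_exists='replace', index=False, method='multi')

    with open('/somefolder/upsert.sql') as fp:
        upsert = fp.read()

    # engine.execute() no longer exists in SQLAlchemy 2.0, so run the
    # statement on a connection inside a transaction instead
    with engine.begin() as conn:
        result = conn.execute(text(upsert))
        print('Inserted/Updated {} Records From {}'.format(result.rowcount, file_path))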
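
Sample data

Current table (the real one has many more columns)

Incoming data (a mix of old and new; some existing PKs have new values)

Data after the upsert

Upsert example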

INSERT INTO public."tbl" ("colA",
                                 "colB",
                                 "colC",
                                 "colD",
                                 "colE",
                                 "colF",
                                 "colG",
                                 "colH",
                                 "colI",
                                 "colJ",
                                 "colK",
                                 "colL",
                                 "colM",
                                 "colN",
                                 "colO",
                                 "colP",
                                 "colQ",
                                 "colR",
                                 "colS",
                                 "colT",
                                 "colU",
                                 "colV")
SELECT "colA",
         "colB",
         "colC",
         "colD",
         "colE"::DATE,
         "colF",
         "colG"::DATE,
         "colH",
         "colI"::DATE,
         "colJ"::DATE,
         "colK",
         "colL",
         "colM",
         "colN",
         "colO",
         "colP",
         "colQ",
         "colR",
         "colS",
         "colT",
         "colU",
         "colV"
FROM staging."tbl"
ON CONFLICT ("colA") DO UPDATE 
    SET "colB" = excluded."colB",
         "colC" = excluded."colC",
         "colD" = excluded."colD",
         "colE" = excluded."colE",
         "colF" = excluded."colF",
         "colG" = excluded."colG",
         "colH" = excluded."colH",
         "colI" = excluded."colI",
         "colJ" = excluded."colJ",
         "colK" = excluded."colK",
         "colL" = excluded."colL",
         "colM" = excluded."colM",
         "colN" = excluded."colN",
         "colO" = excluded."colO",
         "colP" = excluded."colP",
         "colQ" = excluded."colQ",
         "colR" = excluded."colR",
         "colS" = excluded."colS",
         "colT" = excluded."colT",
         "colU" = excluded."colU",
         "colV" = excluded."colV"
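
One way to avoid rewriting rows whose values are identical, without hand-writing a 30-column comparison, is to generate the statement and append a `WHERE ... IS DISTINCT FROM ...` clause to `DO UPDATE`. The following is only a sketch under assumptions: `build_upsert` and `quote_ident` are hypothetical helper names, the table names are passed in already qualified and quoted, and they must come from trusted code rather than user input.

```python
from typing import Mapping, Optional, Sequence


def quote_ident(name: str) -> str:
    """Double-quote an SQL identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_upsert(
    target: str,
    staging: str,
    columns: Sequence[str],
    key_columns: Sequence[str],
    casts: Optional[Mapping[str, str]] = None,
) -> str:
    """Build an INSERT ... ON CONFLICT DO UPDATE that skips unchanged rows.

    target / staging: trusted, already-quoted names such as 'public."tbl"'.
    casts: optional column -> type casts applied when reading from staging.
    """
    casts = casts or {}
    update_cols = [c for c in columns if c not in key_columns]
    if not update_cols:
        raise ValueError("at least one non-key column is required")

    insert_list = ", ".join(quote_ident(c) for c in columns)
    select_list = ", ".join(
        quote_ident(c) + ("::" + casts[c] if c in casts else "") for c in columns
    )
    conflict_list = ", ".join(quote_ident(c) for c in key_columns)
    set_list = ", ".join(
        "{0} = excluded.{0}".format(quote_ident(c)) for c in update_cols
    )
    # "t" is an alias for the existing row; "excluded" is the incoming row
    current_row = ", ".join("t." + quote_ident(c) for c in update_cols)
    incoming_row = ", ".join("excluded." + quote_ident(c) for c in update_cols)

    return (
        f"INSERT INTO {target} AS t ({insert_list})\n"
        f"SELECT {select_list}\n"
        f"FROM {staging}\n"
        f"ON CONFLICT ({conflict_list}) DO UPDATE\n"
        f"SET {set_list}\n"
        # Only touch the row when at least one non-key value differs
        f"WHERE ({current_row}) IS DISTINCT FROM ({incoming_row})"
    )


if __name__ == "__main__":
    print(build_upsert(
        'public."tbl"', 'staging."tbl"',
        columns=["colA", "colB", "colE"],
        key_columns=["colA"],
        casts={"colE": "DATE"},
    ))
```

`IS DISTINCT FROM` treats NULLs as comparable values, so a NULL replaced by a value (or the reverse) still counts as a change. With this clause, rows that conflict but carry identical values are left alone, so they should no longer be included in the reported row count.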

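If a primary key is defined, you can use ON CONFLICT ... DO NOTHING on the target table.

What about the values I actually need to update?

Assume that two rows with the same values have the same primary key, i.e. a natural key is used as the primary key. The key can be based on multiple columns. Sample data would be helpful.

Added some sample data.

As far as I understand, you only want to update the fields that changed. Unfortunately, that is not really possible, because Postgres writes a whole new tuple/row on every update. You might want to split the main table into smaller tables, each holding the subset of fields you are trying to update.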
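
Since an update always rewrites the whole row, the practical saving is in not sending unchanged rows at all. A possible sketch, which is not part of the original setup, is to keep one hash per primary key and stage only rows whose hash differs. The names `row_hash` and `rows_to_stage`, and the idea of an `existing_hashes` lookup loaded from the database, are assumptions made for illustration.

```python
import hashlib
import json
from typing import Dict, Hashable, Iterable, List, Mapping, Sequence


def row_hash(row: Mapping[str, object], columns: Sequence[str]) -> str:
    """Hash the values of the given columns, in a fixed column order."""
    # default=str lets dates and other non-JSON types be serialized
    payload = json.dumps([row.get(c) for c in columns], default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def rows_to_stage(
    incoming: Iterable[Mapping[str, object]],
    existing_hashes: Mapping[Hashable, str],
    key: str,
    columns: Sequence[str],
) -> List[Dict[str, object]]:
    """Keep only rows that are new or whose non-key values changed."""
    changed = []
    for row in incoming:
        # A missing key (new row) or a different hash (changed row) qualifies
        if existing_hashes.get(row[key]) != row_hash(row, columns):
            changed.append(dict(row))
    return changed


if __name__ == "__main__":
    value_columns = ["colB", "colC"]
    # Hypothetical: hashes previously stored next to each primary key
    existing = {1: row_hash({"colB": "x", "colC": 10}, value_columns)}
    incoming = [
        {"colA": 1, "colB": "x", "colC": 10},  # unchanged
        {"colA": 2, "colB": "y", "colC": 20},  # new key
    ]
    print(rows_to_stage(incoming, existing, "colA", value_columns))
```

The trade-off is an extra stored value per row and the need to hash values consistently (for example, the same formatting for dates and numbers on every run), so this is only worth it when most incoming rows are unchanged.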
。我需要更新的实际值如何?假设具有相同值的两行具有相同的主键--自然键作为主键。键可以基于多个列。示例数据会很有帮助。添加了一些示例数据。据我所知,您只想更新一些更改后的字段。不幸的是,这不太可能,因为postgres每次更新都会更改整个元组/行。您可能希望将主表拆分为较小的表,其中包含您试图实现的字段子集。