Python SQLAlchemy - performing a bulk upsert (if exists, update, else insert) in PostgreSQL

python postgresql sqlalchemy

I'm trying to write a bulk upsert in Python using the SQLAlchemy module (not in SQL!).

I'm getting the following error on a SQLAlchemy add:

sqlalchemy.exc.IntegrityError: (IntegrityError) duplicate key value violates unique constraint "posts_pkey"
DETAIL:  Key (id)=(TEST1234) already exists.

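I have a table called posts with a primary key on the id column.

In this example, I already have a row in the db with id=TEST1234. When I attempt to db.session.add() a new posts object with the id set to TEST1234, I get the error above. I was under the impression that if the primary key already exists, the record would get updated.

How can I upsert with SQLAlchemy based on primary key alone? Is there a simple solution?

If there is not, I can always check for and delete any record with a matching id and then insert the new record, but that seems expensive for my situation, where I do not expect many updates.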

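There is an upsert-like operation in SQLAlchemy:

db.session.merge()

After I found this command I was able to perform upserts, but it is worth mentioning that this operation is very slow for a bulk "upsert".

An alternative is to take the list of primary keys you want to upsert and query the database for any matching ids: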

# Imagine that post1, post5, and post1000 are posts objects with ids 1, 5 and 1000 respectively
# The goal is to "upsert" these posts.
# we initialize a dict which maps id to the post object

my_new_posts = {1: post1, 5: post5, 1000: post1000} 

for each in posts.query.filter(posts.id.in_(my_new_posts.keys())).all():
    # Only merge those posts which already exist in the database
    db.session.merge(my_new_posts.pop(each.id))

# Only add those posts which did not exist in the database 
db.session.add_all(my_new_posts.values())

# Now we commit our modifications (merges) and inserts (adds) to the database!
db.session.commit()
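
The split performed by the snippet above (merge what already exists, add the rest) can be looked at separately from the database calls. The following is a minimal sketch in plain Python. The name split_for_upsert and its arguments are hypothetical, and existing_ids stands in for the ids returned by the posts.query.filter(...) lookup.

```python
def split_for_upsert(new_rows, existing_ids):
    """Partition rows keyed by primary key into (to_update, to_insert).

    new_rows: dict mapping primary key -> row (any object)
    existing_ids: iterable of primary keys already present in the table
    """
    remaining = dict(new_rows)  # copy so the caller's dict is not mutated
    to_update = []
    for pk in existing_ids:
        row = remaining.pop(pk, None)
        if row is not None:
            to_update.append(row)
    to_insert = list(remaining.values())
    return to_update, to_insert


# Hypothetical data: ids 1 and 5 already exist, 1000 does not.
new_posts = {1: "post1", 5: "post5", 1000: "post1000"}
to_update, to_insert = split_for_upsert(new_posts, existing_ids=[1, 5])
print(to_update)  # ['post1', 'post5']
print(to_insert)  # ['post1000']
```

Only the rows in to_update need the slower merge path. Everything in to_insert can go through add_all in one go.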

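An alternative approach using the compilation extension:

This should ensure that all insert statements behave as upserts. This implementation targets the Postgres dialect, but it should be fairly easy to modify for the MySQL dialect.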
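
As a rough illustration of the statement such a rewrite would send to PostgreSQL, the sketch below builds the INSERT ... ON CONFLICT ... DO UPDATE text for a table. It is plain Python rather than a SQLAlchemy compiler hook, and build_upsert_sql with its parameters is a hypothetical helper, not part of SQLAlchemy.

```python
def build_upsert_sql(table, columns, pk_columns):
    """Return INSERT ... ON CONFLICT ... DO UPDATE text with named parameters.

    Identifiers are interpolated as-is, so they must come from trusted
    code (for example a table definition), never from user input.
    """
    if not pk_columns:
        # ON CONFLICT () is a syntax error, so a conflict target is required.
        raise ValueError("at least one primary key column is required")
    col_list = ", ".join(columns)
    placeholders = ", ".join(f":{c}" for c in columns)
    insert = f"INSERT INTO {table} ({col_list}) VALUES ({placeholders})"
    non_pk = [c for c in columns if c not in pk_columns]
    target = ", ".join(pk_columns)
    if not non_pk:
        # Nothing left to update: skip conflicting rows instead.
        return f"{insert} ON CONFLICT ({target}) DO NOTHING"
    updates = ", ".join(f"{c} = EXCLUDED.{c}" for c in non_pk)
    return f"{insert} ON CONFLICT ({target}) DO UPDATE SET {updates}"


print(build_upsert_sql("posts", ["id", "title"], ["id"]))
# INSERT INTO posts (id, title) VALUES (:id, :title) ON CONFLICT (id) DO UPDATE SET title = EXCLUDED.title
```

The explicit check for an empty primary key list is a design choice of this sketch: a table without a primary key gives the statement nothing to use as a conflict target.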

You can leverage the on_conflict_do_update variant. Here is a simple example:

from sqlalchemy import Column, Integer, Unicode
from sqlalchemy.dialects.postgresql import insert
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Post(Base):
    """
    A simple class for demonstration
    """

    __tablename__ = "post"

    id = Column(Integer, primary_key=True)
    title = Column(Unicode)

# Prepare all the values that should be "upserted" to the DB
values = [
    {"id": 1, "title": "mytitle 1"},
    {"id": 2, "title": "mytitle 2"},
    {"id": 3, "title": "mytitle 3"},
    {"id": 4, "title": "mytitle 4"},
]

stmt = insert(Post).values(values)
stmt = stmt.on_conflict_do_update(
    # Name of the primary key constraint. PostgreSQL names it <table>_pkey by
    # default, as seen in the question's error message ("posts_pkey" for posts)
    constraint="post_pkey",

    # The columns that should be updated on conflict
    set_={
        "title": stmt.excluded.title
    }
)
# session is assumed to be an existing SQLAlchemy Session
session.execute(stmt)

See the PostgreSQL documentation for more details (for example, where the term "excluded" comes from).
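
To see what excluded refers to without a PostgreSQL server, the following sketch runs the same kind of statement against an in-memory SQLite database through Python's sqlite3 module. Treat it as a stand-in under stated assumptions: SQLite (3.24 or newer) accepts an ON CONFLICT ... DO UPDATE clause similar to PostgreSQL's, and the table and rows here are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE post (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO post (id, title) VALUES (1, 'old title')")

upsert = (
    "INSERT INTO post (id, title) VALUES (:id, :title) "
    # excluded is the row that was proposed for insertion but hit the conflict
    "ON CONFLICT (id) DO UPDATE SET title = excluded.title"
)

rows = [
    {"id": 1, "title": "new title"},  # conflicts with the existing row -> update
    {"id": 2, "title": "mytitle 2"},  # no conflict -> plain insert
]
conn.executemany(upsert, rows)
conn.commit()

result = conn.execute("SELECT id, title FROM post ORDER BY id").fetchall()
print(result)  # [(1, 'new title'), (2, 'mytitle 2')]
```

The existing row keeps its id and takes the title from the rejected insert, which is the behavior the set_ argument expresses in the SQLAlchemy version.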

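A side note about duplicated column names: the code above uses the column names as dict keys both in the values list and in the set_ argument. If a column name is changed in the class definition, it has to be changed everywhere or the code will break. This can be avoided by accessing the column definitions, which makes the code a bit uglier but more robust: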

coldefs = Post.__table__.c

values = [
    {coldefs.id.name: 1, coldefs.title.name: "mytitle 1"},
    ...
]

stmt = stmt.on_conflict_do_update(
    # ...
    set_={
        coldefs.title.name: stmt.excluded.title,
        # ...
    }
)

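This is not the safest method, but it is very simple and fast. I only wanted to selectively overwrite part of a table. I deleted the rows that I knew would conflict and then appended the new rows from a pandas DataFrame. The DataFrame column names need to match the SQL table column names.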

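from sqlalchemy import create_engine, text

eng = create_engine('postgresql://...')

# val selects the rows to overwrite; df is the pandas DataFrame with the new rows
with eng.begin() as conn:
    # SQLAlchemy 2.0 requires textual SQL to be wrapped in text()
    conn.execute(text("DELETE FROM my_table WHERE col = :val"), {"val": val})
    # Reusing conn keeps the delete and the append in one transaction
    df.to_sql('my_table', con=conn, if_exists='append')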
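
One way to reduce the risk of this delete-then-append pattern is to run both steps in a single transaction, so that a failed insert does not leave the old rows deleted. The sketch below shows the idea with the standard-library sqlite3 module instead of PostgreSQL and pandas. The function replace_rows and the table layout are hypothetical.

```python
import sqlite3


def replace_rows(conn, col_value, new_rows):
    """Delete rows matching col_value and insert new_rows atomically."""
    # The connection context manager commits on success and rolls back
    # if any statement inside the block raises.
    with conn:
        conn.execute("DELETE FROM my_table WHERE col = ?", (col_value,))
        conn.executemany(
            "INSERT INTO my_table (col, payload) VALUES (?, ?)", new_rows
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (col TEXT, payload TEXT NOT NULL)")
conn.execute("INSERT INTO my_table VALUES ('a', 'old'), ('b', 'keep')")
conn.commit()

replace_rows(conn, "a", [("a", "new 1"), ("a", "new 2")])

try:
    # The NULL payload violates NOT NULL, so the delete is rolled back too.
    replace_rows(conn, "b", [("b", None)])
except sqlite3.IntegrityError:
    pass

rows = conn.execute(
    "SELECT col, payload FROM my_table ORDER BY col, payload"
).fetchall()
print(rows)  # [('a', 'new 1'), ('a', 'new 2'), ('b', 'keep')]
```

The second call fails halfway through, yet the 'b' row survives because the delete and the insert share one transaction.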

I started looking into this and I think I have found a fairly efficient way to do upserts in SQLAlchemy by mixing bulk_insert_mappings and bulk_update_mappings instead of using merge.

import time

from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import scoped_session, sessionmaker
from contextlib import contextmanager


engine = None
Session = sessionmaker()
Base = declarative_base()


def creat_new_database(db_name="sqlite:///bulk_upsert_sqlalchemy.db"):
    global engine
    engine = create_engine(db_name, echo=False)
    local_session = scoped_session(Session)
    local_session.remove()
    local_session.configure(bind=engine, autoflush=False, expire_on_commit=False)
    Base.metadata.drop_all(engine)
    Base.metadata.create_all(engine)


@contextmanager
def db_session():
    local_session = scoped_session(Session)
    session = local_session()

    session.expire_on_commit = False

    try:
        yield session
    except BaseException:
        session.rollback()
        raise
    finally:
        session.close()


class Customer(Base):
    __tablename__ = "customer"
    id = Column(Integer, primary_key=True)
    name = Column(String(255))


def bulk_upsert_mappings(customers):

    entries_to_update = []
    entries_to_put = []
    with db_session() as sess:
        t0 = time.time()

        # Find all customers that need to be updated and build mappings
        for each in (
            sess.query(Customer.id).filter(Customer.id.in_(customers.keys())).all()
        ):
            customer = customers.pop(each.id)
            entries_to_update.append({"id": customer["id"], "name": customer["name"]})

        # Bulk mappings for everything that needs to be inserted
        for customer in customers.values():
            entries_to_put.append({"id": customer["id"], "name": customer["name"]})

        sess.bulk_insert_mappings(Customer, entries_to_put)
        sess.bulk_update_mappings(Customer, entries_to_update)
        sess.commit()

    print(
        "Total time for upsert with MAPPING update "
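        # customers was shrunk by pop() above, so count the processed entries
        + str(len(entries_to_put) + len(entries_to_update))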
        + " records "
        + str(time.time() - t0)
        + " sec"
        + " inserted : "
        + str(len(entries_to_put))
        + " - updated : "
        + str(len(entries_to_update))
    )


def bulk_upsert_merge(customers):

    entries_to_update = 0
    entries_to_put = []
    with db_session() as sess:
        t0 = time.time()

        # Find all customers that need to be updated and merge
        for each in (
            sess.query(Customer.id).filter(Customer.id.in_(customers.keys())).all()
        ):
            values = customers.pop(each.id)
            sess.merge(Customer(id=values["id"], name=values["name"]))
            entries_to_update += 1

        # Bulk mappings for everything that needs to be inserted
        for customer in customers.values():
            entries_to_put.append({"id": customer["id"], "name": customer["name"]})

        sess.bulk_insert_mappings(Customer, entries_to_put)
        sess.commit()

    print(
        "Total time for upsert with MERGE update "
        # customers was shrunk by pop() above, so count the processed entries
        + str(len(entries_to_put) + entries_to_update)
        + " records "
        + str(time.time() - t0)
        + " sec"
        + " inserted : "
        + str(len(entries_to_put))
        + " - updated : "
        + str(entries_to_update)
    )


if __name__ == "__main__":

    batch_size = 10000

    # Only inserts
    customers_insert = {
        i: {"id": i, "name": "customer_" + str(i)} for i in range(batch_size)
    }

    # 50/50 inserts and updates
    customers_upsert = {
        i: {"id": i, "name": "customer_2_" + str(i)}
        for i in range(int(batch_size / 2), batch_size + int(batch_size / 2))
    }

    creat_new_database()
    bulk_upsert_mappings(customers_insert.copy())
    bulk_upsert_mappings(customers_upsert.copy())
    bulk_upsert_mappings(customers_insert.copy())

    creat_new_database()
    bulk_upsert_merge(customers_insert.copy())
    bulk_upsert_merge(customers_upsert.copy())
    bulk_upsert_merge(customers_insert.copy())

The benchmark results are as follows:

Total time for upsert with MAPPING: 0.17138004302978516 sec inserted : 10000 - updated : 0
Total time for upsert with MAPPING: 0.22074174880981445 sec inserted : 5000 - updated : 5000
Total time for upsert with MAPPING: 0.22307634353637695 sec inserted : 0 - updated : 10000
Total time for upsert with MERGE: 0.1724097728729248 sec inserted : 10000 - updated : 0
Total time for upsert with MERGE: 7.852903842926025 sec inserted : 5000 - updated : 5000
Total time for upsert with MERGE: 15.11970829963684 sec inserted : 0 - updated : 10000
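
To put these numbers in perspective, the ratio between the two strategies can be computed directly from the reported timings. The snippet below only does that arithmetic. The timings are copied from the benchmark output and rounded to four decimals, and the dictionary keys are labels chosen here for the three runs.

```python
# Timings in seconds, copied from the benchmark output (rounded).
mapping = {"insert only": 0.1714, "50/50": 0.2207, "update only": 0.2231}
merge = {"insert only": 0.1724, "50/50": 7.8529, "update only": 15.1197}

ratios = {case: merge[case] / mapping[case] for case in mapping}
for case, ratio in ratios.items():
    print(f"{case}: merge takes about {ratio:.0f}x as long as the mappings approach")
```

In these runs the two approaches are on par for pure inserts, while merge is roughly 36 times slower for the 50/50 batch and roughly 68 times slower when every row is an update.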

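merge does not handle IntegrityError.

The above process is too slow to be usable.

Merge doesn't help if you get a duplicate key error on a unique index; it only works for primary keys.

Merge has no tegridy.

What makes that a duplicate if the original question doesn't mention SQLAlchemy?

I get this error when using that snippet:

sqlalchemy.exc.ProgrammingError: (psycopg2.errors.SyntaxError) syntax error, LINE 1: ...on) VALUES ('US^Wyoming^Albany', '') ON CONFLICT () DO UPDATE...

Ah, nice catch! This will not work if your table has no primary key. Let me add a patch.

Actually, I don't know why you would need this without a primary key. Can you elaborate on the problem?

Turning all inserts into upserts is risky. Sometimes you need to get an integrity error, for data consistency and to avoid accidental overwrites. I would only use this solution if you are 120% aware of all the implications it has!

My constraint="post_pkey" code fails because sqlalchemy can't find the unique constraint that I created in raw SQL with CREATE UNIQUE INDEX post_pkey... and then loaded into sqlalchemy with metadata.reflect(eng, only="my_table"). After that I received the warning base.py:3515: SAWarning: Skipped unsupported reflection of expression-based index post_pkey. Any tips on how to fix this?

@user1071182 I think it would be better to post this as a separate question. That would allow you to add more details. Without seeing the full CREATE INDEX statement it is hard to guess what is going wrong here. I can't promise anything, though, as I have not yet worked with partial indexes in SQLAlchemy. But maybe someone else will have a solution.

This is the cleanest solution to this problem!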
