Python: comparing two tables faster (Postgres/SQLAlchemy)

Tags: python, postgresql, sqlalchemy, postgresql-9.6

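I wrote some Python code to manipulate tables in my database, using SQLAlchemy. I have Table1 with 2.5 million entries and Table2 with 200,000 entries. What I'm trying to do is compare the source IP and destination IP in Table1 with the IPs in Table2. If there is a match, I replace the source and destination IP from Table1 with the matching data from Table2 and add an entry to Table3. My code also checks whether the entry is already in the new table; if it is, it skips it and moves on to the next row. My problem is that it is extremely slow. I launched my script yesterday, and in 24 hours it has only gone through 47,000 of the 2.5 million entries. Is there any way to speed this up? This is a Postgres database, and I can't tell whether it is reasonable for the script to take this long or whether something is wrong. If anyone has had a similar experience, how long did it take to finish? Thanks a lot.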

session = Session()
i = 0
start_id = 1
flows = session.query(Table1).filter(Table1.id >= start_id).all()
result_number = len(flows)
vlan_list = {"['0050']", "['0130']", "['0120']", "['0011']", "['0110']"}
while i < result_number:
    for flow in flows:
        if flow.vlan_destination in vlan_list:
            usage = session.query(Table2).filter(
                Table2.ip == str(flow.ip_destination)).all()
            if len(usage) > 0:
                usage = usage[0].usage
            else:
                usage = str(flow.ip_destination)
            usage_ip_src = session.query(Table2).filter(
                Table2.ip == str(flow.ip_source)).all()
            if len(usage_ip_src) > 0:
                usage_ip_src = usage_ip_src[0].usage
            else:
                usage_ip_src = str(flow.ip_source)
            if flow.protocol == "17":
                protocol = func.REPLACE(flow.protocol, "17", 'UDP')
            elif flow.protocol == "1":
                protocol = func.REPLACE(flow.protocol, "1", 'ICMP')
            elif flow.protocol == "6":
                protocol = func.REPLACE(flow.protocol, "6", 'TCP')
            else:
                protocol = flow.protocol
            is_in_db = session.query(Table3).filter(Table3.protocol == protocol)\
                .filter(Table3.application == flow.application)\
                .filter(Table3.destination_port == flow.destination_port)\
                .filter(Table3.vlan_destination == flow.vlan_destination)\
                .filter(Table3.usage_source == usage_ip_src)\
                .filter(Table3.state == flow.state)\
                .filter(Table3.usage_destination == usage).count()
            if is_in_db == 0:
                to_add = Table3(usage_ip_src, usage, protocol, flow.application, flow.destination_port,
                                flow.vlan_destination, flow.state)
                session.add(to_add)
                session.flush()
                session.commit()
                print("added " + str(i))
            else:
                print("usage already in DB")
        i = i + 1

session.close()

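Going by the question as it currently stands, I think this is at least close to what you're after. The idea is to perform the entire operation in the database, instead of fetching everything (all 2,500,000 rows) and filtering it in Python: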

from sqlalchemy import case, func, insert
from sqlalchemy.orm import aliased


def newhotness(session, vlan_list):
    # The query needs to join Table2 twice, so it has to be aliased
    dst = aliased(Table2)
    src = aliased(Table2)

    # Prepare required SQL expressions
    usage = func.coalesce(dst.usage, Table1.ip_destination)
    usage_ip_src = func.coalesce(src.usage, Table1.ip_source)
    protocol = case({"17": "UDP",
                     "1": "ICMP",
                     "6": "TCP"},
                    value=Table1.protocol,
                    else_=Table1.protocol)

    # Form a query producing the data to insert to Table3
    flows = session.query(
            usage_ip_src,
            usage,
            protocol,
            Table1.application,
            Table1.destination_port,
            Table1.vlan_destination,
            Table1.state).\
        outerjoin(dst, dst.ip == Table1.ip_destination).\
        outerjoin(src, src.ip == Table1.ip_source).\
        filter(Table1.vlan_destination.in_(vlan_list),
               ~session.query(Table3).
                   filter_by(usage_source=usage_ip_src,
                             usage_destination=usage,
                             protocol=protocol,
                             application=Table1.application,
                             destination_port=Table1.destination_port,
                             vlan_destination=Table1.vlan_destination,
                             state=Table1.state).
                   exists())

    stmt = insert(Table3).from_select(
        ["usage_source", "usage_destination", "protocol", "application",
         "destination_port", "vlan_destination", "state"],
        flows)

    return session.execute(stmt)
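
To make the shape of that single statement more concrete, here is a rough hand-written SQL equivalent that can be run without a PostgreSQL server. It is only a sketch under assumptions: it uses an in-memory SQLite database from the standard library as a stand-in, the lower-case table names, the all-text column types and the sample rows are made up for illustration, and it is not the exact SQL that SQLAlchemy would emit. It also applies DISTINCT to the selected flows, so repeated flows in table1 are inserted only once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified schema: column names follow the question, types are guesses.
conn.executescript("""
    CREATE TABLE table1 (id INTEGER PRIMARY KEY, ip_source TEXT, ip_destination TEXT,
                         protocol TEXT, application TEXT, destination_port TEXT,
                         vlan_destination TEXT, state TEXT);
    CREATE TABLE table2 (ip TEXT, usage TEXT);
    CREATE TABLE table3 (usage_source TEXT, usage_destination TEXT, protocol TEXT,
                         application TEXT, destination_port TEXT,
                         vlan_destination TEXT, state TEXT);
""")
conn.executemany("INSERT INTO table2 VALUES (?, ?)",
                 [("10.0.0.1", "web"), ("10.0.0.2", "db")])
conn.executemany(
    "INSERT INTO table1 (ip_source, ip_destination, protocol, application,"
    " destination_port, vlan_destination, state) VALUES (?, ?, ?, ?, ?, ?, ?)",
    [("10.0.0.1", "10.0.0.2", "6", "pg", "5432", "['0050']", "ok"),
     ("10.0.0.1", "10.0.0.2", "6", "pg", "5432", "['0050']", "ok"),  # repeated flow
     ("10.0.0.9", "10.0.0.2", "17", "dns", "53", "['0130']", "ok"),  # source not in table2
     ("10.0.0.1", "10.0.0.2", "1", "ping", "0", "['9999']", "ok")])  # VLAN not selected


def copy_flows(conn, vlan_list):
    """Insert all new flows with one statement; return the row count of table3."""
    placeholders = ", ".join("?" for _ in vlan_list)
    sql = f"""
        INSERT INTO table3 (usage_source, usage_destination, protocol, application,
                            destination_port, vlan_destination, state)
        SELECT f.usage_source, f.usage_destination, f.protocol, f.application,
               f.destination_port, f.vlan_destination, f.state
        FROM (
            SELECT DISTINCT
                COALESCE(src.usage, t1.ip_source) AS usage_source,
                COALESCE(dst.usage, t1.ip_destination) AS usage_destination,
                CASE t1.protocol WHEN '17' THEN 'UDP' WHEN '1' THEN 'ICMP'
                                 WHEN '6' THEN 'TCP' ELSE t1.protocol
                END AS protocol,
                t1.application AS application,
                t1.destination_port AS destination_port,
                t1.vlan_destination AS vlan_destination,
                t1.state AS state
            FROM table1 AS t1
            LEFT JOIN table2 AS dst ON dst.ip = t1.ip_destination
            LEFT JOIN table2 AS src ON src.ip = t1.ip_source
            WHERE t1.vlan_destination IN ({placeholders})
        ) AS f
        WHERE NOT EXISTS (
            SELECT 1 FROM table3 AS t3
            WHERE t3.usage_source = f.usage_source
              AND t3.usage_destination = f.usage_destination
              AND t3.protocol = f.protocol
              AND t3.application = f.application
              AND t3.destination_port = f.destination_port
              AND t3.vlan_destination = f.vlan_destination
              AND t3.state = f.state
        )
    """
    conn.execute(sql, list(vlan_list))
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM table3").fetchone()[0]


vlans = ["['0050']", "['0130']"]
first_count = copy_flows(conn, vlans)   # inserts the distinct, not yet present flows
second_count = copy_flows(conn, vlans)  # nothing new: NOT EXISTS filters everything out
rows = conn.execute(
    "SELECT usage_source, usage_destination, protocol FROM table3 ORDER BY protocol"
).fetchall()
```

Running the function a second time leaves table3 unchanged, which is the behaviour the per-row "is it already in the DB" check in the question was trying to achieve, but without a round trip per row.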

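If vlan_list is selective, or in other words filters out most rows, this will perform a lot fewer operations in the database. Depending on the size of Table2 you may benefit from an index on Table2.ip, but do test first. If it is relatively small, I would guess that PostgreSQL will perform a hash or nested loop join there. If some set of the columns used to filter out duplicates in Table3 is unique, you could perform an INSERT ... ON CONFLICT ... DO NOTHING instead of removing duplicates in the SELECT with the NOT EXISTS subquery expression (which PostgreSQL will execute as an antijoin). If there is a possibility that the flows query produces duplicates, add a call to Query.distinct() to it.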
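
As a sketch of the ON CONFLICT variant, the snippet below lets a unique index reject the duplicates instead of a NOT EXISTS subquery. Several things here are assumptions for illustration only: SQLite (3.24 or newer) from the standard library stands in for PostgreSQL, the `staged` table holding already resolved flows and the index name `uq_table3_flow` are hypothetical, and only three of the Table3 columns are shown to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staged (usage_source TEXT, usage_destination TEXT, protocol TEXT);
    CREATE TABLE table3 (usage_source TEXT, usage_destination TEXT, protocol TEXT);
    -- Hypothetical unique index over the columns that define a duplicate.
    CREATE UNIQUE INDEX uq_table3_flow
        ON table3 (usage_source, usage_destination, protocol);
""")
conn.executemany("INSERT INTO staged VALUES (?, ?, ?)",
                 [("web", "db", "TCP"),
                  ("web", "db", "TCP"),        # repeated within the batch
                  ("10.0.0.9", "db", "UDP")])
conn.execute("INSERT INTO table3 VALUES ('web', 'db', 'TCP')")  # already present


def insert_new_flows(conn):
    """Insert staged flows, silently skipping rows that violate the unique index."""
    conn.execute("""
        INSERT INTO table3 (usage_source, usage_destination, protocol)
        SELECT usage_source, usage_destination, protocol
        FROM staged
        WHERE true  -- SQLite needs a WHERE clause here to parse ON CONFLICT
        ON CONFLICT DO NOTHING
    """)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM table3").fetchone()[0]


count_after_first = insert_new_flows(conn)   # only the UDP flow is new
count_after_second = insert_new_flows(conn)  # every row conflicts, nothing is added
```

With SQLAlchemy, the PostgreSQL dialect's own `insert()` construct (`sqlalchemy.dialects.postgresql.insert`) offers an `on_conflict_do_nothing()` method that can be chained after `from_select()`; whether it fits depends on Table3 actually having a suitable unique constraint.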
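
Comments:

Hello, and thanks for reaching out! I'll add an example of the values I expect in my new table. I did that to signal my code to stop running once it reaches the end of Table1. OK, so should I change my line and use something like flows = session.query(Table1.id >= start_id).limit(100).all()?

Edited my post; let me know if more details are needed.

I'm with Ilja. Even after the edit, your question still doesn't contain enough detail to give an informed answer. To me this also looks like an XY problem. Maybe try looking at the problem from another angle and revise the way you ask the question. Also, posting the models as screenshots is not ideal. You even left out some of the columns used in the code. I see plenty of places where this could be improved, but without a clear expected result it is impossible to answer.

I edited my post, trying to rethink the problem I'm facing. I added a diagram showing how I think about it, along with all the columns used in the code. Hopefully it does the job better!

Just by adding a distinct to the query to avoid duplicates, my code now runs in an instant instead of hours. Thank you very much!!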