Python: Why does EC2 take so long to execute these SQLite queries?

python django amazon-ec2 sqlite

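I am working on a script that one of my teammates wrote to build a SQLite3 database locally on his machine. I made some changes so that we can use it in a Django application to update the database with new data uploaded by users. The application lets a user upload a zip file containing multiple well-formed csv files, and adds the information in the csvs to the database. Here are the relevant parts of the code:

update_db.py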

import glob, sqlite3, pandas, timeit, re

def upload_files(csv_files):

    conn = sqlite3.connect('/path/to/my_db.db')
    c = conn.cursor()

    added_tables = []

    for row in c.execute("SELECT name FROM sqlite_master WHERE type='table'"):
        # each row is a 1-tuple; take the name and strip non-word characters
        table_name = re.sub(r'\W+', '', row[0])
        added_tables.append(table_name)

    for csv_filename in csv_files.namelist():
        if csv_filename.endswith('.csv'):
            csv_file = csv_files.open(csv_filename)

            # extract the team name from the path inside the zip and strip
            # every non-word character so it is usable as a table name
            table_name = csv_filename.rsplit('/', 2)[1]
            table_name = re.sub(r'\W+', '', table_name)


            try:
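                # note: error_bad_lines was removed in pandas 2.0;
                # newer versions use on_bad_lines='skip' instead
                df = pandas.read_csv(csv_file, error_bad_lines=False)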
                df.to_sql(table_name, conn, if_exists='append', index=False)

                if table_name not in added_tables:
                    # add necessary columns

                    c.execute('alter table ' + str(table_name) + ' add team_BASEDOWN integer;')
                    c.execute('alter table ' + str(table_name) + ' add team_FIELDPOSITION integer;')
                    c.execute('alter table ' + str(table_name) + ' add team_HEADCOACH text;')
                    c.execute('alter table ' + str(table_name) + ' add team_OFFCOOR text;')
                    c.execute('alter table ' + str(table_name) + ' add team_DEFFCOOR text;')
                    added_tables.append(table_name)

                # set basedown
                c.execute('update ' + str(table_name) + ' set team_BASEDOWN = 0 where pff_DOWN = 1 or (pff_DOWN = 2 and pff_DISTANCE <= 6);')
                c.execute('update ' + str(table_name) + ' set team_BASEDOWN = 1 where pff_DOWN = 2 and pff_DISTANCE >= 7;')
                c.execute('update ' + str(table_name) + ' set team_BASEDOWN = 2 where pff_DOWN = 3 and pff_DISTANCE <= 2;')
                c.execute('update ' + str(table_name) + ' set team_BASEDOWN = 3 where pff_DOWN = 3 and pff_DISTANCE = 3;')
                c.execute('update ' + str(table_name) + ' set team_BASEDOWN = 4 where pff_DOWN = 3 and pff_DISTANCE >= 4 and pff_DISTANCE <= 6;')
                c.execute('update ' + str(table_name) + ' set team_BASEDOWN = 5 where pff_DOWN = 3 and pff_DISTANCE >= 7;')
                c.execute('update ' + str(table_name) + ' set team_BASEDOWN = 6 where pff_DOWN = 4;')

                # set fieldposition
                c.execute('update ' + str(table_name) + ' set team_FIELDPOSITION = 0 where pff_FIELDPOSITION <= -1 and pff_FIELDPOSITION >= -10;')
                c.execute('update ' + str(table_name) + ' set team_FIELDPOSITION = 1 where pff_FIELDPOSITION <= -11 or (pff_FIELDPOSITION >= 20 and pff_FIELDPOSITION <= 50);')
                c.execute('update ' + str(table_name) + ' set team_FIELDPOSITION = 2 where pff_FIELDPOSITION >= 12 and pff_FIELDPOSITION <= 20;')
                c.execute('update ' + str(table_name) + ' set team_FIELDPOSITION = 3 where pff_FIELDPOSITION >= 6 and pff_FIELDPOSITION <= 11;')
                c.execute('update ' + str(table_name) + ' set team_FIELDPOSITION = 4 where pff_FIELDPOSITION <= 5;')

            except pandas.errors.EmptyDataError as ex:
                print(str(csv_file) + ' was empty; continuing...')
                continue

    conn.commit()
    conn.close()

And the Django view that calls it:

from django.shortcuts import render
from django.db import connections
from django.db.utils import OperationalError
from django.http import HttpResponse
from django.template import loader
from django.conf import settings
from django.utils.encoding import smart_str
from webapp.update_db import upload_files
from threading import Thread
import numpy as np
import zipfile

def upload(request):
    # .get() avoids a KeyError when the form is posted without a file
    if request.method == 'POST' and request.FILES.get('myfile'):
        myfile = request.FILES['myfile']
        if str(myfile.name).endswith('.zip'):
            unzipped = zipfile.ZipFile(myfile)
            upload_files(unzipped)
    return render(request, 'webapp/upload.html')

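My problem is that when I submit a zip file, the upload takes a very long time to process (roughly 12 hours for a 160 MB zip file). I suspect the SQL queries could be more efficient, but my teammate says that when he runs the script locally, building the entire database takes only about 45 minutes (and that is far more data than the "updates" we expect), so I wonder whether something odd is going on with the EC2 instance that runs the application. I checked CPU utilization on the instance: it stays at about 20% on average for as long as the update script runs, with no notable peaks or troughs. I am not sure what changes between running this locally and running it on EC2, so any suggestions for modifying the instance or the script to improve performance would be much appreciated.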
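Since flat, low CPU utilization does not show which step is slow, one way to narrow this down is to time each phase separately. The following is a minimal sketch, not the original script: the `timed` helper, the `timings` dict and the `demo` table are hypothetical names, and an in-memory database stands in for `/path/to/my_db.db`.

```python
import sqlite3
import time
from contextlib import contextmanager

# Hypothetical helper: accumulates wall-clock seconds per named phase.
timings = {}

@contextmanager
def timed(phase):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[phase] = timings.get(phase, 0.0) + time.perf_counter() - start

conn = sqlite3.connect(":memory:")  # stand-in for the real database file
c = conn.cursor()

with timed("create+insert"):
    c.execute(
        "CREATE TABLE demo (pff_DOWN INTEGER, pff_DISTANCE INTEGER, "
        "team_BASEDOWN INTEGER)"
    )
    c.executemany(
        "INSERT INTO demo (pff_DOWN, pff_DISTANCE) VALUES (?, ?)",
        [(down, dist) for down in range(1, 5) for dist in range(1, 11)],
    )

with timed("update"):
    c.execute("UPDATE demo SET team_BASEDOWN = 6 WHERE pff_DOWN = 4")

with timed("commit"):
    conn.commit()

# Slowest phase first.
for phase, seconds in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{phase}: {seconds:.6f}s")
```

In the real script the same wrapper could go around `pandas.read_csv`, `df.to_sql`, the `alter table` statements, the `update` statements and the final `commit`, which would show whether the time goes to parsing, to the queries, or to writing to disk.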

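You can update both columns, for all cases, with a single command:

UPDATE MyTable
SET team_BASEDOWN = CASE
        WHEN pff_DOWN = 1 OR (pff_DOWN = 2 AND pff_DISTANCE <= 6) THEN 0
        WHEN pff_DOWN = 2 AND pff_DISTANCE >= 7 THEN 1
        WHEN pff_DOWN = 3 AND pff_DISTANCE <= 2 THEN 2
        WHEN pff_DOWN = 3 AND pff_DISTANCE = 3 THEN 3
        WHEN pff_DOWN = 3 AND pff_DISTANCE >= 4 AND pff_DISTANCE <= 6 THEN 4
        WHEN pff_DOWN = 3 AND pff_DISTANCE >= 7 THEN 5
        WHEN pff_DOWN = 4 THEN 6
        ELSE team_BASEDOWN
    END,
    team_FIELDPOSITION = CASE
        WHEN pff_FIELDPOSITION <= -1 AND pff_FIELDPOSITION >= -10 THEN 0
        WHEN pff_FIELDPOSITION <= -11 OR (pff_FIELDPOSITION >= 20 AND pff_FIELDPOSITION <= 50) THEN 1
        WHEN pff_FIELDPOSITION >= 12 AND pff_FIELDPOSITION <= 20 THEN 2
        WHEN pff_FIELDPOSITION >= 6 AND pff_FIELDPOSITION <= 11 THEN 3
        WHEN pff_FIELDPOSITION <= 5 THEN 4
        ELSE team_FIELDPOSITION
    END;

What instance type are you running this on? It is quite likely that the instance has less capacity than your teammate's machine.

I can't explain why it is faster for your friend, but running all of these alter table queries is never going to be efficient, and it will become less so as the database grows. I don't understand why you are doing it this way, but there is surely a better way to achieve what you want. Another problem is that none of these columns appears to be indexed. Add some benchmarking between each query and see what you can learn about which parts are slow. And of course, check the CloudWatch metrics on your EBS volume.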
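A single-pass update of this kind can be checked against the original statement-by-statement approach before it is swapped in. The sketch below is an illustration under assumptions, not the answerer's code: `RULES`, `build_case_update`, `make_table` and the two demo tables are hypothetical names, and the conditions are copied from the question's script. One detail worth noting is that `CASE` takes the first matching branch, whereas with separate `UPDATE` statements the last matching statement wins. Some of the question's field-position ranges overlap (for example `<= 5` also matches negative values), so the sketch lists the branches in reverse order to reproduce what the sequential statements produce. Whether that result is the intended one is a separate question.

```python
import sqlite3

# (column, value, condition) in the same order as the question's UPDATEs.
RULES = [
    ("team_BASEDOWN", 0, "pff_DOWN = 1 or (pff_DOWN = 2 and pff_DISTANCE <= 6)"),
    ("team_BASEDOWN", 1, "pff_DOWN = 2 and pff_DISTANCE >= 7"),
    ("team_BASEDOWN", 2, "pff_DOWN = 3 and pff_DISTANCE <= 2"),
    ("team_BASEDOWN", 3, "pff_DOWN = 3 and pff_DISTANCE = 3"),
    ("team_BASEDOWN", 4, "pff_DOWN = 3 and pff_DISTANCE >= 4 and pff_DISTANCE <= 6"),
    ("team_BASEDOWN", 5, "pff_DOWN = 3 and pff_DISTANCE >= 7"),
    ("team_BASEDOWN", 6, "pff_DOWN = 4"),
    ("team_FIELDPOSITION", 0, "pff_FIELDPOSITION <= -1 and pff_FIELDPOSITION >= -10"),
    ("team_FIELDPOSITION", 1,
     "pff_FIELDPOSITION <= -11 or (pff_FIELDPOSITION >= 20 and pff_FIELDPOSITION <= 50)"),
    ("team_FIELDPOSITION", 2, "pff_FIELDPOSITION >= 12 and pff_FIELDPOSITION <= 20"),
    ("team_FIELDPOSITION", 3, "pff_FIELDPOSITION >= 6 and pff_FIELDPOSITION <= 11"),
    ("team_FIELDPOSITION", 4, "pff_FIELDPOSITION <= 5"),
]

def build_case_update(table, rules):
    """Build one UPDATE with a CASE expression per target column.

    The table name is interpolated directly, so it must already be
    sanitized (the question's script strips non-word characters).
    """
    by_column = {}
    for column, value, condition in rules:
        by_column.setdefault(column, []).append((value, condition))
    assignments = []
    for column, branches in by_column.items():
        # Reverse the order: CASE stops at the first match, while
        # sequential UPDATEs let the last matching statement win.
        whens = " ".join(
            f"WHEN {condition} THEN {value}"
            for value, condition in reversed(branches)
        )
        assignments.append(f"{column} = CASE {whens} ELSE {column} END")
    return f"UPDATE {table} SET " + ", ".join(assignments)

def make_table(conn, name):
    """Create a small demo table covering every down/distance/position range."""
    conn.execute(
        f"CREATE TABLE {name} (pff_DOWN INTEGER, pff_DISTANCE INTEGER, "
        "pff_FIELDPOSITION INTEGER, team_BASEDOWN INTEGER, "
        "team_FIELDPOSITION INTEGER)"
    )
    rows = [
        (down, dist, pos)
        for down in range(1, 5)
        for dist in range(1, 11)
        for pos in range(-50, 51, 7)
    ]
    conn.executemany(
        f"INSERT INTO {name} (pff_DOWN, pff_DISTANCE, pff_FIELDPOSITION) "
        "VALUES (?, ?, ?)",
        rows,
    )

conn = sqlite3.connect(":memory:")
make_table(conn, "seq_demo")
make_table(conn, "case_demo")

# Original approach: one full pass over the table per statement.
for column, value, condition in RULES:
    conn.execute(f"UPDATE seq_demo SET {column} = {value} WHERE {condition}")

# Single pass over the table.
conn.execute(build_case_update("case_demo", RULES))
conn.commit()

query = (
    "SELECT pff_DOWN, pff_DISTANCE, pff_FIELDPOSITION, team_BASEDOWN, "
    "team_FIELDPOSITION FROM {} ORDER BY rowid"
)
seq_rows = conn.execute(query.format("seq_demo")).fetchall()
case_rows = conn.execute(query.format("case_demo")).fetchall()
print("identical results:", seq_rows == case_rows)
```

Driving both variants from the same rule list keeps the conditions in one place, so the comparison tests the rewrite rather than a retyped copy of the conditions.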