Low InnoDB writes per second - writing from AWS EC2 to MySQL RDS with Python


I have around 60 GB of JSON files that I am parsing with Python and then inserting into a MySQL database using the Python MySQL connector. Each JSON file is roughly 500 MB.

I have been using an AWS r3.xlarge EC2 instance with a secondary volume to hold the 60 GB of JSON data.

I am then using an AWS RDS r3.xlarge MySQL instance. Both instances are in the same region and availability zone. The EC2 instance loads the JSON, parses it, and inserts it into the MySQL RDS using the following Python script:

import json
import mysql.connector
from mysql.connector import errorcode
from pprint import pprint
import glob
import os

os.chdir("./json_data")

for file in glob.glob("*.json"):
    with open(file, 'r') as data_file:
        results = json.load(data_file)
        print('working on file:', file)

    cnx = mysql.connector.connect(user='', password='',
        host='')

    cursor = cnx.cursor(buffered=True)

    DB_NAME = 'DB'

    def create_database(cursor):
        try:
            cursor.execute(
                "CREATE DATABASE {} DEFAULT CHARACTER SET 'utf8'".format(DB_NAME))
        except mysql.connector.Error as err:
            print("Failed creating database: {}".format(err))
            exit(1)

    try:
        cnx.database = DB_NAME    
    except mysql.connector.Error as err:
        if err.errno == errorcode.ER_BAD_DB_ERROR:
            create_database(cursor)
            cnx.database = DB_NAME
        else:
            print(err)
            exit(1)

    add_overall_data = ("INSERT INTO master" 
        "(_sent_time_stamp, dt, ds, dtf, O_l, O_ln, O_Ls, O_a, D_l, D_ln, d_a)"
        "VALUES (%(_sent_time_stamp)s, %(dt)s, %(ds)s, %(dtf)s, %(O_l)s, %(O_ln)s, %(O_Ls)s, %(O_a)s, %(D_l)s, %(D_ln)s, %(d_a)s)")

    add_polyline = ("INSERT INTO polyline"
        "(Overview_polyline, request_no)"
        "VALUES (%(Overview_polyline)s, %(request_no)s)")

    add_summary = ("INSERT INTO summary"
        "(summary, request_no)"
        "VALUES (%(summary)s, %(request_no)s)")

    add_warnings = ("INSERT INTO warnings"
        "(warnings, request_no)"
        "VALUES (%(warnings)s, %(request_no)s)")

    add_waypoint_order = ("INSERT INTO waypoint_order"
        "(waypoint_order, request_no)"
        "VALUES (%(waypoint_order)s, %(request_no)s)")

    add_leg_data = ("INSERT INTO leg_data"
        "(request_no, leg_dt, leg_ds, leg_O_l, leg_O_ln, leg_D_l, leg_D_ln, leg_html_inst, leg_polyline, leg_travel_mode)" 
        "VALUES (%(request_no)s, %(leg_dt)s, %(leg_ds)s, %(leg_O_l)s, %(leg_O_ln)s, %(leg_D_l)s, %(leg_D_ln)s, %(leg_html_inst)s, %(leg_polyline)s, %(leg_travel_mode)s)")
    error_messages = []
    for result in results:
        if result["status"] == "OK":
            for leg in result['routes'][0]['legs']:
                try: 
                    params = {
                    "_sent_time_stamp": leg['_sent_time_stamp'],
                    "dt": leg['dt']['value'],
                    "ds": leg['ds']['value'],
                    "dtf": leg['dtf']['value'],
                    "O_l": leg['start_location']['lat'],
                    "O_ln": leg['start_location']['lng'],
                    "O_Ls": leg['O_Ls'],
                    "O_a": leg['start_address'],
                    "D_l": leg['end_location']['lat'],
                    "D_ln": leg['end_location']['lng'],
                    "d_a": leg['end_address']
                    }
                    cursor.execute(add_overall_data, params)
                    query = ('SELECT request_no FROM master WHERE O_l = %s AND O_ln = %s AND D_l = %s AND D_ln = %s AND _sent_time_stamp = %s')
                    O_l = leg['start_location']['lat']
                    O_ln = leg['start_location']['lng']
                    D_l = leg['end_location']['lat']
                    D_ln = leg['end_location']['lng']
                    _sent_time_stamp = leg['_sent_time_stamp']
                    cursor.execute(query,(O_l, O_ln, D_l, D_ln, _sent_time_stamp))
                    request_no = cursor.fetchone()[0]
                except KeyError as e:
                    error_messages.append(e)
                    params = {
                    "_sent_time_stamp": leg['_sent_time_stamp'],
                    "dt": leg['dt']['value'],
                    "ds": leg['ds']['value'],
                    "dtf": "000",
                    "O_l": leg['start_location']['lat'],
                    "O_ln": leg['start_location']['lng'],
                    "O_Ls": leg['O_Ls'],
                    "O_a": 'unknown',
                    "D_l": leg['end_location']['lat'],
                    "D_ln": leg['end_location']['lng'],
                    "d_a": 'unknown'
                    }
                    cursor.execute(add_overall_data, params)
                    query = ('SELECT request_no FROM master WHERE O_l = %s AND O_ln = %s AND D_l = %s AND D_ln = %s AND _sent_time_stamp = %s')
                    O_l = leg['start_location']['lat']
                    O_ln = leg['start_location']['lng']
                    D_l = leg['end_location']['lat']
                    D_ln = leg['end_location']['lng']
                    _sent_time_stamp = leg['_sent_time_stamp']
                    cursor.execute(query,(O_l, O_ln, D_l, D_ln, _sent_time_stamp))
                    request_no = cursor.fetchone()[0]
            for overview_polyline in result['routes']:
                params = {
                "request_no": request_no,
                "Overview_polyline": overview_polyline['overview_polyline']['points']
                }
                cursor.execute(add_polyline, params)
                query = ('SELECT request_no FROM master WHERE O_l = %s AND O_ln = %s AND D_l = %s AND D_ln = %s AND _sent_time_stamp = %s')
                O_l = leg['start_location']['lat']
                O_ln = leg['start_location']['lng']
                D_l = leg['end_location']['lat']
                D_ln = leg['end_location']['lng']
                _sent_time_stamp = leg['_sent_time_stamp']
                cursor.execute(query,(O_l, O_ln, D_l, D_ln, _sent_time_stamp))
                request_no = cursor.fetchone()[0]
            for summary in result['routes']:
                params = {
                "request_no": request_no,
                "summary": summary['summary']
                }
                cursor.execute(add_summary, params)
                query = ('SELECT request_no FROM master WHERE O_l = %s AND O_ln = %s AND D_l = %s AND D_ln = %s AND _sent_time_stamp = %s')
                O_l = leg['start_location']['lat']
                O_ln = leg['start_location']['lng']
                D_l = leg['end_location']['lat']
                D_ln = leg['end_location']['lng']
                _sent_time_stamp = leg['_sent_time_stamp']
                cursor.execute(query,(O_l, O_ln, D_l, D_ln, _sent_time_stamp))
                request_no = cursor.fetchone()[0]
            for warnings in result['routes']:
                params = {
                "request_no": request_no,
                "warnings": str(warnings['warnings'])
                }
                cursor.execute(add_warnings, params)
                query = ('SELECT request_no FROM master WHERE O_l = %s AND O_ln = %s AND D_l = %s AND D_ln = %s AND _sent_time_stamp = %s')
                O_l = leg['start_location']['lat']
                O_ln = leg['start_location']['lng']
                D_l = leg['end_location']['lat']
                D_ln = leg['end_location']['lng']
                _sent_time_stamp = leg['_sent_time_stamp']
                cursor.execute(query,(O_l, O_ln, D_l, D_ln, _sent_time_stamp))
                request_no = cursor.fetchone()[0]
            for waypoint_order in result['routes']:
                params = {
                "request_no": request_no,
                "waypoint_order": str(waypoint_order['waypoint_order'])
                }
                cursor.execute(add_waypoint_order, params)
                query = ('SELECT request_no FROM master WHERE O_l = %s AND O_ln = %s AND D_l = %s AND D_ln = %s AND _sent_time_stamp = %s')
                O_l = leg['start_location']['lat']
                O_ln = leg['start_location']['lng']
                D_l = leg['end_location']['lat']
                D_ln = leg['end_location']['lng']
                _sent_time_stamp = leg['_sent_time_stamp']
                cursor.execute(query,(O_l, O_ln, D_l, D_ln, _sent_time_stamp))
                request_no = cursor.fetchone()[0]
            for steps in result['routes'][0]['legs'][0]['steps']:
                params = {
                "request_no": request_no,
                "leg_dt": steps['dt']['value'],
                "leg_ds": steps['ds']['value'],
                "leg_O_l": steps['start_location']['lat'],
                "leg_O_ln": steps['start_location']['lng'],
                "leg_D_l": steps['end_location']['lat'],
                "leg_D_ln": steps['end_location']['lng'],
                "leg_html_inst": steps['html_instructions'],
                "leg_polyline": steps['polyline']['points'],
                "leg_travel_mode": steps['travel_mode']
                }
                cursor.execute(add_leg_data, params)
        cnx.commit()
    print('error messages:', error_messages)
    cursor.close()
    cnx.close()
    print('finished', file)
Using htop on the Linux instance I can see the following:

On the MySQL database side, using MySQL Workbench I can see:

This Python script has been running for several days, but I have only inserted about 20% of the data into MySQL.

My questions - how can I identify the bottleneck? Is it the Python script? It seems to be using a low amount of memory - can I increase this? I checked the InnoDB buffer pool size and found it is large:

SELECT @@innodb_buffer_pool_size;
+---------------------------+
| @@innodb_buffer_pool_size |
+---------------------------+
|               11674845184 |
+---------------------------+
Since I am using RDS and EC2 instances in the same region, I don't believe there is a network bottleneck. Pointers on where I should look for the biggest savings would be very welcome.

EDIT

I think I may have stumbled on the problem. To make parsing more efficient, I write each level of the JSON separately. However, I then have to execute a query to match the nested parts of the JSON back to their higher level. This query has low overhead on a small database, but I have noticed that the speed of the inserts on this db has dropped dramatically. This is because it has to search an ever larger, growing table to correctly join the JSON data.


I'm not sure how to solve this other than waiting for it to finish...
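One way to remove that lookup entirely: the DB-API cursor exposes a `lastrowid` attribute holding the auto-generated key of the row just inserted, so there is no need to SELECT it back from the growing table. A minimal sketch, illustrated with the stdlib `sqlite3` driver and a simplified `master` table; assuming `request_no` is an AUTO_INCREMENT primary key, `mysql.connector`'s cursor exposes the same attribute:

```python
import sqlite3

# Illustration with the stdlib sqlite3 driver; mysql.connector's cursor
# exposes the same DB-API lastrowid attribute after an INSERT into a
# table whose primary key is AUTO_INCREMENT.
cnx = sqlite3.connect(":memory:")
cursor = cnx.cursor()
cursor.execute(
    "CREATE TABLE master ("
    "request_no INTEGER PRIMARY KEY AUTOINCREMENT, O_l REAL, O_ln REAL)")
cursor.execute("INSERT INTO master (O_l, O_ln) VALUES (?, ?)", (51.50, -0.12))

# The key of the row just inserted, with no SELECT against the table.
request_no = cursor.lastrowid
print(request_no)  # -> 1
```

This replaces each of the repeated five-column `SELECT request_no FROM master WHERE ...` round-trips with a local attribute read.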

I can't see any table definitions in your Python script... but when we attempt large data operations - loading into MySQL - we would always disable any database indexes, and if you have any constraint / foreign key enforcement, that should also be disabled during the load.

Autocommit is disabled by default when connecting through Connector/Python.

But I don't see any commit options in the code you supplied.

Summary

Disable / remove for the load:

- indexes
- constraints
- foreign keys
- triggers

In the loading program:

- disable autocommit
- commit every n records (n will depend on the available buffer size)
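The "commit every n records" advice can be sketched as a small batching helper; `chunked` is a hypothetical name, and the commented-out loop assumes the `cursor`, `cnx`, and `add_overall_data` objects from the question:

```python
def chunked(rows, n):
    """Yield successive batches of at most n rows."""
    for i in range(0, len(rows), n):
        yield rows[i:i + n]

# With autocommit left at Connector/Python's default (off), commit once
# per batch rather than once per row or once per 500 MB file, e.g.:
#
#   for batch in chunked(all_leg_params, 1000):
#       cursor.executemany(add_overall_data, batch)
#       cnx.commit()

batches = list(chunked(list(range(10)), 4))
print(batches)  # -> [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Smaller batches mean more commit overhead; very large batches grow the transaction and its buffers, so n is a tuning knob.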

My English is poor.

If I were doing this job, I would:

use python to convert the json to txt

use the mysqlimport tool to import the txt into mysql

If you must do it all with python + mysql, I suggest you use multi-row inserts:

insert into table values (1), (2) ... (xxx)
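A statement of that shape can be generated from parameter tuples; `multi_row_insert` below is a hypothetical helper (`mysql.connector`'s `executemany` performs a similar rewrite for plain `INSERT ... VALUES` statements):

```python
def multi_row_insert(table, columns, rows):
    """Build a single multi-row INSERT statement and its flat parameter list."""
    row_placeholder = "(" + ", ".join(["%s"] * len(columns)) + ")"
    sql = "INSERT INTO {} ({}) VALUES {}".format(
        table, ", ".join(columns), ", ".join([row_placeholder] * len(rows)))
    params = [value for row in rows for value in row]
    return sql, params

sql, params = multi_row_insert("summary", ["summary", "request_no"],
                               [("M4", 1), ("A40", 2)])
print(sql)     # -> INSERT INTO summary (summary, request_no) VALUES (%s, %s), (%s, %s)
print(params)  # -> ['M4', 1, 'A40', 2]
```

One round-trip then inserts many rows, which matters far more over a network link to RDS than it does against a local server.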
And why run the 'SELECT request_no FROM master' query that appears so many times - that value should be read from the json instead.

My English is poor, so...

Based on this information, it looks like the script and the database are mostly idle. It is premature to tune anything at the MySQL level.

You need a clearer picture of what your program is doing.

Start by logging how long each query takes, the errors that occur, and so on.
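A minimal sketch of that per-query logging; `timed_execute` is a hypothetical wrapper around any DB-API cursor, demonstrated here with the stdlib `sqlite3` driver:

```python
import sqlite3
import time

def timed_execute(cursor, sql, params=()):
    """Execute a query, print its wall-clock duration, and return it."""
    start = time.perf_counter()
    cursor.execute(sql, params)
    elapsed = time.perf_counter() - start
    print("{:.6f}s  {}".format(elapsed, sql[:60]))
    return elapsed

cursor = sqlite3.connect(":memory:").cursor()
elapsed = timed_execute(cursor, "SELECT 1")
```

Wrapping the script's `cursor.execute` calls this way would quickly show whether the inserts or the repeated `SELECT request_no` lookups dominate the runtime.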


If that turns out to be the problem, those SELECTs may need an index added to perform well.
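The repeated lookup filters on five columns, so a composite index covering that WHERE clause is one plausible fix. `idx_master_lookup` is a hypothetical name, and the sketch uses the stdlib `sqlite3` driver with a simplified table (the CREATE INDEX statement itself is the same on MySQL):

```python
import sqlite3

cnx = sqlite3.connect(":memory:")
cursor = cnx.cursor()
cursor.execute(
    "CREATE TABLE master (request_no INTEGER PRIMARY KEY, O_l REAL, "
    "O_ln REAL, D_l REAL, D_ln REAL, _sent_time_stamp TEXT)")

# Composite index matching the WHERE clause of the repeated
# SELECT request_no lookup.
cursor.execute(
    "CREATE INDEX idx_master_lookup "
    "ON master (O_l, O_ln, D_l, D_ln, _sent_time_stamp)")

indexes = [row[1] for row in cursor.execute("PRAGMA index_list(master)")]
print(indexes)  # -> ['idx_master_lookup']
```

The trade-off is that every extra index slows the inserts themselves, which is why the earlier answer suggests dropping indexes during a bulk load and adding them afterwards.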

You mention the EC2 and RDS instances are in the same region; are they also in the same availability zone? If not, that could be an easy place to see further improvement. Yes, I have considered that - they are both in the same availability zone. Have you tried provisioned IOPS on the RDS instance? I am using provisioned IOPS SSD, with IOPS = 1000 and 200 GB of storage. I'm not sure whether I can raise the IOPS, but will look into that now... Please use explicit transactions, otherwise it is hard to understand.