Google BigQuery: I just want to load 5GB from MySQL into BigQuery

Long time no see. I want to get 5GB of data out of MySQL and into BigQuery. My best bet seems to be some sort of CSV export/import, which doesn't work for various reasons; see:

agile-coral-830:splitpapers1501200518aa150120052659
agile-coral-830:splitpapers1501200545aa150120055302
agile-coral-830:splitpapers1501200556aa150120060231
This is probably because I don't have the right MySQL incantation to produce a perfect CSV per RFC 4180. However, supporting a customizable multi-character field separator and multi-character line separator would settle this whole loading business in five minutes, instead of arguing over RFC 4180 details. I'm fairly certain my data contains neither ### nor @@@, so the following would work great:

mysql> select * from $TABLE_NAME 
into outfile '$DATA.csv' 
fields terminated by '###' 
enclosed by ''
lines terminated by '@@@'

$ bq load  --nosync -F '###' -E '@@@' $TABLE_NAME $DATA.csv $SCHEMA.json
Edit: the fields contain '\n', '\r', ',' and '"'. They also contain NULLs, which MySQL writes as [escape]N, as in the sample "N. Example row:

"10.1.1.1.1483","5","9074080","Candidate high myopia loci on chromosomes 18p and 12q do not play a major role in susceptibility to common myopia","Results
There was no strong evidence of linkage of common myopia to these candidate regions: all two-point and multipoint heterogeneity LOD scores were < 1.0 and non-parametric linkage p-values were > 0.01. However, one Amish family showed slight evidence of linkage (LOD>1.0) on 12q; another 3 Amish families each gave LOD >1.0 on 18p; and 3 Jewish families each gave LOD >1.0 on 12q.
Conclusions
Significant evidence of linkage (LOD> 3) of myopia was not found on chromosome 18p or 12q loci in these families. These results suggest that these loci do not play a major role in the causation of common myopia in our families studied.","2004","BMC MEDICAL GENETICS","JOURNAL","N,"5","20","","","","0","1","USER","2007-11-19 05:00:00","rep1","PDFLib TET","0","2009-05-24 20:33:12"
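For what it's worth, one way around the MySQL incantation problem is to let Python's csv module do the RFC 4180 quoting (it handles embedded '\n', '\r', ',' and '"' correctly). A minimal sketch; the connection details and the dump_table_to_csv helper are placeholders for illustration:

import csv
import MySQLdb

def dump_table_to_csv(table, out_path):
    # Placeholder connection; adjust for your own setup.
    conn = MySQLdb.connect(db='mydb', user='user', passwd='pass')
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM %s" % table)
    with open(out_path, 'wb') as f:
        # QUOTE_ALL quotes every field and doubles embedded quotes,
        # per RFC 4180.
        writer = csv.writer(f, quoting=csv.QUOTE_ALL)
        for row in cursor:
            # NULLs arrive as None; write them as empty fields.
            writer.writerow(['' if v is None else v for v in row])
    conn.close()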

I found loading through CSV quite difficult; more restrictions and complications. I've spent this morning moving data from MySQL to BigQuery.

Below is a Python script that will build the table decorator and stream the data directly into the BigQuery table.

My db is in the cloud, so you may need to change the connection string. Fill in the missing values for your particular situation, then call it with:

SQLToBQBatch(tableName, limit)
I put the limit in for testing. For my final test I sent 99999999 for the limit and it all worked fine.

I'd recommend using a backend module to run this on 5GB.

Use 'RowToJSON' to clean up invalid characters (i.e. anything that isn't UTF-8).
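If your bad bytes differ from mine, a more general (if lossier) alternative to the hand-rolled replacements in 'RowToJSON' is to decode with replacement and re-encode; a quick sketch, where the ScrubToUTF8 name is just for illustration:

def ScrubToUTF8(value):
    # Treat the raw bytes as cp1252 (where \x92, \x93, ... are smart
    # quotes and similar punctuation), then re-encode as UTF-8,
    # substituting anything that still doesn't decode.
    return value.decode('cp1252', 'replace').encode('utf-8')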

I haven't tested it on 5GB, but it was able to do 50k rows in about 20 seconds. The same load in CSV took over 2 minutes.

I wrote this to test things, so please excuse the bad coding practices and mini hacks. It works, so feel free to clean it up for any production-level work.

import MySQLdb
import logging
from apiclient.discovery import build
from oauth2client.appengine import AppAssertionCredentials
import httplib2

OAUTH_SCOPE = 'https://www.googleapis.com/auth/bigquery'



PROJECT_ID = ''        # TODO: your BigQuery project ID
DATASET_ID = ''        # TODO: target dataset ID
TABLE_ID = ''          # TODO: target table ID

SQL_DATABASE_NAME = '' # TODO: Cloud SQL instance name
SQL_DATABASE_DB = ''   # TODO: MySQL database name
SQL_USER = ''          # TODO: MySQL user
SQL_PASS = ''          # TODO: MySQL password


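# Connect to the Cloud SQL instance over its unix socket; change this
# if your MySQL lives somewhere else.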
def Connect():
    return MySQLdb.connect(unix_socket='/cloudsql/' + SQL_DATABASE_NAME, db=SQL_DATABASE_DB, user=SQL_USER, passwd=SQL_PASS)


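# Turn one MySQL row into a JSON-serializable dict keyed by column name,
# coercing numeric columns and scrubbing stray non-UTF-8 bytes (mostly
# cp1252 punctuation) out of string columns.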
def RowToJSON(cursor, row, fields):
    newData = {}
    for i, value in enumerate(row):
        try:
            # Numeric columns: coerce to int or float. Anything that
            # isn't numeric (strings, NULLs) falls through to the
            # except branch and gets scrubbed instead.
            if fields[i]["type"] == bqTypeDict["int"]:
                value = int(value)
            else:
                value = float(value)
        except (ValueError, TypeError):
            if value is not None:
                value = value.replace("\x92", "'") \
                                .replace("\x96", "'") \
                                .replace("\x93", '"') \
                                .replace("\x94", '"') \
                                .replace("\x97", '-') \
                                .replace("\xe9", 'e') \
                                .replace("\x91", "'") \
                                .replace("\x85", "...") \
                                .replace("\xb4", "'") \
                                .replace('"', '""')

        newData[cursor.description[i][0]] = value
    return newData


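# Build a BigQuery API client authorized as the App Engine service account.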
def GetBuilder():
    return build('bigquery', 'v2', http=AppAssertionCredentials(scope=OAUTH_SCOPE).authorize(httplib2.Http()))

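# Map MySQL column types (minus any "(length)" suffix) to BigQuery types.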
bqTypeDict = { 'int' : 'INTEGER',
                   'varchar' : 'STRING',
                   'double' : 'FLOAT',
                   'tinyint' : 'INTEGER',
                   'decimal' : 'FLOAT',
                   'text' : 'STRING',
                   'smallint' : 'INTEGER',
                   'char' : 'STRING',
                   'bigint' : 'INTEGER',
                   'float' : 'FLOAT',
                   'longtext' : 'STRING'
                  }

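# Derive the BigQuery schema from the MySQL table via DESCRIBE.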
def BuildFields(table):
    conn = Connect()
    cursor = conn.cursor()
    cursor.execute("DESCRIBE %s;" % table)
    tableDecorator = cursor.fetchall()
    fields = []

    for col in tableDecorator:
        field = {}
        field["name"] = col[0]
        colType = col[1].split("(")[0]
        if colType not in bqTypeDict:
            logging.warning("Unknown type detected, using string: %s", str(col[1]))
        field["type"] = bqTypeDict.get(colType, "STRING")
        if col[2] == "YES":
            field["mode"] = "NULLABLE"
        fields.append(field)
    return fields


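# Create the dataset and table if they don't already exist, then page
# through the MySQL table and stream each page into BigQuery with
# tabledata().insertAll().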
def SQLToBQBatch(table, limit=3000):
    logging.info("****************************************************")
    logging.info("Starting SQLToBQBatch. Got: Table: %s, Limit: %i" % (table, limit))   
    bqDest = GetBuilder()
    fields = BuildFields(table)

    try:
        response = bqDest.datasets().insert(projectId=PROJECT_ID, body={'datasetReference' :
                                                                {'datasetId' : DATASET_ID} }).execute()
        logging.info("Added Dataset")
        logging.info(response)
    except Exception, e:
        logging.info(e)
        if "Already Exists: " in str(e):
            logging.info("Dataset already exists")
        else:
            logging.error("Error creating dataset: %s", e)

    try:
        response = bqDest.tables().insert(projectId=PROJECT_ID, datasetId=DATASET_ID, body={'tableReference' : {'projectId'  : PROJECT_ID,
                                                                                               'datasetId' : DATASET_ID,
                                                                                               'tableId'  : TABLE_ID},
                                                                            'schema' : {'fields' : fields}}
                                                                                ).execute()
        logging.info("Added Table")
        logging.info(response)
    except Exception, e:
        logging.info(e)
        if "Already Exists: " in str(e):
            logging.info("Table already exists")
        else:
            logging.error("Error creating table: %s", e)

    conn = Connect()
    cursor = conn.cursor()

    logging.info("Starting load loop")
    count = -1
    cur_pos = 0
    total = 0
    batch_size = 1000

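    # Page through the table with LIMIT <offset>, <count>; stop when a
    # page comes back empty or we reach the requested limit.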
    while count != 0 and cur_pos < limit:
        count = 0
        if batch_size + cur_pos > limit:
            batch_size = limit - cur_pos
        sqlCommand = "SELECT * FROM %s LIMIT %i, %i" % (table, cur_pos, batch_size) 
        logging.info("Running: %s", sqlCommand)
        cursor.execute(sqlCommand)
        data = []
        for row in cursor.fetchall():
            data.append({"json": RowToJSON(cursor, row, fields)})
            count += 1
        logging.info("Read complete")

        if count != 0:

            logging.info("Sending request")   
            insertResponse = bqDest.tabledata().insertAll(
                                                        projectId=PROJECT_ID,
                                                        datasetId=DATASET_ID,
                                                        tableId=TABLE_ID,
                                                        body={"rows":data}).execute()
            cur_pos += batch_size
            total += count
            logging.info("Done %i, Total: %i, Response: %s", count, total, insertResponse)
            if "insertErrors" in insertResponse:
                logging.error("Error inserting data index: %i", insertResponse["insertErrors"]["index"])
                for error in insertResponse["insertErrors"]["errors"]:
                    logging.error(error)
        else:
            logging.info("No more rows")