Google BigQuery: I just want to load 5 GB from MySQL into BigQuery
Long time no see. I want to get 5 GB of data from MySQL into BigQuery. My best option seems to be some kind of CSV export/import, which does not work for various reasons; see these failed load jobs:
agile-coral-830:splitpapers1501200518aa150120052659
agile-coral-830:splitpapers1501200545aa150120055302
agile-coral-830:splitpapers1501200556aa150120060231
This is probably because I don't have the right MySQL incantation to generate perfect CSV per RFC 4180. However, support for customizable multi-character field separators and multi-character line separators would solve the whole loading business in five minutes, instead of arguing over RFC 4180 details. I'm pretty sure my data contains neither ### nor @@@, so something like the following would work great:
mysql> select * from $TABLE_NAME
into outfile '$DATA.csv'
fields terminated by '###'
enclosed by ''
lines terminated by '@@@'
$ bq load --nosync -F '###' -E '@@@' $TABLE_NAME $DATA.csv $SCHEMA.json
Edit: the fields contain '\n', '\r', ',' and '"'. They also contain NULLs, which MySQL represents as [escape]N in the example. Sample row:
"10.1.1.1.1483","5","9074080","Candidate high myopia loci on chromosomes 18p and 12q do not play a major role in susceptibility to common myopia","Results
There was no strong evidence of linkage of common myopia to these candidate regions: all two-point and multipoint heterogeneity LOD scores were < 1.0 and non-parametric linkage p-values were > 0.01. However, one Amish family showed slight evidence of linkage (LOD>1.0) on 12q; another 3 Amish families each gave LOD >1.0 on 18p; and 3 Jewish families each gave LOD >1.0 on 12q.
Conclusions
Significant evidence of linkage (LOD> 3) of myopia was not found on chromosome 18p or 12q loci in these families. These results suggest that these loci do not play a major role in the causation of common myopia in our families studied.","2004","BMC MEDICAL GENETICS","JOURNAL","N,"5","20","","","","0","1","USER","2007-11-19 05:00:00","rep1","PDFLib TET","0","2009-05-24 20:33:12"
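For reference, Python's `csv` module can produce RFC 4180-style output (quoting every field and doubling embedded quotes), which sidesteps the delimiter wrangling above. A minimal sketch with made-up rows, not data from the question:

```python
import csv
import io

# Rows with the problem characters from the question: embedded newlines,
# commas, double quotes, and a None standing in for SQL NULL.
rows = [
    ["10.1.1.1.1483", 'title with "quotes"', "multi\nline\nabstract"],
    ["10.1.1.1.1484", "plain, with comma", None],
]

buf = io.StringIO()
# QUOTE_ALL wraps every field in quotes; embedded quotes are doubled,
# so embedded newlines and commas survive a round trip.
writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\r\n")
for row in rows:
    writer.writerow(["" if v is None else v for v in row])

csv_text = buf.getvalue()
```

Reading `csv_text` back with `csv.reader` round-trips the embedded newlines and quotes intact.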
I found loading through CSV quite hard; too many restrictions and complications. So this morning I moved the data from MySQL to BigQuery a different way.
Below is a Python script that builds the table decorator and streams the data directly into the BigQuery table.
My db lives in the cloud, so you may need to change the connection string. Fill in the missing values for your particular situation, then call it with:
SQLToBQBatch(tableName, limit)
I put the limit in for testing. In my last test I passed 99999999 as the limit and everything worked.
I would recommend using a backend module to run this over 5 GB.
Use 'RowToJSON' to clean out invalid characters (i.e. anything that isn't UTF-8).
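As an alternative to the hard-coded replacement table in RowToJSON: the stray bytes it handles (\x92, \x93, \x94, ...) are Windows-1252 punctuation, so decoding with that codec covers them generically. A hedged sketch; `clean_text` is a made-up helper, not part of the script below:

```python
def clean_text(raw):
    """Decode Windows-1252 bytes (the source of \\x92-style smart quotes),
    replacing anything undecodable, then double embedded quotes as the
    original RowToJSON does."""
    text = raw.decode("cp1252", errors="replace")
    return text.replace('"', '""')
```

This keeps the smart quotes as proper Unicode instead of flattening them to ASCII, which BigQuery's JSON streaming accepts fine.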
I haven't tested it on 5 GB, but it was able to do 50k rows in about 20 seconds; the same load via CSV took over 2 minutes.
I wrote this to test things, so please excuse the bad coding practices and mini hacks. It works, so feel free to clean it up for any production-level work.
import MySQLdb
import logging
from apiclient.discovery import build
from oauth2client.appengine import AppAssertionCredentials
import httplib2
OAUTH_SCOPE = 'https://www.googleapis.com/auth/bigquery'
PROJECT_ID = ''         # fill in
DATASET_ID = ''         # fill in
TABLE_ID = ''           # fill in
SQL_DATABASE_NAME = ''  # fill in
SQL_DATABASE_DB = ''    # fill in
SQL_USER = ''           # fill in
SQL_PASS = ''           # fill in
def Connect():
    return MySQLdb.connect(unix_socket='/cloudsql/' + SQL_DATABASE_NAME, db=SQL_DATABASE_DB, user=SQL_USER, passwd=SQL_PASS)
def RowToJSON(cursor, row, fields):
    newData = {}
    for i, value in enumerate(row):
        try:
            if fields[i]["type"] == bqTypeDict["int"]:
                value = int(value)
            else:
                value = float(value)
        except (ValueError, TypeError):
            if value is not None:
                # Replace common Windows-1252 punctuation bytes and escape quotes
                value = value.replace("\x92", "'") \
                             .replace("\x96", "'") \
                             .replace("\x93", '"') \
                             .replace("\x94", '"') \
                             .replace("\x97", '-') \
                             .replace("\xe9", 'e') \
                             .replace("\x91", "'") \
                             .replace("\x85", "...") \
                             .replace("\xb4", "'") \
                             .replace('"', '""')
        newData[cursor.description[i][0]] = value
    return newData
def GetBuilder():
    return build('bigquery', 'v2', http=AppAssertionCredentials(scope=OAUTH_SCOPE).authorize(httplib2.Http()))
bqTypeDict = { 'int' : 'INTEGER',
'varchar' : 'STRING',
'double' : 'FLOAT',
'tinyint' : 'INTEGER',
'decimal' : 'FLOAT',
'text' : 'STRING',
'smallint' : 'INTEGER',
'char' : 'STRING',
'bigint' : 'INTEGER',
'float' : 'FLOAT',
'longtext' : 'STRING'
}
def BuildFeilds(table):
    conn = Connect()
    cursor = conn.cursor()
    cursor.execute("DESCRIBE %s;" % table)
    tableDecorator = cursor.fetchall()
    fields = []
    for col in tableDecorator:
        field = {}
        field["name"] = col[0]
        colType = col[1].split("(")[0]
        if colType not in bqTypeDict:
            logging.warning("Unknown type detected, using string: %s", str(col[1]))
        field["type"] = bqTypeDict.get(colType, "STRING")
        if col[2] == "YES":
            field["mode"] = "NULLABLE"
        fields.append(field)
    return fields
def SQLToBQBatch(table, limit=3000):
    logging.info("****************************************************")
    logging.info("Starting SQLToBQBatch. Got: Table: %s, Limit: %i" % (table, limit))
    bqDest = GetBuilder()
    fields = BuildFeilds(table)
    try:
        response = bqDest.datasets().insert(projectId=PROJECT_ID, body={'datasetReference':
                                            {'datasetId': DATASET_ID}}).execute()
        logging.info("Added Dataset")
        logging.info(response)
    except Exception as e:
        logging.info(e)
        if "Already Exists: " in str(e):
            logging.info("Dataset already exists")
        else:
            logging.error("Error creating dataset: %s", str(e))
    try:
        response = bqDest.tables().insert(projectId=PROJECT_ID, datasetId=DATASET_ID,
                                          body={'tableReference': {'projectId': PROJECT_ID,
                                                                   'datasetId': DATASET_ID,
                                                                   'tableId': TABLE_ID},
                                                'schema': {'fields': fields}}
                                          ).execute()
        logging.info("Added Table")
        logging.info(response)
    except Exception as e:
        logging.info(e)
        if "Already Exists: " in str(e):
            logging.info("Table already exists")
        else:
            logging.error("Error creating table: %s", str(e))
    conn = Connect()
    cursor = conn.cursor()
    logging.info("Starting load loop")
    count = -1
    cur_pos = 0
    total = 0
    batch_size = 1000
    while count != 0 and cur_pos < limit:
        count = 0
        if batch_size + cur_pos > limit:
            batch_size = limit - cur_pos
        sqlCommand = "SELECT * FROM %s LIMIT %i, %i" % (table, cur_pos, batch_size)
        logging.info("Running: %s", sqlCommand)
        cursor.execute(sqlCommand)
        data = []
        for row in cursor.fetchall():
            data.append({"json": RowToJSON(cursor, row, fields)})
            count += 1
        logging.info("Read complete")
        if count != 0:
            logging.info("Sending request")
            insertResponse = bqDest.tabledata().insertAll(
                projectId=PROJECT_ID,
                datasetId=DATASET_ID,
                tableId=TABLE_ID,
                body={"rows": data}).execute()
            cur_pos += batch_size
            total += count
            logging.info("Done %i, Total: %i, Response: %s", count, total, insertResponse)
            if "insertErrors" in insertResponse:
                # insertErrors is a list of {"index": ..., "errors": [...]} entries
                for insertError in insertResponse["insertErrors"]:
                    logging.error("Error inserting data index: %i", insertError["index"])
                    for error in insertError["errors"]:
                        logging.error(error)
        else:
            logging.info("No more rows")
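The DESCRIBE-to-schema mapping used in BuildFeilds can be exercised standalone without a database connection. A simplified sketch with made-up columns; `describe_to_fields` is a hypothetical helper mirroring the logic above:

```python
# Same type mapping as the script above, trimmed for illustration.
bq_type = {'int': 'INTEGER', 'bigint': 'INTEGER', 'varchar': 'STRING',
           'text': 'STRING', 'double': 'FLOAT', 'float': 'FLOAT'}

def describe_to_fields(described):
    # described: rows shaped like MySQL DESCRIBE output (name, type, nullable).
    fields = []
    for name, col_type, nullable in described:
        base = col_type.split("(")[0]          # strip "(255)" etc.
        field = {"name": name, "type": bq_type.get(base, "STRING")}
        if nullable == "YES":
            field["mode"] = "NULLABLE"
        fields.append(field)
    return fields
```

Unknown MySQL types fall back to STRING, matching the warning path in the original script.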