Google BigQuery: I just want to load 5 GB from MySQL into BigQuery
Long time no see. I want to get 5 GB of data from MySQL into BigQuery. My best option seems to be some kind of CSV export/import, which does not work for various reasons; see these failed load jobs:
agile-coral-830:splitpapers1501200518aa150120052659
agile-coral-830:splitpapers1501200545aa150120055302
agile-coral-830:splitpapers1501200556aa150120060231
This is probably because I don't have the right MySQL incantation to generate perfect CSV per RFC 4180. However, support for customizable multi-character field separators and multi-character line separators would solve the whole loading business in five minutes, instead of arguing over RFC 4180 details. I'm pretty sure my data contains neither ### nor @@@, so something like the following would work great:
mysql> select * from $TABLE_NAME
into outfile '$DATA.csv'
fields terminated by '###'
enclosed by ''
lines terminated by '@@@'
$ bq load --nosync -F '###' -E '@@@' $TABLE_NAME $DATA.csv $SCHEMA.json
Edit: the fields contain '\n', '\r', ',' and '"'. They also contain NULLs, which MySQL represents as [escape]N in the example. Sample row:
"10.1.1.1.1483","5","9074080","Candidate high myopia loci on chromosomes 18p and 12q do not play a major role in susceptibility to common myopia","Results
There was no strong evidence of linkage of common myopia to these candidate regions: all two-point and multipoint heterogeneity LOD scores were < 1.0 and non-parametric linkage p-values were > 0.01. However, one Amish family showed slight evidence of linkage (LOD>1.0) on 12q; another 3 Amish families each gave LOD >1.0 on 18p; and 3 Jewish families each gave LOD >1.0 on 12q.
Conclusions
Significant evidence of linkage (LOD> 3) of myopia was not found on chromosome 18p or 12q loci in these families. These results suggest that these loci do not play a major role in the causation of common myopia in our families studied.","2004","BMC MEDICAL GENETICS","JOURNAL","N,"5","20","","","","0","1","USER","2007-11-19 05:00:00","rep1","PDFLib TET","0","2009-05-24 20:33:12"
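For reference, Python's `csv` module can produce RFC 4180-style output (quoting every field and doubling embedded quotes), which sidesteps the delimiter wrangling above. A minimal sketch with made-up rows, not data from the question:

```python
import csv
import io

# Rows with the problem characters from the question: embedded newlines,
# commas, double quotes, and a None standing in for SQL NULL.
rows = [
    ["10.1.1.1.1483", 'title with "quotes"', "multi\nline\nabstract"],
    ["10.1.1.1.1484", "plain, with comma", None],
]

buf = io.StringIO()
# QUOTE_ALL wraps every field in quotes; embedded quotes are doubled,
# so embedded newlines and commas survive a round trip.
writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\r\n")
for row in rows:
    writer.writerow(["" if v is None else v for v in row])

csv_text = buf.getvalue()
```

Reading `csv_text` back with `csv.reader` round-trips the embedded newlines and quotes intact.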
I found loading through CSV quite hard; too many restrictions and complications. So this morning I moved the data from MySQL to BigQuery a different way.
Below is a Python script that builds the table decorator and streams the data directly into the BigQuery table.
My db lives in the cloud, so you may need to change the connection string. Fill in the missing values for your particular situation, then call it with:
SQLToBQBatch(tableName, limit)
I put the limit in for testing. In my last test I passed 99999999 as the limit and everything worked.
I would recommend using a backend module to run this over 5 GB.
Use 'RowToJSON' to clean out invalid characters (i.e. anything that isn't UTF-8).
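As an alternative to the hard-coded replacement table in RowToJSON: the stray bytes it handles (\x92, \x93, \x94, ...) are Windows-1252 punctuation, so decoding with that codec covers them generically. A hedged sketch; `clean_text` is a made-up helper, not part of the script below:

```python
def clean_text(raw):
    """Decode Windows-1252 bytes (the source of \\x92-style smart quotes),
    replacing anything undecodable, then double embedded quotes as the
    original RowToJSON does."""
    text = raw.decode("cp1252", errors="replace")
    return text.replace('"', '""')
```

This keeps the smart quotes as proper Unicode instead of flattening them to ASCII, which BigQuery's JSON streaming accepts fine.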
I haven't tested it on 5 GB, but it was able to do 50k rows in about 20 seconds; the same load via CSV took over 2 minutes.
I wrote this to test things, so please excuse the bad coding practices and mini hacks. It works, so feel free to clean it up for any production-level work.
import MySQLdb
import logging
from apiclient.discovery import build
from oauth2client.appengine import AppAssertionCredentials
import httplib2
OAUTH_SCOPE = 'https://www.googleapis.com/auth/bigquery'
PROJECT_ID = ''         # fill in
DATASET_ID = ''         # fill in
TABLE_ID = ''           # fill in
SQL_DATABASE_NAME = ''  # fill in
SQL_DATABASE_DB = ''    # fill in
SQL_USER = ''           # fill in
SQL_PASS = ''           # fill in
def Connect():
    return MySQLdb.connect(unix_socket='/cloudsql/' + SQL_DATABASE_NAME, db=SQL_DATABASE_DB, user=SQL_USER, passwd=SQL_PASS)
def RowToJSON(cursor, row, fields):
    newData = {}
    for i, value in enumerate(row):
        try:
            if fields[i]["type"] == bqTypeDict["int"]:
                value = int(value)
            else:
                value = float(value)
        except (ValueError, TypeError):
            if value is not None:
                # Replace common Windows-1252 punctuation bytes and escape quotes
                value = value.replace("\x92", "'") \
                             .replace("\x96", "'") \
                             .replace("\x93", '"') \
                             .replace("\x94", '"') \
                             .replace("\x97", '-') \
                             .replace("\xe9", 'e') \
                             .replace("\x91", "'") \
                             .replace("\x85", "...") \
                             .replace("\xb4", "'") \
                             .replace('"', '""')
        newData[cursor.description[i][0]] = value
    return newData
def GetBuilder():
    return build('bigquery', 'v2', http=AppAssertionCredentials(scope=OAUTH_SCOPE).authorize(httplib2.Http()))
bqTypeDict = { 'int' : 'INTEGER',
'varchar' : 'STRING',
'double' : 'FLOAT',
'tinyint' : 'INTEGER',
'decimal' : 'FLOAT',
'text' : 'STRING',
'smallint' : 'INTEGER',
'char' : 'STRING',
'bigint' : 'INTEGER',
'float' : 'FLOAT',
'longtext' : 'STRING'
}
def BuildFeilds(table):
    conn = Connect()
    cursor = conn.cursor()
    cursor.execute("DESCRIBE %s;" % table)
    tableDecorator = cursor.fetchall()
    fields = []
    for col in tableDecorator:
        field = {}
        field["name"] = col[0]
        colType = col[1].split("(")[0]
        if colType not in bqTypeDict:
            logging.warning("Unknown type detected, using string: %s", str(col[1]))
        field["type"] = bqTypeDict.get(colType, "STRING")
        if col[2] == "YES":
            field["mode"] = "NULLABLE"
        fields.append(field)
    return fields
def SQLToBQBatch(table, limit=3000):
    logging.info("****************************************************")
    logging.info("Starting SQLToBQBatch. Got: Table: %s, Limit: %i" % (table, limit))
    bqDest = GetBuilder()
    fields = BuildFeilds(table)
    try:
        response = bqDest.datasets().insert(projectId=PROJECT_ID, body={'datasetReference':
                                            {'datasetId': DATASET_ID}}).execute()
        logging.info("Added Dataset")
        logging.info(response)
    except Exception as e:
        logging.info(e)
        if "Already Exists: " in str(e):
            logging.info("Dataset already exists")
        else:
            logging.error("Error creating dataset: %s", str(e))
    try:
        response = bqDest.tables().insert(projectId=PROJECT_ID, datasetId=DATASET_ID,
                                          body={'tableReference': {'projectId': PROJECT_ID,
                                                                   'datasetId': DATASET_ID,
                                                                   'tableId': TABLE_ID},
                                                'schema': {'fields': fields}}
                                          ).execute()
        logging.info("Added Table")
        logging.info(response)
    except Exception as e:
        logging.info(e)
        if "Already Exists: " in str(e):
            logging.info("Table already exists")
        else:
            logging.error("Error creating table: %s", str(e))
    conn = Connect()
    cursor = conn.cursor()
    logging.info("Starting load loop")
    count = -1
    cur_pos = 0
    total = 0
    batch_size = 1000
    while count != 0 and cur_pos < limit:
        count = 0
        if batch_size + cur_pos > limit:
            batch_size = limit - cur_pos
        sqlCommand = "SELECT * FROM %s LIMIT %i, %i" % (table, cur_pos, batch_size)
        logging.info("Running: %s", sqlCommand)
        cursor.execute(sqlCommand)
        data = []
        for row in cursor.fetchall():
            data.append({"json": RowToJSON(cursor, row, fields)})
            count += 1
        logging.info("Read complete")
        if count != 0:
            logging.info("Sending request")
            insertResponse = bqDest.tabledata().insertAll(
                projectId=PROJECT_ID,
                datasetId=DATASET_ID,
                tableId=TABLE_ID,
                body={"rows": data}).execute()
            cur_pos += batch_size
            total += count
            logging.info("Done %i, Total: %i, Response: %s", count, total, insertResponse)
            if "insertErrors" in insertResponse:
                # insertErrors is a list of {"index": ..., "errors": [...]} entries
                for insertError in insertResponse["insertErrors"]:
                    logging.error("Error inserting data index: %i", insertError["index"])
                    for error in insertError["errors"]:
                        logging.error(error)
        else:
            logging.info("No more rows")
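The DESCRIBE-to-schema mapping used in BuildFeilds can be exercised standalone without a database connection. A simplified sketch with made-up columns; `describe_to_fields` is a hypothetical helper mirroring the logic above:

```python
# Same type mapping as the script above, trimmed for illustration.
bq_type = {'int': 'INTEGER', 'bigint': 'INTEGER', 'varchar': 'STRING',
           'text': 'STRING', 'double': 'FLOAT', 'float': 'FLOAT'}

def describe_to_fields(described):
    # described: rows shaped like MySQL DESCRIBE output (name, type, nullable).
    fields = []
    for name, col_type, nullable in described:
        base = col_type.split("(")[0]          # strip "(255)" etc.
        field = {"name": name, "type": bq_type.get(base, "STRING")}
        if nullable == "YES":
            field["mode"] = "NULLABLE"
        fields.append(field)
    return fields
```

Unknown MySQL types fall back to STRING, matching the warning path in the original script.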