Airflow MySqlToGoogleCloudStorageOperator fails unexpectedly


I have the following code:

# {} is required by the operator: if the output is big, it is split into more files such as 1.json, 2.json, etc.
file_name = gcs_export_uri_template + '/' + TABLE_PREFIX + '_' + TABLE_NAME + '{}.json'
import_orders_op = MySqlToGoogleCloudStorageOperator(
    task_id='import_orders',
    mysql_conn_id='sqlcon',
    google_cloud_storage_conn_id='gcpcon',
    provide_context=True,
    sql=""" SELECT * FROM {{ params.table_name }} WHERE orders_id > {{ params.last_imported_id }} AND orders_id < {{ ti.xcom_pull('get_max_order_id') }} limit 10 """,
    params={'last_imported_id': LAST_IMPORTED_ORDER_ID, 'table_name' :  TABLE_NAME},
    bucket=GCS_BUCKET_ID,
    filename=file_name,
    dag=dag) 
It fails with:

[2018-10-08 09:09:38,830] {base_task_runner.py:98} INFO - Subtask: Traceback (most recent call last):
[2018-10-08 09:09:38,830] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/bin/airflow", line 27, in <module>
[2018-10-08 09:09:38,830] {base_task_runner.py:98} INFO - Subtask:     args.func(args)
[2018-10-08 09:09:38,830] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/dist-packages/airflow/bin/cli.py", line 392, in run
[2018-10-08 09:09:38,830] {base_task_runner.py:98} INFO - Subtask:     pool=args.pool,
[2018-10-08 09:09:38,830] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/dist-packages/airflow/utils/db.py", line 50, in wrapper
[2018-10-08 09:09:38,831] {base_task_runner.py:98} INFO - Subtask:     result = func(*args, **kwargs)
[2018-10-08 09:09:38,831] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/dist-packages/airflow/models.py", line 1493, in _run_raw_task
[2018-10-08 09:09:38,831] {base_task_runner.py:98} INFO - Subtask:     result = task_copy.execute(context=context)
[2018-10-08 09:09:38,831] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/operators/mysql_to_gcs.py", line 89, in execute
[2018-10-08 09:09:38,831] {base_task_runner.py:98} INFO - Subtask:     files_to_upload = self._write_local_data_files(cursor)
[2018-10-08 09:09:38,831] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/operators/mysql_to_gcs.py", line 134, in _write_local_data_files
[2018-10-08 09:09:38,832] {base_task_runner.py:98} INFO - Subtask:     json.dump(row_dict, tmp_file_handle)
[2018-10-08 09:09:38,832] {base_task_runner.py:98} INFO - Subtask:   File "/usr/lib/python2.7/json/__init__.py", line 189, in dump
[2018-10-08 09:09:38,832] {base_task_runner.py:98} INFO - Subtask:     for chunk in iterable:
[2018-10-08 09:09:38,832] {base_task_runner.py:98} INFO - Subtask:   File "/usr/lib/python2.7/json/encoder.py", line 434, in _iterencode
[2018-10-08 09:09:38,832] {base_task_runner.py:98} INFO - Subtask:     for chunk in _iterencode_dict(o, _current_indent_level):
[2018-10-08 09:09:38,832] {base_task_runner.py:98} INFO - Subtask:   File "/usr/lib/python2.7/json/encoder.py", line 390, in _iterencode_dict
[2018-10-08 09:09:38,832] {base_task_runner.py:98} INFO - Subtask:     yield _encoder(value)
[2018-10-08 09:09:38,833] {base_task_runner.py:98} INFO - Subtask: UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 5: invalid start byte
I can only assume the cause is the filename with {}.json: perhaps there are too many records, so the file needs to be split, and the split fails.

I am running Airflow 1.9.0.


What is the problem here?

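Your LIMIT 10 happens to return 10 rows of clean ASCII. A larger select, however, returns content that cannot be decoded as UTF-8. I ran into this when my MySQL connection had no extra settings.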
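To illustrate the error message itself, here is a minimal sketch (not taken from the operator) that reproduces the same decode failure. The sample byte strings and the helper name `utf8_error` are made up for illustration; the assumption is that a column value arrived as raw bytes in a single-byte encoding such as Latin-1, where 0xa0 is a non-breaking space, and that `json.dump` under Python 2 then tried to decode it as UTF-8.

```python
# Hypothetical sample values: one pure-ASCII, one containing byte 0xa0 at index 5.
ascii_value = b"order"
latin1_value = b"12345\xa0EUR"


def utf8_error(raw):
    """Return None if raw decodes as UTF-8, else (position, reason) of the failure."""
    try:
        raw.decode("utf-8")
        return None
    except UnicodeDecodeError as exc:
        return (exc.start, exc.reason)


# ASCII is a subset of UTF-8, so a result set with only ASCII rows never fails.
print(utf8_error(ascii_value))
# 0xa0 is a continuation byte in UTF-8 and cannot start a character.
print(utf8_error(latin1_value))
# The same bytes decode without error when read as Latin-1.
print(repr(latin1_value.decode("latin-1")))
```

This would explain why a small, ASCII-only result set succeeds while a larger one fails: the failure depends on the data in the rows, not on the number of rows or on the file splitting.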

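If you have no extras at all, edit your connection so that the Extra field contains {"charset": "utf8"}. If you already have extras, just add that key-value pair to the set.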


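This should establish the encoding for the MySQL client that the hook uses to retrieve records, and the values should then start decoding correctly. Whether they will then write to GCS is left as an exercise for you.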
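As a rough sketch of what the charset extra is used for: the connection's Extra field is a JSON string, and the hook turns the charset entry into keyword arguments for the MySQL client. The function below is a hypothetical stand-in written for illustration, not Airflow's actual code; the name `build_mysql_kwargs` is invented, and the assumption that a utf8 charset also switches on `use_unicode` is based on the behavior described in this thread.

```python
import json


def build_mysql_kwargs(extra_field):
    """Hypothetical helper: map a connection's Extra JSON to MySQL client kwargs."""
    extra = json.loads(extra_field) if extra_field else {}
    kwargs = {}
    charset = extra.get("charset")
    if charset:
        kwargs["charset"] = charset
        # Assumption: a UTF-8 charset also asks the client to return unicode strings.
        if charset.lower() in ("utf8", "utf-8"):
            kwargs["use_unicode"] = True
    return kwargs


# With no extras, nothing about the encoding is passed to the client.
print(build_mysql_kwargs(""))
# With the charset extra, the client is told which encoding to use.
print(build_mysql_kwargs('{"charset": "utf8"}'))
```

Note that the Extra field has to be valid JSON, so both the key and the value need double quotes.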

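What exactly is the extra? The table is defined as: DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

It is something that provides flexible kwargs to the hook that loads the connection, or it can just hold notes to self. MySqlHook specifically accepts a charset there, and it does two things with it, one of which is setting a flag to use unicode. I had my MySQL database set to use utf-8, but that did not mean the hook decoded utf-8 correctly until I set this. It is not really shown in the documentation, though.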