Python: Feeding nullable data from BigQuery into TensorFlow Transform

python tensorflow google-bigquery

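We are trying to build a pipeline that takes data from BigQuery and runs it through TensorFlow Transform before training with TensorFlow.

The pipeline is up and running, but we are having difficulty handling null values coming from BigQuery.

We load from BigQuery using Beam: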

    raw_data = (pipeline
                | '{}_read_from_bq'.format(step) >> beam.io.Read(
                    beam.io.BigQuerySource(query=source_query,
                                           use_standard_sql=True,
                                           )))

I am using dataset metadata and have tried both `FixedLenFeature` and `VarLenFeature` for the various columns:

    import tensorflow as tf
    from tensorflow_transform.tf_metadata import dataset_metadata, schema_utils

    # Categorical feature schema: one required string per row
    categorical_features = {
        column_name: tf.io.FixedLenFeature([], tf.string) for column_name in categorical_columns
    }
    raw_data_schema.update(categorical_features)

    # Numerical feature schema: variable-length, so a row may hold zero values
    numerical_features = {
        column_name: tf.io.VarLenFeature(tf.float32) for column_name in numerical_columns
    }
    raw_data_schema.update(numerical_features)

    # Create dataset_metadata given raw_data_schema
    raw_metadata = dataset_metadata.DatasetMetadata(
        schema_utils.schema_from_feature_spec(raw_data_schema))

As expected, if you try to feed a BigQuery NULL into a `FixedLenFeature`, it breaks.

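However, it also breaks when I feed strings or integers into a `VarLenFeature`. This seems to be because `VarLenFeature` expects a list, whereas `BigQuerySource` yields a Python primitive. The exact failure is the `TypeError` traceback reproduced at the end of this post (it comes from my attempt with an integer).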

So it seems I need to pass a list into `VarLenFeature` for this to work, but `BigQuerySource` does not do that by default.
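One possible workaround on the Beam side, sketched here under assumptions, is to wrap each nullable column's value in a list before the rows reach TensorFlow Transform, so that NULL becomes an empty list and a primitive becomes a one-element list. The helper name `wrap_nullable_columns` and the sample rows are hypothetical. Whether this alone is enough for a given TFT version has not been verified here.

```python
def wrap_nullable_columns(row, columns):
    """Return a copy of a row dict with the given columns wrapped in lists.

    None (a BigQuery NULL) becomes [], a scalar becomes [scalar], and a value
    that is already a list or tuple is converted to a plain list.
    """
    wrapped = dict(row)
    for name in columns:
        value = wrapped.get(name)
        if value is None:
            wrapped[name] = []
        elif isinstance(value, (list, tuple)):
            wrapped[name] = list(value)
        else:
            wrapped[name] = [value]
    return wrapped


# Hypothetical rows, shaped like the dicts a BigQuery read yields.
rows = [{"age": 42, "country": "UK"}, {"age": None, "country": "US"}]
for row in rows:
    print(wrap_nullable_columns(row, ["age"]))
```

In a pipeline this could be applied right after the read with something like `beam.Map(wrap_nullable_columns, columns=numerical_columns)`, since `beam.Map` forwards extra keyword arguments to the function.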

Is there a simple way to achieve this? Or am I going about reading nullable columns from BigQuery completely the wrong way?

Many thanks in advance.

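You will probably need to handle null (missing) values yourself. For numeric columns, you can replace nulls with the mean or median. For categorical (string) columns, you can use a default value such as an empty string, or a new value that acts as a missing-value indicator.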

I am not very familiar with `VarLenFeature`, but you could probably replace the NULLs in your `source_query` (NULL imputation). Something like:

    IFNULL(col, col_mean) AS col_imputed

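The downside is that you first have to compute `col_mean` with SQL and then fill it in here as a constant. You also need to remember that mean and apply the same value at prediction time, because it is not part of tf.transform (your graph).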
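To make that caveat concrete, here is a minimal plain-Python sketch (no BigQuery or TensorFlow involved) of computing the fill value once on training data, persisting it, and reusing the same value at prediction time. The function names `fit_mean` and `impute`, and the JSON layout, are assumptions made for illustration only.

```python
import json


def fit_mean(values):
    """Mean of the non-null values; this is the constant to persist."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("cannot compute a mean from an all-null column")
    return sum(present) / len(present)


def impute(values, fill_value):
    """Replace None with the stored fill value, leaving other values alone."""
    return [fill_value if v is None else v for v in values]


# Training time: compute the mean once and store it next to the model.
training_column = [1.0, None, 3.0]
saved = json.dumps({"col_mean": fit_mean(training_column)})

# Prediction time: load the stored constant instead of recomputing it.
serving_column = [None, 5.0]
print(impute(serving_column, json.loads(saved)["col_mean"]))
```

Recomputing the mean on serving data would silently shift the feature, which is why the stored constant is loaded rather than derived again.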

BigQuery itself offers BQML as an ML platform, and it does have support for this. Maybe take a look at that as well :)

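Thanks a lot for your reply! We have implemented this as a fallback option. Our goal was to do the imputation in TensorFlow Transform, since that seems to be an intended use case, at least judging by the "census" example: see how they handle their `OPTIONAL_NUMERIC_FEATURE_KEYS`. But it does not seem to work for us in this case... We will do the imputation in BQ until we have time to sort out the `VarLenFeature` problem :)

We have settled on a compromise based on some experimentation and some further feedback. (1) Where imputation can be done in BQ, with the caveats you mentioned, we use `IFNULL(...)` in the SQL together with `FixedLenFeature`. (2) Where we need TFT to do the imputation, we convert the column to an array in BQ with `CASE WHEN col IS NULL THEN NULL ELSE ARRAY(SELECT col) END AS col`, and use `VarLenFeature`.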
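The two strategies in that compromise can be sketched as a small query builder. This is an illustration only: the helper `nullable_select_expr`, the column names, and the table name `my_project.my_dataset.my_table` are hypothetical, and the fill value is passed as ready-made SQL literal text rather than being escaped.

```python
def nullable_select_expr(column, fill_value=None):
    """Build the SELECT expression for one nullable column.

    With a fill value, impute in BigQuery (pair with FixedLenFeature).
    Without one, wrap the value in an array (pair with VarLenFeature).
    """
    if fill_value is None:
        return (
            "CASE WHEN {c} IS NULL THEN NULL "
            "ELSE ARRAY(SELECT {c}) END AS {c}".format(c=column)
        )
    return "IFNULL({c}, {v}) AS {c}".format(c=column, v=fill_value)


# Hypothetical columns: one imputed in SQL, one left for TFT to impute.
select_list = ",\n  ".join([
    nullable_select_expr("country", "''"),
    nullable_select_expr("age"),
])
source_query = "SELECT\n  {}\nFROM `my_project.my_dataset.my_table`".format(select_list)
print(source_query)
```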

Traceback referenced in the question:

    File "/usr/local/lib/python3.7/site-packages/tensorflow_transform/impl_helper.py", line 157, in <listcomp>
    indices = [range(len(value)) for value in values]
    TypeError: object of type 'int' has no len()
    [while running 'train_transform/AnalyzeDataset/ApplySavedModel[Phase0]/ApplySavedModel/ApplySavedModel']

    SparseTensorValue(indices=[(0, 0), (0, 1)], values=['U', 'K'], dense_shape=(1, 2))
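The sparse layout shown above, and the reason a bare integer fails, can be illustrated in plain Python. This is not TFT's actual implementation, only a sketch that mirrors the `range(len(value))` step from the traceback; the function name `to_sparse_parts` is hypothetical.

```python
def to_sparse_parts(batch):
    """Turn a batch of per-row lists into (indices, values, dense_shape)."""
    indices, values = [], []
    for row_index, row in enumerate(batch):
        # len() is what fails when a row is a bare scalar instead of a list.
        for position in range(len(row)):
            indices.append((row_index, position))
            values.append(row[position])
    longest = max((len(row) for row in batch), default=0)
    return indices, values, (len(batch), longest)


# One row holding two values, as in the SparseTensorValue above.
print(to_sparse_parts([["U", "K"]]))

# A bare integer row reproduces the same kind of TypeError.
try:
    to_sparse_parts([7])
except TypeError as error:
    print("TypeError:", error)
```

An empty list is a valid row here and simply contributes no indices, which is how a NULL can be represented once every value arrives as a list.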