
Pyspark writing data from Databricks to Azure SQL: ValueError: Some of types cannot be determined after inferring

Tags: pyspark, azure-databricks, apache-arrow

I am using pyspark to write data from Azure Databricks to Azure SQL. The code runs fine when there are no null values, but when the dataframe contains null values I get the following error:

databricks/spark/python/pyspark/sql/pandas/conversion.py:300: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
  Unable to convert the field Product. If this column is not necessary, you may consider dropping it or converting to primitive type before the conversion.
Context: Unsupported type in conversion from Arrow: null
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
  warnings.warn(msg)

ValueError: Some of types cannot be determined after inferring
The dataframe must be written to SQL including the null values. How can I solve this?

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

def to_sql(df, table):
  # Convert the pandas DataFrame to a Spark DataFrame and write it to the target table over JDBC
  finaldf = sqlContext.createDataFrame(df)
  finaldf.write.jdbc(url=url, table=table, mode="overwrite", properties=properties)

to_sql(data, f"TF_{table.upper()}")
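
For reference, the ValueError is raised by createDataFrame when it has to infer a type for a column that contains only nulls. A minimal sketch of an alternative approach (the column names and types below are placeholders, not taken from the original dataframe) is to pass an explicit schema so that no inference is needed:

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Placeholder schema - replace with the real column names and types
schema = StructType([
    StructField("Product", StringType(), True),   # nullable, so all-null columns are accepted
    StructField("Price", DoubleType(), True),
])

finaldf = sqlContext.createDataFrame(df, schema=schema)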
Edit:

Solved this by writing a function that maps the pandas data types to SQL data types and outputs the columns and their data types as a single string:

def convert_dtype(df):
    # Map pandas dtypes to SQL Server column types
    df_mssql = {'int64': 'bigint', 'object': 'varchar(200)', 'float64': 'float'}
    mydict = {}
    for col in df.columns:
        if str(df.dtypes[col]) in df_mssql:
            mydict[col] = df_mssql.get(str(df.dtypes[col]))
    # Build "col type," fragments and strip the trailing comma
    l = " ".join([str(k[0] + " " + k[1] + ",") for k in list(mydict.items())])
    return l[:-1]
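
For example, for an illustrative pandas DataFrame (not from the original post) with an int64, an object and a float64 column, the function produces:

import pandas as pd

sample = pd.DataFrame({"id": [1, 2], "Product": ["a", None], "Price": [1.5, None]})
convert_dtype(sample)
# 'id bigint, Product varchar(200), Price float'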
Passing this string to the createTableColumnTypes option solved the issue:

jdbcDF.write \
    .option("createTableColumnTypes", convert_dtype(df)) \
    .jdbc("jdbc:postgresql:dbserver", "schema.tablename",
          properties={"user": "username", "password": "password"})

For this, you need to specify the schema in the write statement. Here is an example from the documentation:

jdbcDF.write \
    .option("createTableColumnTypes", "name CHAR(64), comments VARCHAR(1024)") \
    .jdbc("jdbc:postgresql:dbserver", "schema.tablename",
          properties={"user": "username", "password": "password"})

Hi, thanks for your answer. I wrote a small function that maps the pandas data types to a string containing the columns and SQL data types. I will edit this into my post.