How to turn a Row back into a DataFrame

dataframe pyspark

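This was supposed to be a simple test: move the first row of my DataFrame into a new DataFrame.

First issue: df.first() returns a Row, not a DataFrame. Next issue: when I try spark.createDataFrame(df.first()), it tells me it cannot infer the schema.

Next issue: spark.createDataFrame(df.first(), df.schema) does not work either.

So, for the original schema below: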

root
 |-- entity_name: string (nullable = true)
 |-- field_name: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- data_row: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- data_schema: array (nullable = true)
 |    |-- element: string (containsNull = true)

I defined the schema in code like this:

from pyspark.sql.types import StructType, StructField, StringType, ArrayType

xyz_schema = StructType([
 StructField('entity_name',StringType(),True)
 ,StructField('field_name',ArrayType(StringType(),True),True)
 ,StructField('data_row',ArrayType(StringType(),True),True)
 ,StructField('data_schema',ArrayType(StringType(),True),True)
])

print(xyz.first())
xyz_1stRow = spark.createDataFrame(xyz.first(), xyz_schema)

The above does not work! I get the following error:

"TypeError: StructType can not accept object 'parquet/assignment/v1' in type <class 'str'>"

This is what the print shows:

Row(entity_name='parquet/assignment/v1', field_name=['ContractItemNumber', 'UPC', 'DC_ID', 'AssignDate', 'AssignID', 'AssignmentQuantity', 'ContractNumber', 'MaterialNumber', 'OrderReason', 'RequirementCategory', 'MSKU'], data_row=['\n 3501926604362962001,10/1/201984009248020191000,58400924801862291010711,V1\n\t\t\t\t', '\n 1801914547738382001,10/1/201984009248020191000,68400924801791301010711,V1\n\t\t\t'], data_schema=['StringType', 'StringType', 'StringType', None, 'StringType', 'IntegerType', 'StringType', 'StringType', 'StringType', 'StringType', 'StringType'])

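What am I doing wrong? Why doesn't StringType accept a string?

I am working in the current version of PySpark on Azure Databricks. I would prefer to stay in PySpark rather than use R or Scala, and not have to convert to pandas and risk the data being corrupted while converting between all of these.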

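According to the documentation, createDataFrame takes an RDD, a list, or a pandas.DataFrame and creates a DataFrame from it. So you have to put the result of df.first() in brackets to make it a list. See the example below: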

df = spark.createDataFrame(
    [('Galaxy', 2017, 27841, 17529),
     ('Galaxy', 2017, 29395, 11892),
     ('Novato', 2018, 35644, 22876),
     ('Novato', 2018, 8765, 54817)],
    ['model', 'year', 'price', 'mileage'])

bla = spark.createDataFrame([df.first()])
bla.show()

Output:

+------+----+-----+-------+ 
| model|year|price|mileage| 
+------+----+-----+-------+ 
|Galaxy|2017|27841|  17529| 
+------+----+-----+-------+
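
As for why the original call fails: a Row behaves like a tuple, so when a bare Row is passed as the data argument, createDataFrame appears to iterate over its fields and treat each field as a row of its own. The first field, the string 'parquet/assignment/v1', then gets checked against the whole StructType, which would explain the TypeError. Applied to the question's DataFrame, the fix might look like the sketch below. The helper name first_row_as_dataframe is hypothetical, and the sketch assumes an active SparkSession is passed in as spark.

```python
def first_row_as_dataframe(spark, df):
    """Return a one-row DataFrame holding the first row of `df`.

    Assumptions: `spark` is an active SparkSession and `df` is a PySpark
    DataFrame. The helper name is hypothetical, not part of PySpark.
    """
    first = df.first()  # a Row, or None when df has no rows

    # Wrap the Row in a list so it is treated as ONE row rather than as a
    # sequence of rows, and reuse the source schema so nothing is inferred.
    rows = [] if first is None else [first]
    return spark.createDataFrame(rows, df.schema)
```

With the question's names this would be called as xyz_1stRow = first_row_as_dataframe(spark, xyz). If collecting the row to the driver is not needed at all, xyz.limit(1) also returns a one-row DataFrame directly.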